Automatic Analysis and Annotation of Document Keywords in Biomedical Literature


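Basic Information

  • Grant number:
    8344960
  • Principal investigator:
  • Amount:
    $239,800
  • Host institution:
  • Host institution country:
    United States
  • Project type:
  • Fiscal year:
  • Funding country:
    United States
  • Start and end dates:
  • Project status:
    Ongoing

Project Abstract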

1) Electronic Textbook and PubMed Central Indexing. Current processing of the electronic textbook material involves a number of steps designed to produce the most meaningful phrases in the text, to be used as reference points. The first task is to identify grammatically reasonable phrases. We use a version of the Brill transformation-based tagger, rewritten in C++, for part-of-speech tagging. This forms the basis for determining grammatically reasonable phrases. A significant post-processing step removes phrases that involve inappropriate references to context (e.g., different cells, final mutation). After finding grammatically reasonable phrases, we attempt to eliminate those that are too common or generic to be useful (e.g., significant result, short time). The next step is to compare a phrase with previously rated phrases that have been collected over the life of the project. The final stage is to estimate the importance of a phrase in the passage where it is found in a textbook. Such an estimate is based on the frequency of the phrase and the size of the passage compared with the frequency of the phrase throughout the book and the overall size of the book. To improve such an estimate, we attempt to take account of both the phrase itself and any phrase that represents the same concept. For this purpose we use the UMLS Metathesaurus together with stemming, and combine these two approaches into a consistent picture of the concept as it occurs in the text. The result of this processing is a scored list of phrase-book section pairs for each textbook. These are used to guide the response to general searches in the books. When a user types in a phrase that is on our curated list, the first results given are the highly rated book sections for that phrase. We are now applying a similar indexing scheme to the text of articles in PubMed Central. This allows us to give a list of highly rated phrases for each article as an enhanced reference point for searchers.
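
The abstract does not give the formula behind the importance estimate in item 1). As a minimal sketch, assuming the estimate compares the phrase's rate in the passage with its rate in the whole book, one could use a log-ratio. The function name `phrase_importance`, its parameters, and the log-ratio form are all hypothetical illustrations, not the project's actual method.

```python
import math


def phrase_importance(count_in_passage, passage_size, count_in_book, book_size):
    """Hypothetical score: log of (phrase rate in passage / phrase rate in book).

    Sizes are measured in tokens. A positive score means the phrase is more
    concentrated in the passage than in the book as a whole.
    """
    if passage_size <= 0 or book_size <= 0 or count_in_book <= 0:
        raise ValueError("sizes and the book-level count must be positive")
    if count_in_passage == 0:
        # The phrase never occurs in this passage, so it cannot be important there.
        return float("-inf")
    passage_rate = count_in_passage / passage_size
    book_rate = count_in_book / book_size
    return math.log(passage_rate / book_rate)


# A phrase seen 5 times in a 1,000-token section and 10 times in a
# 100,000-token book is 50 times denser in the section than in the book.
score = phrase_importance(5, 1000, 10, 100000)
```

Counts for a concept, rather than a single surface phrase, could be obtained by summing over all phrases mapped to that concept before calling the function; that mapping (UMLS Metathesaurus plus stemming in the abstract) is outside this sketch.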
2) A significant fraction of queries in PubMed are multiterm queries, and PubMed generally handles them as a Boolean conjunction of the terms. However, analysis of queries in PubMed indicates that many such queries are meaningful phrases rather than simply collections of terms. We have examined whether retrieval quality differs when such queries are interpreted as a phrase rather than as a conjunction of query terms and, if it does, what the optimal way of searching with such queries is. To address the question, we developed an automated retrieval evaluation method, based on machine learning techniques, that enables us to evaluate and compare various retrieval outcomes. We show that the classes of records that contain all the search terms, but not the phrase, differ qualitatively from the class of records containing the phrase. We also show that the difference is systematic, depending on the proximity of the query terms to each other within the record. Based on these results, one can establish the best retrieval order for the records. Our findings are consistent with studies in proximity searching. The important insight here for indexing is that in some cases where the words of a phrase occur in text, but not as the phrase, the phrase may still be an appropriate concept to use in indexing the text.
3) Currently we are studying how good phrases can be recognized by their characteristics, such as frequency, tendency to be repeated in the documents where they occur, and other numerical properties. These features allow one to predict which phrases are of high quality. We have found such predictions useful in studying the different kinds of terms that may appear in text and how an ontology might be extracted from text.
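
The proximity-dependent retrieval order described in item 2) can be illustrated with a small sketch. It assumes, hypothetically, that records containing the exact phrase come first and that the remaining records matching the Boolean conjunction are ordered by the smallest token window containing every query term. The abstract does not specify the actual ordering, and all function names here are illustrative.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens, query):
    """True if the query tokens occur contiguously and in order."""
    n = len(query)
    return any(tokens[i:i + n] == query for i in range(len(tokens) - n + 1))


def min_window(tokens, query):
    """Size of the smallest token window holding every query term, or None."""
    needed = set(query)
    if not needed:
        return None
    counts = {}
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink from the left while the window still covers all terms.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            dropped = tokens[left]
            if dropped in needed:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    del counts[dropped]
            left += 1
    return best


def rank_records(records, query_text):
    """Order records: exact phrase first, then by increasing window size."""
    query = tokenize(query_text)
    scored = []
    for record in records:
        tokens = tokenize(record)
        width = min_window(tokens, query)
        if width is None:
            continue  # Fails the Boolean conjunction, so it is not retrieved.
        scored.append((not contains_phrase(tokens, query), width, record))
    scored.sort(key=lambda item: (item[0], item[1]))
    return [record for _, _, record in scored]


records = [
    "The gene was silenced long before any therapy was attempted",
    "Protein folding",
    "Therapy targeting a gene",
    "Gene therapy for cystic fibrosis",
]
ranked = rank_records(records, "gene therapy")
```

In this example the record containing the phrase "gene therapy" ranks first, the record with the two terms four tokens apart ranks second, the record with the terms far apart ranks last, and the record missing both terms is not retrieved.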

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Willy Wilbur

A Document Processing System
  • Grant number:
    8344939
  • Fiscal year:
  • Award amount:
    $239,800
  • Project type:
General and Semi-supervised Machine Learning Applied to Bioinformatics
  • Grant number:
    8558105
  • Fiscal year:
  • Award amount:
    $239,800
  • Project type:
Natural Language Processing Techniques To Enhance Information Access.
  • Grant number:
    8943224
  • Fiscal year:
  • Award amount:
    $239,800
  • Project type:
PubMed Query Log Analysis and Use in Access Enhancement
  • Grant number:
    7969244
  • Fiscal year:
  • Award amount:
    $239,800
  • Project type:
Automatic Bayesian Methods In Text Retrieval
  • Grant number:
    8149591
  • Fiscal year:
  • Award amount:
    $239,800
  • Project type:
A Document Processing System
  • Grant number:
    8149592
  • Fiscal year:
  • Award amount:
    $239,800
  • Project type:
General and Semi-supervised Machine Learning Applied to Bioinformatics
  • Grant number:
    8149602
  • Fiscal year:
  • Award amount:
    $239,800
  • Project type:
A Document Processing System
  • Grant number:
    9160906
  • Fiscal year:
  • Award amount:
    $239,800
  • Project type:
A Document Processing System
  • Grant number:
    7969199
  • Fiscal year:
  • Award amount:
    $239,800
  • Project type:
General and Semi-supervised Machine Learning Applied to Bioinformatics
  • Grant number:
    8344948
  • Fiscal year:
  • Award amount:
    $239,800
  • Project type:

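Similar National Natural Science Foundation of China (NSFC) grants

Research on neuromorphic vision object recognition algorithms driven by spatiotemporal sequences
  • Grant number:
    61906126
  • Approval year:
    2019
  • Award amount:
    CNY 240,000
  • Project type:
    Young Scientists Fund
Ontology-driven spatial semantic modeling of address data and address matching methods
  • Grant number:
    41901325
  • Approval year:
    2019
  • Award amount:
    CNY 220,000
  • Project type:
    Young Scientists Fund
Research on optimized design of address mapping tables and memory access optimization for large-capacity solid-state drives
  • Grant number:
    61802133
  • Approval year:
    2018
  • Award amount:
    CNY 230,000
  • Project type:
    Young Scientists Fund
Research on IP-address-driven multipath routing and traffic transmission control
  • Grant number:
    61872252
  • Approval year:
    2018
  • Award amount:
    CNY 640,000
  • Project type:
    General Program
Research on memory-safety defense techniques for objects targeted by memory attacks
  • Grant number:
    61802432
  • Approval year:
    2018
  • Award amount:
    CNY 250,000
  • Project type:
    Young Scientists Fund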

Similar overseas grants

Antibody-based therapy for fentanyl-related opioid use disorder
  • Grant number:
    10831206
  • Fiscal year:
    2023
  • Award amount:
    $239,800
  • Project type:
Developing user-centric training in rigorous research: post-selection inference, publication bias, and critical evaluation of statistical claims.
  • Grant number:
    10721491
  • Fiscal year:
    2023
  • Award amount:
    $239,800
  • Project type:
The corner liquor store: race, retail, and health risk in urban African American communities
  • Grant number:
    10395428
  • Fiscal year:
    2021
  • Award amount:
    $239,800
  • Project type:
The corner liquor store: race, retail, and health risk in urban African American communities
  • Grant number:
    10633084
  • Fiscal year:
    2021
  • Award amount:
    $239,800
  • Project type:
The corner liquor store: race, retail, and health risk in urban African American communities
  • Grant number:
    10115373
  • Fiscal year:
    2021
  • Award amount:
    $239,800
  • Project type: