Free Text Gene Name Recognition
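Basic information
- Grant number: 8149604
- Principal investigator:
- Amount: $195,900
- Host institution:
- Host institution country: United States
- Project type:
- Fiscal year:
- Funding country: United States
- Start and end dates:
- Project status: Not yet concluded
- Source:
- Keywords:
Project abstract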
Named entity recognition is an important problem in the semantic processing of natural language text. It appears to be inherently more difficult in biology than it proved to be in business applications or news story analysis, as in the MUC conferences. Our interest in the issue stems from its potential importance for indexing and retrieving information about a particular gene or protein. However, truly high-quality named entity recognition in biology would have many applications as a starting point for semantic analysis. In past work on this problem we developed ABGene, a tagger for gene/protein name recognition in text, and subsequently a database of 20,000 sentences annotated for the occurrence of gene/protein names. The first 15,000 of these sentences formed the basis of the gene/protein mention recognition task for the BioCreative I Workshop held in 2004. After that workshop the whole 20,000-sentence corpus was revised by 1) removing tokenization and instead providing the text of the original sentence; 2) changing the annotations to be character based instead of token based; 3) revising the annotation guidelines to deal with some of the problems that had become apparent in the workshop; and 4) correcting some erroneous annotations that had come to our attention. The resulting data has become known as the GENETAG corpus. It has at least one unique property: many of the annotated entities have alternative annotations associated with them, so that more than one answer is correct for a particular entity. We believe this is important because many entities can be annotated in more than one way, and for quite a number there is no clear single correct answer.
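To make the idea of character-based annotations with accepted alternatives concrete, here is a minimal Python sketch. The sentence, the offsets, and the names `Span` and `is_correct` are invented for illustration; they are not taken from GENETAG or its file format.

```python
from typing import Set, Tuple

# (start, end) character offsets into the raw sentence, end exclusive.
Span = Tuple[int, int]

def is_correct(pred: Span, primary: Span, alternatives: Set[Span]) -> bool:
    """A prediction counts as correct if it equals the primary annotation
    or any one of the accepted alternative annotations."""
    return pred == primary or pred in alternatives

# Hypothetical example sentence; annotations refer to the untokenized text.
sentence = "The human p53 protein binds DNA."
primary = (10, 21)                  # "p53 protein"
alternatives = {(10, 13), (4, 21)}  # "p53", "human p53 protein"

hit = is_correct((10, 13), primary, alternatives)    # an accepted alternative
miss = is_correct((14, 21), primary, alternatives)   # "protein" alone is not
```

Because the offsets index the original sentence, a system is free to tokenize however it likes and still be scored against the same annotations.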
In 2005 we were invited to co-organize BioCreative II and to take responsibility for the gene mention recognition task. For this purpose we released the first 15,000 sentences of GENETAG as practice and training data and used the last 5,000 sentences for testing. Whereas 14 teams participated in BioCreative I, 21 teams participated in BioCreative II. The top F score on the gene/protein mention task in BioCreative I was 83.2%, while the top score in BioCreative II was 87.2%. Because there were some changes in the annotation guidelines and some corrections to the data, one cannot say definitively how much progress this represents, but it does suggest progress. Conditional random fields were much more commonly used in BioCreative II, and new approaches to the use of unannotated data also appeared. We analyzed the annotations provided by all the participants and applied a conditional random fields approach to learn how to combine all the predictions into an improved prediction. For this we used 200-fold cross validation. We were able to achieve a balanced F score of 90.7%. This indicates that there is still room for improvement in how individual systems perform on the problem of gene/protein mention detection. (with Larry Smith and Lorrie Tanabe).
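For readers unfamiliar with the metric, the balanced F score quoted above is the harmonic mean of precision and recall. The short Python sketch below shows the arithmetic; the function name and the counts are made up for illustration and are not BioCreative results.

```python
def balanced_f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall (the balanced F score, F1)."""
    if true_pos == 0:
        return 0.0  # no correct mentions: precision and recall are both zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Invented counts: 80 correct mentions, 20 spurious, 20 missed.
score = balanced_f_score(true_pos=80, false_pos=20, false_neg=20)
```

With these counts precision and recall are both 0.8, so the balanced F score is also 0.8.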
We have become convinced that more information about the different types of entities that occur in MEDLINE sentences can be used to improve name recognition. This has led us to design a set of semantic categories and to attempt to fill these categories with actual names harvested from databases and web sites. We call the result SEMCAT. It currently recognizes seventy-five categories and contains about four million name strings distributed over those categories. We have experimented with probabilistic context-free grammars (PCFGs) and Markov models of text strings in an attempt to learn how to recognize the entities in different categories. However, the best approach we have found for distinguishing gene/protein names from everything else is a new algorithm we term a priority model. Every token associated with any name in SEMCAT has two probabilities associated with it. The first is the probability that the token indicates it is part of a gene/protein name, and the second measures how reliable the token is as an indicator. With this model, given a phrase, one can compute an estimate of the probability that the phrase is a gene/protein name. We find that with the priority model we can achieve an F score of 96%, compared with 95% for our best PCFG approach. (with Lorrie Tanabe).
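The abstract does not give the formula that combines the two per-token probabilities. The Python sketch below shows one plausible combination rule, assuming that tokens further to the right take priority: each token decides with probability equal to its reliability, and otherwise defers to the token on its left. The rule, the function name `phrase_probability`, the default values, and all numbers are assumptions for illustration, not the authors' published parameters.

```python
from typing import Dict, List, Tuple

# token -> (p, q): p is the probability that the token indicates a
# gene/protein name, q is how reliable the token is as an indicator.
TokenParams = Dict[str, Tuple[float, float]]

def phrase_probability(tokens: List[str], params: TokenParams,
                       default: Tuple[float, float] = (0.5, 0.1)) -> float:
    """Estimate the probability that a phrase is a gene/protein name.

    Assumed rule: scanning left to right, each token overrides the running
    estimate with weight q and keeps the earlier estimate with weight 1 - q,
    so the rightmost reliable token dominates the result.
    """
    if not tokens:
        raise ValueError("phrase must contain at least one token")
    prob, _ = params.get(tokens[0], default)  # leftmost token is the fallback
    for tok in tokens[1:]:
        p, q = params.get(tok, default)
        prob = q * p + (1.0 - q) * prob
    return prob

# Invented parameters for a toy vocabulary.
params: TokenParams = {"p53": (0.95, 0.9), "protein": (0.9, 0.8)}
estimate = phrase_probability(["p53", "protein"], params)  # 0.8*0.9 + 0.2*0.95
```

Under this rule a highly reliable head word such as "protein" largely settles the classification, while unreliable tokens pass the decision along to their neighbors.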
The top performance for gene mention recognition in BioCreative II was achieved by Rie Ando of IBM, who introduced a technique called alternating structure optimization. The approach sets up many auxiliary labeling problems that resemble named entity tagging but simply try to predict the occurrence of names or tokens from the surrounding textual context. Once the SVM weight vectors for these auxiliary problems have been learned, one performs a singular value decomposition and subtracts from each vector its first h components in the decomposition. This subtraction is used only to decrease the penalty in the regularization term of the cost function. The weight vectors are then relearned, and the process is repeated until convergence. The final result is a set of h components of the decomposition of the many weight vectors. One uses these components to enhance learning on the actual named entity recognition task. The procedure is somewhat complicated and difficult to use. We are studying how to use a similar approach, but with a simpler method of applying the auxiliary learning to improve named entity recognition. One open problem is how to combine such auxiliary learning with the SEMCAT data.
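The role of the singular value decomposition can be illustrated with a heavily simplified, single-pass Python sketch using NumPy. It skips the iterative relearning described above, and it uses random numbers in place of learned SVM weight vectors; the function names `shared_structure` and `augment` are hypothetical.

```python
import numpy as np

def shared_structure(aux_weights: np.ndarray, h: int) -> np.ndarray:
    """Return the top-h shared directions of the auxiliary weight vectors.

    aux_weights has shape (n_features, n_aux_problems), one column per
    auxiliary predictor. The result has shape (h, n_features).
    """
    u, _singular_values, _vt = np.linalg.svd(aux_weights, full_matrices=False)
    return u[:, :h].T  # rows are orthonormal directions in feature space

def augment(x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Append the h projected features theta @ x to the original features."""
    return np.concatenate([x, theta @ x])

rng = np.random.default_rng(0)
aux_weights = rng.normal(size=(6, 4))  # placeholder for learned weights
theta = shared_structure(aux_weights, h=2)
x = rng.normal(size=6)                 # one feature vector for the target task
z = augment(x, theta)                  # 6 original + 2 shared-structure features
```

The target tagger would then be trained on the augmented vectors, which is one simple way the h components could "enhance the learning" on the actual task.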
We recently co-chaired the BioCreative III Workshop, in which the main competitive tasks were: to find gene mentions in a full-text article, map them to their GenBank identifiers, and score them as to reliability; to classify PubMed records as likely to represent articles containing information on protein-protein interactions; and to find the text in full papers that describes the method an experimenter used to verify a protein-protein interaction. We organized the first of these tasks and participated in the second. In the second task we used the priority model to locate protein mentions, and it proved very successful and competitive with other approaches.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Willy Wilbur
General and Semi-supervised Machine Learning Applied to Bioinformatics
- Grant number: 8558105
- Fiscal year:
- Funding amount: $195,900
- Project type:
Natural Language Processing Techniques To Enhance Information Access.
- Grant number: 8943224
- Fiscal year:
- Funding amount: $195,900
- Project type:
Automatic Analysis and Annotation of Document Keywords in Biomedical Literature
- Grant number: 8344960
- Fiscal year:
- Funding amount: $195,900
- Project type:
PubMed Query Log Analysis and Use in Access Inhancement
- Grant number: 7969244
- Fiscal year:
- Funding amount: $195,900
- Project type:
General and Semi-supervised Machine Learning Applied to Bioinformatics
- Grant number: 8149602
- Fiscal year:
- Funding amount: $195,900
- Project type:
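Similar grants from the National Natural Science Foundation of China (NSFC)
Molecular mechanism by which Shenqi Pill delays the decline in proliferative function of aging small-intestinal stem cells, studied through regulation of the Wnt/β-catenin pathway and the stem cell niche
- Grant number: 81873349
- Year approved: 2018
- Funding amount: CNY 520,000
- Project type: General Program
Effects of reduplicated-syllable brand names on consumer perception and attitudes
- Grant number: 71702134
- Year approved: 2017
- Funding amount: CNY 180,000
- Project type: Young Scientists Fund
Research on fast lookup techniques for large-scale name data in the data plane of information-centric networking
- Grant number: 61602346
- Year approved: 2016
- Funding amount: CNY 210,000
- Project type: Young Scientists Fund
Identification of neutralizing epitopes of fully human monoclonal antibodies against H7N9 avian influenza virus and study of their neutralization mechanism
- Grant number: 81501793
- Year approved: 2015
- Funding amount: CNY 180,000
- Project type: Young Scientists Fund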
Similar overseas grants
Role of Pcpe2 in Adipose Tissue Remodeling and Lipoprotein Metabolism
- Grant number: 10837655
- Fiscal year: 2023
- Funding amount: $195,900
- Project type:
Annotating dark ion-channel functions using evolutionary features, machine learning and knowledge graph mining
- Grant number: 10457684
- Fiscal year: 2022
- Funding amount: $195,900
- Project type:
Advanced End-to-End Relation Extraction with Deep Neural Networks
- Grant number: 10386881
- Fiscal year: 2020
- Funding amount: $195,900
- Project type:
Advanced End-to-End Relation Extraction with Deep Neural Networks
- Grant number: 10200889
- Fiscal year: 2020
- Funding amount: $195,900
- Project type:
Advanced End-to-End Relation Extraction with Deep Neural Networks
- Grant number: 10615695
- Fiscal year: 2020
- Funding amount: $195,900
- Project type: