Natural Language Processing Techniques To Enhance Information Access.


Basic Information

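Grant number:
8943224
Principal investigator:
Willy Wilbur
Amount:
$561,500
Host institution:
NATIONAL LIBRARY OF MEDICINE
Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Start and end dates:
to
Project status:
Not yet concluded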

Project Abstract

Recently we have been involved in several subprojects that use natural language processing techniques:

1) We have developed a machine learning algorithm for identifying abbreviation definitions in text that makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly mixing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples, and the learned feature weights are then used to identify the abbreviation's full form. This approach does not require manually labeled training data. We evaluated the algorithm on the Ab3P, BIOADI and Medstract corpora. Our results compare favorably with those of the existing Ab3P and BIOADI systems: we achieve an F-measure of 91.36% on the Ab3P corpus and 87.13% on the BIOADI corpus, both superior to the results reported by the Ab3P and BIOADI systems. Moreover, we outperform these systems in terms of recall, which is one of our goals.

2) We are studying paraphrases in MEDLINE abstracts. These arise when an author describes some entity of interest with a phrase such as "drug abuse" and then, needing to describe the same entity again a sentence or two later, prefers not to repeat the exact wording and instead uses a variant such as "drug use", which in the context of "drug abuse" has substantially the same meaning.

3) An author disambiguation algorithm has been developed that relies on machine learning. It is based on the assumption that if an author name is infrequent in the data, it probably represents the same person in all documents where it is found. This gives us positive instances. Negative instances are sampled from pairs of documents that have no author in common. Such positive and negative data allow us to do machine learning on all aspects of the document other than the name in question, and so to learn how to weight these data for best performance in distinguishing the positive instances from the negative ones. This learning is then applied in individual name cases or spaces to determine which author-document pairs represent the same author.

4) We are using the output of dependency parsers and syntactic parsers to create features for improved machine learning and also to automatically find good titles for document clusters.
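
The naturally labeled training data described for the abbreviation subproject could be assembled roughly as follows. This is a minimal sketch and not the project's implementation: the regular expression, the `max_words` window, and the function names `candidate_pairs` and `mixed_negatives` are all assumptions made for illustration, and the feature extraction and the learner itself are left out because the abstract does not describe them.

```python
import random
import re

# Assumed candidate pattern: a short parenthesised token preceded by some text.
# The abstract does not say how potential pairs are actually detected.
PAREN = re.compile(r"([^()]+?)\s*\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def candidate_pairs(sentences, max_words=8):
    """Collect (abbreviation, preceding text window) pairs as they occur in text.

    These naturally occurring pairs serve as positive examples.
    """
    pairs = []
    for sentence in sentences:
        for match in PAREN.finditer(sentence):
            words = match.group(1).split()
            window = " ".join(words[-max_words:])
            if window:
                pairs.append((match.group(2), window))
    return pairs


def mixed_negatives(pairs, seed=0):
    """Pair each abbreviation with a text window taken from a different pair.

    Random mixing yields abbreviation/definition pairs that are very likely
    unrelated, which serve as negative examples.
    """
    rng = random.Random(seed)
    negatives = []
    for i, (abbr, _) in enumerate(pairs):
        others = [j for j in range(len(pairs))
                  if j != i and pairs[j][0] != abbr]
        if others:
            negatives.append((abbr, pairs[rng.choice(others)][1]))
    return negatives


if __name__ == "__main__":
    text = [
        "We measured body mass index (BMI) in all patients.",
        "Magnetic resonance imaging (MRI) was performed.",
        "Levels of tumor necrosis factor (TNF) rose.",
    ]
    positives = candidate_pairs(text, max_words=3)
    negatives = mixed_negatives(positives)
    # Label 1 for natural pairs, 0 for randomly mixed pairs.
    labeled = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    for example in labeled:
        print(example)
```

Any classifier that exposes per-feature weights could then be trained on the labeled pairs. No manual annotation is involved at any point, which is the property the abstract emphasizes.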

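
The way positive and negative instances are obtained for author disambiguation (item 3 of the abstract) can be sketched in the same spirit. Everything below is an illustrative assumption and not the project's code: the `docs` mapping from document id to a set of author names, the `max_name_count` threshold that defines an "infrequent" name, and the function names are hypothetical.

```python
import itertools
import random
from collections import defaultdict


def positive_pairs(docs, max_name_count=3):
    """Document pairs that share an infrequent author name.

    Under the abstract's assumption, a rare name most likely denotes the
    same person wherever it appears, so such pairs are treated as positives.
    """
    by_name = defaultdict(list)
    for doc_id, authors in docs.items():
        for name in authors:
            by_name[name].append(doc_id)
    pairs = set()
    for name, ids in by_name.items():
        if 2 <= len(ids) <= max_name_count:
            for a, b in itertools.combinations(sorted(ids), 2):
                pairs.add((a, b, name))
    return sorted(pairs)


def negative_pairs(docs, n, seed=0):
    """Sample up to n document pairs that have no author name in common."""
    rng = random.Random(seed)
    disjoint = [(a, b)
                for a, b in itertools.combinations(sorted(docs), 2)
                if not (docs[a] & docs[b])]
    return rng.sample(disjoint, min(n, len(disjoint)))


if __name__ == "__main__":
    docs = {
        "d1": {"Zyx Q", "Smith J"},
        "d2": {"Zyx Q", "Lee K"},
        "d3": {"Smith J"},
        "d4": {"Smith J", "Lee K"},
        "d5": {"Smith J"},
        "d6": {"Smith J"},
        "d7": {"Wu A"},
    }
    print(positive_pairs(docs))
    print(negative_pairs(docs, n=4))
```

Enumerating all document pairs is quadratic and only suitable for a toy example; a real collection would need sampling without full enumeration. Features for the learner would be drawn from the rest of each document and exclude the name in question, as the abstract states.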