Towards the Building of a Comprehensive Searchable Biological Experiment Database

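Basic Information

Grant number:
7314689
Principal investigator:
HONG YU
Award amount:
$230,100
Host institution:
UNIVERSITY OF WISCONSIN MILWAUKEE
Host institution country:
United States
Project category:
Fiscal year:
2007
Funding country:
United States
Project period:
2007-12-01 to 2009-11-30
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/7314689
Keywords: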
Adoption Advanced Development Algorithms Binding Biological Biomedical Research Categories Classification Data Databases Development Dictionary Documentation Flowcharts Genes Genome Hybrids Image Index Medicus Information Retrieval Information Retrieval Systems Information Systems Link Literature Machine Learning Maps Natural Language Processing Online Mendelian Inheritance In Man Ontology Polymerase Chain Reaction Principal Investigator Property Protocols documentation PubMed Publications Reporting Research Science Scientist Specific qualifier value Standards of Weights and Measures SwissProt System Techniques Technology Testing Text Title United States National Academy of Sciences United States National Library of Medicine Western Blotting abstracting base knowledge base novel open source programs rapid growth research study tool

Project Abstract

DESCRIPTION (provided by applicant): The rapid growth of the biomedical literature and the expansion in disciplinary biomedical research, heralded by high-throughput genome sciences and technologies, have overwhelmed scientists who attempt to assimilate the information necessary for their research. The widespread adoption of title/abstract word searches, such as the National Library of Medicine's highly desirable PubMed system, has provided the first major advance in the way bioscientists find relevant publications since the origin of Index Medicus in 1879 (Hunter and Cohen 2006). The importance of developing valid information retrieval systems for bioscientists has led to the development of information systems worldwide (e.g., Arrowsmith (Smalheiser and Swanson 1998), BioText (Hearst 2003), GeneWays (Friedman et al. 2001; Rzhetsky et al. 2004), iHOP (Hoffmann and Valencia 2005), and BioMedQA (Lee et al. 2006a)) and of annotated databases (e.g., SWISSPROT, OMIM (Hamosh et al. 2005), and BIND (Alfarano et al. 2005)). However, most information systems target only text and fail to provide access to other important data such as images (e.g., figures). More than any other documentation, figures usually represent the "evidence" of discovery in the biomedical literature. Full-text biological articles nearly always incorporate figures/images that are crucial content of the biomedical literature. Our examination of biological articles in the Proceedings of the National Academy of Sciences (PNAS) revealed an average of 5.2 images per article (Yu and Lee 2006a). Biologists need to access image data to validate research facts and to formulate or test novel research hypotheses. Evaluations have shown that textual statements reported in the literature are frequently noisy (i.e., they contain "false facts") (Krauthammer et al. 2002).
Capturing images that are experimental "evidence" to support the textual "fact" will benefit bioscience information systems, databases, and bioscientists. Unfortunately, this wealth of information remains virtually inaccessible without automatic systems to organize these images. We propose the development of advanced natural language processing (NLP) tools to semantically organize images. We hypothesize that the text associated with an image semantically entails the image content, and that natural language processing techniques can be developed to accurately associate text with the corresponding images. Furthermore, we hypothesize that images can be semantically organized by categories specified by standard biological ontologies, and that natural language processing approaches can accurately assign the ontological categories to images. Our specific aims are:
Aim 1: To develop and evaluate NLP techniques for identifying textual statements that correspond to images in full-text articles. We will develop different approaches for two types of association. We will first propose rule-based and statistical approaches to identify the associated text that appears in the full-text articles. We will then develop hybrid approaches to link sentences in abstracts to images in the body of the articles.
Aim 2: To develop and evaluate NLP techniques for automatic classification of experimental results into categories (e.g., Western-Blot, PCR verification, etc.) specified in the experimental protocol database Protocol-Online. We will explore the use of dictionary-based, rule-based, image classification, and machine-learning approaches for accomplishing this aim.
Aim 3: To develop and evaluate NLP techniques for automatic assignment of Gene Ontology categories to experiments, which will provide a knowledge-based organization of experiments according to biological properties (e.g., catalytic activity). We will develop statistical and machine-learning approaches for accomplishing this aim.
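As a rough illustration of the simplest techniques named in Aims 1 and 2, the sketch below links sentences to figures through explicit mentions such as "Fig. 1" (a rule-based association) and labels a figure caption by keyword lookup (a dictionary-based classification). It is not the applicants' method. The function names, the regular expressions, and the small term dictionary are all assumptions made for illustration, and the naive sentence splitter would need replacing for real articles.

```python
import re
from collections import defaultdict

# Assumed pattern for explicit figure mentions: "Fig. 2", "Figure 3A", "Figs. 1".
FIGURE_MENTION = re.compile(r"\bfig(?:ure)?s?\.?\s*(\d+)", re.IGNORECASE)

# Naive splitter: break after ., !, or ? only when an uppercase letter follows,
# so "Fig. 1" and "et al. 2006" are not split.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Hypothetical dictionary of experiment categories and trigger terms.
EXPERIMENT_TERMS = {
    "Western blot": ("western blot", "immunoblot"),
    "PCR": ("pcr", "polymerase chain reaction"),
}


def link_sentences_to_figures(text):
    """Map each figure number to the sentences that explicitly mention it."""
    links = defaultdict(list)
    for sentence in SENTENCE_BREAK.split(text.strip()):
        for number in {int(n) for n in FIGURE_MENTION.findall(sentence)}:
            links[number].append(sentence)
    return dict(links)


def classify_caption(caption):
    """Return the sorted category labels whose trigger terms occur in the caption."""
    lowered = caption.lower()
    return sorted(
        label
        for label, terms in EXPERIMENT_TERMS.items()
        if any(term in lowered for term in terms)
    )


if __name__ == "__main__":
    body = (
        "Protein X was detected in treated cells (Fig. 1). "
        "Expression increased after induction. "
        "PCR confirmed the insert (Figure 2A)."
    )
    print(link_sentences_to_figures(body))
    print(classify_caption("Western blot analysis of protein X."))
```

Sentences with no explicit figure mention, such as the second one in the example, stay unlinked. Cases like these, along with abstract sentences that never cite a figure, are where the statistical and hybrid approaches described in the aims would be needed.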
We found that most of the images that appear in full-text biological articles are figure images (Yu and Lee 2006a), and we therefore focus only on figure images in this proposal. The deliverable of Specific Aim 1 will be an effective user interface, BioEx, from which bioscientists can access images directly from sentences in the abstracts. BioEx promises to improve on the traditional single-document-per-article format that has dominated bioscience publications since the first scientific article appeared in 1665 (Gross 2002). The deliverables of Specific Aims 2 and 3 will be open-source algorithms and tools that accurately map images to categories specified by the Gene Ontology and Protocol-Online. Those algorithms and tools will enhance bioscience information retrieval, information extraction, summarization, and question answering.
