Text Analytics, Knowledge Engineering, & High Performance Computing


Basic Information

Project number:
8565613
Principal investigator:
Calvin A. Johnson
Funding amount:
$2,726,900
Host institution:
CENTER FOR INFORMATION TECHNOLOGY
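Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Start and end dates:
to
Project status:
Ongoing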

Source:
https://reporter.nih.gov/project-details/8565613
Keywords:
3-Dimensional AIDS/HIV problem Address Aging Algorithms Alzheimer's Disease Antimicrobial Resistance Applications Grants Architecture Biological Biological Assay Biological Markers Biomedical Research Cataloging Catalogs Categories Classification Clinical Clinical Data Clinical Informatics Code Collaborations Collection Communication Communities Complex Computer Analysis Computing Methodologies Coupled Critiques Custom Data Data Analyses Data Set Development Diagnosis Discipline Disease Disease Association Drosophila genome Engineering Epidemiologic Studies Epidemiology Evaluation Evidence Based Medicine Extramural Activities Fostering Funding Funding Agency Gene Expression Genes Genomics Goals Grant Grant Review Process Guidelines High Performance Computing Human Image Imagery Individual Informatics Information Resources Investigation Job Description Knowledge Leadership Linguistics Location Lysosomes Machine Learning Management Information Systems Maps Measures Medical Melissa Methodology Methods Metric Modeling Molecular Bank Mouth Diseases National Institute of Allergy and Infectious Disease National Institute of Dental and Craniofacial Research National Institute of Diabetes and Digestive and Kidney Diseases National Institute of Neurological Disorders and Stroke Natural Language Processing Occupational Online Systems Ontology Outcome Pattern Peer Review Performance Production Proteins Proteomics Protocols documentation RNA Reporting Research Research Infrastructure Research Personnel Research Project Grants Resource Sharing Resources Retrieval Running Saliva Salivary Proteins Science Scientist Screening procedure Semantics Side Software Engineering Software Tools Specific qualifier value System Systemic disease Systems Biology Techniques Technology Testing Text Training Transcription Initiation Site Translational Research United States National Institutes of Health Work base behavioral/social science biological systems biomedical informatics cancer epidemiology cell type cluster computing comparative effectiveness data management data mining data sharing effectiveness research evidence base experience fluorescence imaging high throughput screening improved information organization innovation insight interoperability knowledge base novel peer programs research study response social social science research text searching tool

Project Abstract

The Text Analytics, Knowledge Engineering, and High Performance Computing Program, which operates within the High Performance Computing and Informatics Office (HPCIO), Division of Computational Bioscience of CIT, is collaborating with NIH investigators to build a critical mass in text and numerical analytics that is envisioned to encompass a number of pertinent and related disciplines in biomedical research including semantic interoperability, knowledge engineering, computational linguistics, text and data mining, natural language processing, machine learning, and visualization. The program is intended to foster advances in critical domains at NIH including biomedical and clinical informatics, translational research, genomics, proteomics, systems biology, "big data" analysis, and portfolio analysis. In 2012, collaborative efforts in support of these goals included the following.

- The human salivary protein catalog has been made available online on a community-based Web portal developed by HPCIO, in collaboration with NIDCR, to enable scientists to add their own research data, share results, and discover new knowledge. This is a major step towards the discovery and use of saliva biomarkers to diagnose oral and systemic diseases.
- In collaboration with NCI, HPCIO is investigating document classifiers trained using machine-learning methods. One aspect of this collaboration involves the development of a system to match ClinicalTrials.gov protocols with their funding source in IMPAC II. The need for such a system is motivated by the fact that the NIH project number is specified in only 20% to 25% of all NIH-sponsored protocols. An intended outcome of this matching system is improved classifier performance by augmenting grant document text with matching protocol text.
- In response to input from various collaborative groups, HPCIO is developing a portfolio visualization resource, dubbed PViz, that integrates visualization of categorical data with results of clustering algorithms, to allow analysts to gain new insight into their data. Users may either construct a portfolio from IMPAC II data or import their own custom portfolio of categorical data.
- In collaboration with the Division of Program Coordination, Planning, and Strategic Initiatives (DPCPSI/OD), we have trained a "one-sided" classifier on a set of Comparative Effectiveness Research (CER) exemplars. The results of this investigation suggest that, when coupled with an effective annotation strategy, such a classifier can be quite effective at retrospectively identifying CER grants.
- HPCIO has demonstrated the utility of its integrated portfolio clustering and visualization resource on NIAID's Anti-Microbial Resistance portfolio. The current focus of the collaboration with NIAID is to investigate various machine-learning methods (including unsupervised, semi-supervised, and fully supervised algorithms) to map projects to NIAID HIV/AIDS priorities, objectives, and initiatives.
- HPCIO has been collaborating with the Molecular Libraries Program (MLP), part of the NIH Common Fund, to develop the Common Assay Reporting System (CARS). CARS is an integrated system for managing bioassay information and facilitating communication between all the high-throughput screening centers within the Molecular Libraries Probe Production Centers Network (MLPCN). Goals for this collaboration include:
  1) Track project status and related issues at each of the screening centers within the MLPCN, and provide the means for information collection, sharing, and retrieval among the centers and the program office at NIH.
  2) Establish a standardized protocol to describe raw data from the experiments and report screening data to the scientific community.
- A novel statistical test has been developed to identify differentially expressed RNA from RNA-seq count data. This work will provide a better idea of the biological differences between cell types.
- HPCIO is working with Melissa Friesen of NCI to develop methodologies to improve exposure classification in occupational epidemiologic studies. The initial effort in this collaboration involves a tool that helps experts classify free-text job descriptions into standard occupational codes. Machine-learning-based classification methods will also be used to help evaluate exposure-disease associations.
- In collaboration with NINDS, HPCIO has implemented and compared several methods to locate and characterize lysosomes in 3-D fluorescence images. The goal is to be able to calculate the pH of each lysosome in the image, for which the ability to resolve their locations is an important step.
- Machine-learning methods have been devised and implemented to identify and refine transcription start sites in the fruit fly genome found using cap analysis gene expression (CAGE). This effort is in collaboration with Brian Oliver of NIDDK.
- We are applying machine-learning methods to identify important terms that peer reviewers use to describe innovative applications. The goal of the effort is to develop a lexicon of terms that can help estimate the innovation level of a grant application based on peer review critiques from the application's NIH Summary Statement.
- HPCIO is working with NINDS and the Office of Extramural Research (OER) to determine peer-review sentiment of grant applications based on the NIH Summary Statement. The sentiment analysis results can provide decision support information to NIH program directors considering applications for selective pay.
- In collaboration with NIA, we are applying machine learning and visualization techniques on mass biological datasets to discover novel patterns of functional gene or protein interactions as related to aging. Omnimorph, a graphic data analysis tool, is being developed for multidimensional data visualization.
- Although the scientific impact of NCI consortia on the advancement of cancer epidemiology research is understood to be significant, accurate quantitative metrics of this impact are needed by program leadership. We are developing methods to track citations to clinical guidelines in the context of evidence-based medicine that could provide funding agencies and program directors insight into individual consortia's contributions in advancing medical knowledge. This work is being conducted in collaboration with the Epidemiology and Genomics Research Program (EGRP), NCI.
- In collaboration with George Chacko of CSR, HPCIO is applying text analytics to provide CSR leadership with evidence-based decision support in evaluation of the grant review process. The effort so far has concentrated on exploratory analysis against the NIH portfolio to evaluate clustering methods and assess intrinsic measures of cluster quality.
- Based on its experience in building novel models for classifying research grants and projects, HPCIO is collaborating with DPCPSI/OD and NCI to develop a comprehensive classification workflow system that will allow users to select from multiple classification algorithms, feature spaces, and training regimes, to build and run their own classifiers.
- The Office of Behavioral and Social Sciences Research (OBSSR) is conducting a pilot investigation in collaboration with HPCIO to evaluate the efficacy of machine learning models for the classification of five BSSR-relevant research categories.
- NIA and the Alzheimer's Association have developed a Common Alzheimer's Disease Research Ontology (CADRO) to categorize Alzheimer's disease research. HPCIO is collaborating with NIA to develop classifiers for its six categories, 45 topics, and 145 themes.
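
The abstract describes matching ClinicalTrials.gov protocols to their funding source in IMPAC II but does not say how the match is computed. As one plausible illustration only, the sketch below ranks candidate grants by TF-IDF cosine similarity to a protocol's text. The function names, the grant identifiers, and the sample texts are all hypothetical, and the weighting scheme is an assumption, not the program's method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF keeps weights positive even for terms in every document.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_grant_match(protocol_text, grants):
    """Return (grant_id, similarity) for the grant most similar to the protocol.

    `grants` maps a (hypothetical) grant identifier to its document text.
    """
    ids = list(grants)
    vectors = tfidf_vectors([grants[g] for g in ids] + [protocol_text])
    protocol_vec = vectors[-1]
    # zip stops at the last grant, so the protocol is not compared with itself.
    scores = [(cosine(protocol_vec, vec), gid) for gid, vec in zip(ids, vectors)]
    score, gid = max(scores)
    return gid, score


if __name__ == "__main__":
    sample_grants = {  # made-up texts for illustration
        "GRANT-A": "salivary protein biomarkers for oral disease diagnosis",
        "GRANT-B": "antimicrobial resistance surveillance in hospital infections",
    }
    print(best_grant_match("trial of saliva protein biomarkers for oral disease",
                           sample_grants))
```

In practice a similarity threshold would be needed so that protocols with no plausible grant are left unmatched; that threshold would have to be tuned on protocols whose NIH project number is known.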
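
The "one-sided" classifier trained on CER exemplars learns from positive examples only. The abstract does not name the algorithm, so the following is a minimal sketch of one simple approach under stated assumptions: documents are bag-of-words vectors, the model is the centroid of the exemplars, and a document is accepted when its cosine similarity to the centroid reaches a threshold derived from the exemplars themselves. The class name, the `slack` parameter, and its default value are hypothetical.

```python
import math
import re
from collections import Counter


def unit_vector(text):
    """Bag-of-words vector scaled to unit length."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def dot(u, v):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


class CentroidOneClassClassifier:
    """Accepts documents that lie close to the centroid of positive exemplars."""

    def fit(self, exemplars, slack=0.8):
        # Assumes at least one non-empty exemplar.
        vectors = [unit_vector(x) for x in exemplars]
        centroid = Counter()
        for vec in vectors:
            for term, weight in vec.items():
                centroid[term] += weight / len(vectors)
        norm = math.sqrt(sum(w * w for w in centroid.values()))
        self.centroid = {t: w / norm for t, w in centroid.items()}
        # Threshold: a fraction (slack) of the lowest exemplar-to-centroid
        # similarity. The value of slack is an illustrative assumption.
        self.threshold = slack * min(dot(v, self.centroid) for v in vectors)
        return self

    def score(self, text):
        """Cosine similarity between the document and the exemplar centroid."""
        return dot(unit_vector(text), self.centroid)

    def predict(self, text):
        """True if the document is similar enough to the exemplars."""
        return self.score(text) >= self.threshold
```

Because no negative examples are used, the threshold controls the trade-off between missed CER grants and false alarms, which is one reason an annotation strategy for reviewing borderline documents matters.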
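
The abstract mentions assessing intrinsic measures of cluster quality without naming one. The silhouette coefficient is a widely used example of such a measure, and the sketch below computes it in plain Python for points given as numeric tuples. Choosing the silhouette and Euclidean distance here is an assumption made for illustration; the source does not state which measures HPCIO evaluated.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # by convention a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself does not change the sum).
        a = sum(euclidean(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

A score near 1 indicates compact, well-separated clusters, while a score near 0 or below suggests overlapping or misassigned clusters. Comparing the score across clustering methods or cluster counts needs no labeled data, which is what makes the measure intrinsic.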
