Interactive machine learning methods for clinical natural language processing


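Basic information

Grant number:
8818096
Principal investigator:
HUA XU
Amount:
$558,400
Host institution:
UNIVERSITY OF TEXAS HLTH SCI CTR HOUSTON
Host institution country:
United States
Project category:
Fiscal year:
2010
Funding country:
United States
Project period:
2010-05-31 to 2018-09-28
Project status:
Completed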

Source:
https://reporter.nih.gov/project-details/8818096
Keywords:
Abbreviations; Active Learning; Address; Adoption; Algorithms; Attention; Biomedical Research; Classification; Clinical; Clinical Data; Clinical Informatics; Clinical Research; Cognitive; Communities; Data; Data Set; Development; Disease; Educational workshop; Electronic Health Record; Face; Goals; Grant; Human; Hybrids; Knowledge; Label; Learning; Linguistics; Machine Learning; Manuals; Medical; Methodology; Methods; Modeling; Names; Natural Language Processing; Patients; Pattern; Performance; Pharmaceutical Preparations; Physicians; Process; Research; Research Personnel; Research Priority; Resources; Sampling; Solutions; Source; Specific qualifier value; Statistical Methods; Statistical Models; System; Technology; Testing; Text; Time; United States National Library of Medicine; base; clinical application; clinical phenotype; cohort; computer human interaction; computerized; cost; experience; improved; model development; novel; open source; statistics; success; tool; usability

Project abstract

DESCRIPTION (provided by applicant): The growing deployment of electronic health record (EHR) systems has made massive amounts of clinical data available electronically. However, much of the detailed clinical information about patients is embedded in narrative text and is not directly accessible to computerized clinical applications. Therefore, natural language processing (NLP) technologies, which can unlock information in narrative documents, have received great attention in the medical domain. Current state-of-the-art NLP approaches often involve building probabilistic models. However, the wide adoption of statistical methods in clinical NLP faces two grand challenges: 1) the lack of large annotated clinical corpora; and 2) the lack of methodologies that can efficiently integrate linguistic and domain knowledge with statistical learning. High-performance statistical NLP methods rely on large-scale, high-quality annotations of clinical text, but creating large annotated clinical corpora is time-consuming and costly because it often requires manual review by physicians. Moreover, the medical domain is knowledge intensive. To achieve optimal performance, probabilistic models need to leverage medical domain knowledge. Therefore, methods that can efficiently integrate domain and expert knowledge with machine learning processes to quickly build high-quality probabilistic models with minimum annotation cost would be highly desirable for clinical text processing. In this study, we propose to investigate interactive machine learning (IML) methods to address the above challenges in clinical NLP. An IML system builds a classification model in an iterative process: it actively selects informative samples for annotation based on models built on previously annotated samples, thus reducing the annotation cost of model development.
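The iterative sample-selection loop described above can be illustrated with a minimal sketch of pool-based active learning. This is not the project's implementation: the tiny Naive Bayes classifier, the uncertainty-sampling rule (query the sample whose predicted probability is closest to 0.5), the function names, and the toy "RA" abbreviation data are all assumptions chosen for illustration. The `oracle` function stands in for a human annotator.

```python
import math
from collections import Counter


def train_nb(labeled):
    """Fit a tiny binary Naive Bayes model on (tokens, label) pairs."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for tokens, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def prob_positive(model, tokens):
    """Return P(label = 1 | tokens) using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in (0, 1):
        # Smoothed class prior, so a class with no examples is not impossible.
        score = math.log((class_counts[label] + 1) / (total + 2))
        # The extra +1 reserves probability mass for unseen tokens.
        denom = sum(word_counts[label].values()) + len(vocab) + 1
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp[1] / (exp[0] + exp[1])


def active_learning(seed, pool, oracle, budget):
    """Retrain, query the most uncertain pool sample, and repeat."""
    labeled = list(seed)
    pool = list(pool)
    for _ in range(min(budget, len(pool))):
        model = train_nb(labeled)
        # Uncertainty sampling: the posterior closest to 0.5 is least certain.
        pick = min(pool, key=lambda toks: abs(prob_positive(model, toks) - 0.5))
        pool.remove(pick)
        labeled.append((pick, oracle(pick)))  # ask the annotator for a label
    return train_nb(labeled), labeled


# Hypothetical toy data: does "RA" mean rheumatoid arthritis (1) or right atrium (0)?
seed = [(["joint", "pain", "methotrexate"], 1), (["echo", "dilated", "ventricle"], 0)]
pool = [["joint", "swelling"], ["atrial", "pressure"], ["echo", "ventricle"]]
oracle = lambda toks: int("joint" in toks)  # simulated annotator
model, labeled = active_learning(seed, pool, oracle, budget=2)
```

With a budget of two queries, the loop asks the simulated annotator about only two of the three pool samples, which is the sense in which selecting informative samples can reduce annotation cost.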
More importantly, an IML system also incorporates human input into the learning process (e.g., an expert can specify important features for a classification task based on domain knowledge). Thus, IML is an ideal framework for efficiently integrating rule-based (via domain experts specifying features) and statistics-based (via different learning algorithms) approaches to clinical NLP. To achieve our goal, we propose three specific aims. In Aim 1, we plan to investigate different aspects of IML for word sense disambiguation, including developing new active learning algorithms and conducting cognitive usability analysis for efficient feature annotation by users. In Aim 2, to demonstrate the broad uses of IML, we will extend IML approaches to two other important clinical NLP classification tasks: named entity recognition and clinical phenotyping. Finally, in Aim 3, we propose to disseminate the IML methods and tools to the biomedical research community.
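One simple way to picture expert-specified features for word sense disambiguation is to treat the expert's cue words as pseudo-counts added to a count-based classifier. The sketch below is only an assumption about how such a hybrid could look, not the method the project developed: the function names, the `cue_weight` parameter, the uniform prior over senses, and the "RA" example senses and cue words are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_with_expert_features(labeled, expert_cues, cue_weight=5.0):
    """Count context words per sense, then boost expert-chosen cue words."""
    counts = defaultdict(Counter)
    for tokens, sense in labeled:
        counts[sense].update(tokens)
    for sense, cues in expert_cues.items():
        for cue in cues:
            # Hypothetical weighting: each cue counts like several observations.
            counts[sense][cue] += cue_weight
    vocab = {tok for sense_counts in counts.values() for tok in sense_counts}
    return dict(counts), vocab


def predict_sense(model, tokens):
    """Pick the sense with the highest smoothed log-likelihood (uniform prior)."""
    counts, vocab = model
    best_sense, best_score = None, -math.inf
    for sense, sense_counts in counts.items():
        denom = sum(sense_counts.values()) + len(vocab) + 1
        score = sum(math.log((sense_counts[tok] + 1) / denom) for tok in tokens)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical example: two annotated contexts plus expert cue words for "RA".
labeled_contexts = [
    (["joint", "pain"], "rheumatoid arthritis"),
    (["echo", "dilated"], "right atrium"),
]
expert_cues = {
    "rheumatoid arthritis": ["methotrexate", "synovitis"],
    "right atrium": ["atrial", "catheter"],
}
wsd_model = train_with_expert_features(labeled_contexts, expert_cues)
```

In this sketch the cue words let the classifier resolve contexts such as `["atrial", "catheter", "placed"]` even though none of those words appear in the two annotated examples, which is the kind of benefit the abstract attributes to combining domain knowledge with statistical learning.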
