DeconDTN: Deconfounding Deep Transformer Networks for Clinical NLP

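Basic Information

Award number:
10467107
Principal investigator:
Trevor Cohen
Amount:
approx. $345,300
Host institution:
UNIVERSITY OF WASHINGTON
Host institution country:
United States
Project category:
Fiscal year:
2022
Funding country:
United States
Project period:
2022-06-01 to 2026-02-28
Project status:
Ongoing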

Source:
https://reporter.nih.gov/project-details/10467107
Keywords:
Address Architecture Area Artificial Intelligence Automobile Driving Behavior Bridge to Artificial Intelligence COVID-19 Caring Characteristics Classification Clinical Clinical Services Cognitive Computer software Confounding Factors (Epidemiology) Coupled Data Data Aggregation Data Set Data Sources Dementia Development Diagnosis Diagnostic Ensure Equilibrium Evaluation Goals High Prevalence Individual Institution Investments Label Language Learning Linguistics Location Medical Methods Modeling Modification Natural Language Processing Nature Neural Network Simulation Outcome Output Participant Patients Performance Physicians Predictive text Prevalence Research SARS-CoV-2 positive Sampling Services Site Source Speech Systematic Bias Testing Text Time Training Transcript United States National Institutes of Health United States National Library of Medicine Update Vision Weight Work base coronavirus disease deep learning deep learning model design heterogenous data interest large datasets learning strategy loss of function machine learning model network models novel open source open source tool portability predictive modeling programs relating to nervous system statistical and machine learning

Project Abstract

Natural Language Processing (NLP) methods have been broadly applied to clinical problems, from recognition of clinical findings in physician notes to identification of transcribed speech samples indicating changes in cognitive status. Deep transformer networks (DTNs) have dramatically advanced NLP accuracy. These deep learning models have multiple hidden layers that may correspond to billions of trainable parameters, allowing them to apply information learned from training on large unlabeled corpora to a specific task of interest. However, their size leaves them especially vulnerable to confounding bias, induced by variables that can influence both the predictor (text) and the outcome (e.g., an associated diagnosis) of a predictive model. Such systematic biases are a recognized danger in the application of artificial intelligence methods to clinical problems, and are the focus of NLM NOT-LM-19-003, which invites applications proposing methods to identify and address them. Deep learning models in general require large amounts of training data, spurring initiatives to aggregate medical data from across institutional silos. This can increase data set size and enhance model portability, but leaves the resulting models vulnerable to confounding by provenance, where models learn to recognize the origin of dataset components and make biased predictions based on site-specific class distributions (e.g., COVID prevalence). Such models will assign classes based on indicators of dataset provenance, rather than diagnostically meaningful linguistic differences, and make erroneous predictions when the provenance-specific distributions at the point of deployment differ from those in the training set. Confounding of this nature is a pervasive problem that presents a fundamental barrier to the portability of trained models and threatens the utility of datasets assembled from across institutions and services.
Unlike in traditional statistical and machine learning models, feature representations in deep transformer networks are distributed across parameters spread throughout the entire network. New methods are needed to meet the challenge of identifying and mitigating the influence of confounding variables in such models. In the proposed research we will develop a systematic approach to Deconfounding Deep Transformer Networks (DeconDTN), embodied in an eponymous and publicly available set of open source tools for (1) identification of provenance-related biases, (2) mitigation of these biases using a novel set of validated methods, and (3) systematic evaluation of the resulting effects on model performance. While DeconDTN will be generally applicable, development and evaluation will occur in the context of three use cases involving data sets drawn from different sources: classification of speech transcripts from participants with dementia drawn from two locations, identification of goals-of-care discussions in clinical notes drawn from multiple studies involving a range of clinical services, and prediction of COVID-19 status in notes drawn from different clinical units. Our driving hypothesis is that the resulting models will make more accurate predictions in these heterogeneous datasets than corresponding models without correction for confounding by provenance.
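
The failure mode the abstract calls confounding by provenance can be illustrated with a toy calculation. The sketch below is not part of DeconDTN and does not use a transformer; it assumes a hypothetical "shortcut" predictor that has learned only the majority label at each site, and hypothetical site names and prevalences (80% and 20% positive in training, swapped at deployment). All function names are invented for illustration.

```python
from collections import Counter, defaultdict


def make_dataset(pos_rate_by_site, n_per_site=100):
    """Build (site, label) pairs with an exact positive rate per site."""
    data = []
    for site, rate in pos_rate_by_site.items():
        n_pos = round(rate * n_per_site)
        data += [(site, 1)] * n_pos + [(site, 0)] * (n_per_site - n_pos)
    return data


def fit_provenance_shortcut(data):
    """Learn the majority label per site.

    Stand-in (assumption) for a model that keys on provenance cues
    instead of diagnostically meaningful language.
    """
    counts = defaultdict(Counter)
    for site, label in data:
        counts[site][label] += 1
    return {site: c.most_common(1)[0][0] for site, c in counts.items()}


def accuracy(shortcut, data):
    """Fraction of examples where the site's majority label is correct."""
    return sum(shortcut[site] == label for site, label in data) / len(data)


# Hypothetical prevalences: the sites swap class balance at deployment.
train = make_dataset({"site_A": 0.8, "site_B": 0.2})
deploy = make_dataset({"site_A": 0.2, "site_B": 0.8})

shortcut = fit_provenance_shortcut(train)
train_acc = accuracy(shortcut, train)
deploy_acc = accuracy(shortcut, deploy)

print(f"training accuracy:   {train_acc:.2f}")
print(f"deployment accuracy: {deploy_acc:.2f}")
```

Under these assumed numbers the shortcut looks strong on the training distribution and collapses once the site-specific prevalences change, which is the portability problem the abstract describes. A real model would mix provenance cues with genuine linguistic signal, so the drop would typically be less extreme.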
