Personalized Risk Predictions with Deep Learning Methods in the Presence of Missing and Biased Electronic Health Record Data
Basic information
- Grant number: 10646324
- Principal investigator:
- Amount: $330,700
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2021
- Funding country: United States
- Project period: 2021-08-06 to 2025-05-31
- Project status: Ongoing
- Source:
- Keywords: Age; Algorithms; Benchmarking; Biological Markers; Calibration; Cardiovascular Diseases; Characteristics; Chronic Disease; Clinical; Clinical Data; Clinical Medicine; Clinical Research; Computers; Computing Methodologies; Data; Data Set; Development; Diagnosis; Disease; Disease Outcome; Disparate; Drug Prescriptions; Electronic Health Record; Frequencies; Gender; Health; Health Surveys; Healthcare; Hospitals; Individual; Institution; Interview; Link; Medical; Methodology; Methods; Modeling; Neural Network Simulation; New York; New York City; Non-Insulin-Dependent Diabetes Mellitus; Outcome; Patients; Physical assessment; Population; Probability; Procedures; Race; Records; Reproducibility; Research; Research Personnel; Retirement; Risk; Risk Factors; Sample Size; Sampling; Scientist; Site; Software Tools; Statistical Methods; Surveys; System; Universities; Validation; Variant; Visit; cohort; data modeling; data standards; deep learning; demographics; electronic health data; experience; flexibility; health care service utilization; improved; innovation; interest; learning strategy; maltreatment; patient population; patient subsets; personalized risk prediction; population based; population health; predictive modeling; recurrent neural network; risk prediction; risk prediction model; web app
Abstract
Since 2010, clinical medicine has benefited from a rapid surge of clinical research on chronic diseases using
data from electronic health records (EHRs). EHRs are appealing because they can offer large sample sizes,
timely information, and a wealth of clinical information beyond that obtained from either health surveys or
administrative data. However, although large EHR databases include millions of patient records, these records are not
population-representative random samples, a constraint that can bias inferences based on such data
and has therefore limited their utility for population health research. EHR data typically contain multiple types
of biases, particularly: 1) sampling inclusion bias: EHR data include information only on patients who visit
participating medical systems, and they primarily capture data when patients are ill. Even among populations
with a particular disease, EHRs tend to over-represent individuals who are sicker and
have higher health care utilization; 2) sampling frequency bias: the numbers of patient encounters and
recorded features in EHRs vary in frequency, and these frequencies correlate with both patients' characteristics
and outcomes; and 3) institution bias: the EHR sample of any hospital reflects the characteristics of the patient
population served by that specific hospital. Consequently, EHR-based risk prediction models will have 1)
biases in risk factor selection and estimation for population inferences; 2) disparate mistreatment (unfairness),
that is, variation in a model's prediction accuracy across patient subgroups (such as gender, race, and age)
with different sampling inclusion probabilities or frequencies; and 3) predictions biased toward the characteristics
of patients served by the local hospitals. We propose to develop: 1) an effective sample-weighting method to
correct biases in risk factor selection and estimation for population inferences (Aim 1); 2) a flexible deep learning
method for EHR-based personalized risk prediction with fairness criteria (Aim 2); and 3) an innovative calibration method
to improve the reproducibility of EHR-based risk models between institutions (Aim 3). We will predict the risk of
subsequent incident cardiovascular disease (CVD) in patients with type 2 diabetes (T2DM) as a demonstration
of the methodology. These methods will be generally applicable to other disease
outcomes and populations of interest. To develop and validate these methods, we propose to analyze three
unique datasets: 1) the New York University Langone Health EHR data (NYU-CDRN, 2009 to present), including
demographics, vitals, diagnoses, lab results, prescriptions, and procedures; 2) the New York City Clinical Data
Research Network (NYC-CDRN), an EHR network comprising 20 NYC healthcare institutions, including the
NYU-CDRN, with longitudinally linked data on >12 million patient encounters under a Common Data Model;
and 3) the Health and Retirement Survey (HRS, begun in 1992 and ongoing), a benchmark population-based
cohort that has nationally representative health interview data spanning more than 20 years, as well as biomarkers,
physical assessment information, prescription drug data, and claims linkages.
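
Two ideas in the abstract can be shown on toy numbers: reweighting a sample by inverse inclusion probability (the concern behind Aim 1) and comparing error rates across patient subgroups (the "disparate mistreatment" behind Aim 2). The sketch below is a generic illustration, not the project's method. The function names, the toy data, and the assumption that inclusion probabilities are known are all hypothetical. In practice those probabilities would have to be estimated, for example against a benchmark cohort.

```python
from collections import defaultdict


def ipw_mean(values, inclusion_probs):
    """Self-normalized inverse-probability-weighted mean.

    Each record is weighted by 1 / (its probability of being in the sample),
    so under-sampled kinds of patients count for more.
    """
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def subgroup_error_rates(y_true, y_pred, groups):
    """False positive rate and false negative rate for each subgroup."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        if truth == 1:
            c["pos"] += 1
            c["fn"] += int(pred == 0)
        else:
            c["neg"] += 1
            c["fp"] += int(pred == 1)
    return {
        g: {
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
            "fnr": c["fn"] / c["pos"] if c["pos"] else float("nan"),
        }
        for g, c in counts.items()
    }


# Toy sample (hypothetical): sicker patients (outcome 1) are far more likely
# to appear in the records than healthier patients (outcome 0).
outcomes = [1, 1, 1, 0]
inclusion_probs = [0.8, 0.8, 0.8, 0.2]  # assumed known for illustration only

naive = sum(outcomes) / len(outcomes)         # unweighted outcome rate
weighted = ipw_mean(outcomes, inclusion_probs)  # reweighted outcome rate

# Toy predictions (hypothetical) for two subgroups, A and B.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
rates = subgroup_error_rates(y_true, y_pred, groups)
```

In this toy sample the unweighted outcome rate overstates the reweighted one, because the over-sampled records all carry the outcome. The gap between the per-group error rates in `rates` is one simple way to quantify unequal prediction accuracy across subgroups.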
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Padhraic Smyth
Personalized Risk Predictions with Deep Learning Methods in the Presence of Missing and Biased Electronic Health Record Data
- Grant number: 10463550
- Fiscal year: 2021
- Funding amount: $330,700
- Project category:
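Similar National Natural Science Foundation of China grants
Identification of targeted-drug sensitivity biomarkers from tumor pathology images and research on statistical algorithms
- Grant number: 82304250
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Research on deepfake detection algorithms driven by multimodal high-level semantics
- Grant number: 62306090
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Research on high-accuracy remote sensing algorithms for sea surface albedo
- Grant number: 42376173
- Approval year: 2023
- Funding amount: CNY 510,000
- Project category: General Program
Identifying the molecular mechanisms of non-coding-region splicing mutations in amyotrophic lateral sclerosis using novel deep learning algorithms and multi-omics research strategies
- Grant number: 82371878
- Approval year: 2023
- Funding amount: CNY 490,000
- Project category: General Program
Research on accurate segmentation algorithms for cardiac MR images based on deep learning and level set methods
- Grant number: 62371156
- Approval year: 2023
- Funding amount: CNY 500,000
- Project category: General Program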
Similar overseas grants
Hybrid Intelligence for Trustable Diagnosis And Patient Management of Prostate Cancer (HIT-PIRADS)
- Grant number: 10611212
- Fiscal year: 2023
- Funding amount: $330,700
- Project category:
Development of an Efficient High Throughput Technique for the Identification of High-Impact Non-Coding Somatic Variants Across Multiple Tissue Types
- Grant number: 10662860
- Fiscal year: 2023
- Funding amount: $330,700
- Project category:
Systematic Assessment of Combinatorial Transcription Factor Activity
- Grant number: 10897439
- Fiscal year: 2023
- Funding amount: $330,700
- Project category:
Predicting Clinical Phenotypes in Crohn's Disease Using Machine Learning and Single-Cell 'omics
- Grant number: 10586795
- Fiscal year: 2023
- Funding amount: $330,700
- Project category:
Pathophysiological Evidence Driven Management of GERD in Neonatal ICU Infants: Randomized Controlled Trial
- Grant number: 10717324
- Fiscal year: 2023
- Funding amount: $330,700
- Project category: