Developing Machine Learning Models for the Analysis of Splicing Data in Large Heterogeneous Cohorts


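Basic information

  • Grant number:
    10506326
  • Principal investigator:
  • Amount:
    $46,800
  • Host institution:
  • Host institution country:
    United States
  • Project category:
  • Fiscal year:
    2021
  • Funding country:
    United States
  • Project period:
    2021-08-01 to 2024-07-31
  • Project status:
    Completed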

Project abstract

Analysis of RNA sequencing (RNASeq) data obtained from large patient cohorts can reveal transcriptomic perturbations that are associated with complex disease and facilitate the identification of disease subtypes. This is typically framed as an unsupervised learning task: discovering latent structure in a matrix of RNASeq-based quantifications of gene expression or local splicing variations (LSVs). However, several factors make analysis of such heterogeneous data challenging. First, such datasets comprise samples processed at multiple institutions, which might employ different sequencing protocols and quality control steps. This introduces confounding factors into the data, such as inconsistent sample quality or variable cell type proportions, which can hinder detection of true biological signal. Second, in acute myeloid leukemia (AML), mutations in splice factor genes occurring in a subset of the patients may alter only a subset of coregulated splicing events. Thus, instead of measuring global similarity between samples based on all transcriptomic features, there is a need to efficiently identify “tiles”, each defined by a subset of samples and splicing events with abnormal signals. Although several algorithms have been proposed for this task, they fail to overcome many of the computational challenges associated with modeling splicing data and are not well suited to handling missing values. To facilitate analysis of heterogeneous splicing datasets by reducing false positive discoveries and boosting true biological signal, we will first develop a model to correct for the effects of RNA degradation and cell type mixtures. Then, in order to efficiently identify AML subtypes characterized by splicing events and to account for splicing-specific modeling challenges, we propose CHESSBOARD (Characterizing Heterogeneity of Expression and Splicing by Search for Blocks of Abnormalities and Outliers in RNA Datasets), a nonparametric Bayesian model for unsupervised discovery of tiles. We will apply our model to synthetic datasets and show that it outperforms several baseline approaches. Next, we will show that it recovers tiles characterized by known and novel splicing aberrations that are reproducible in multiple AML patient cohorts. Finally, we will show that the discovered tiles are correlated with response to therapeutics, pointing to the translational impact of our findings.

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

No data available

Data last updated: 2024-06-01

Other grants by David Wang

Developing Machine Learning Models for the Analysis of Splicing Data in Large Heterogeneous Cohorts
  • Grant number:
    10672974
  • Fiscal year:
    2021
  • Amount:
    $46,800
  • Project category:
Developing Machine Learning Models for the Analysis of Splicing Data in Large Heterogeneous Cohorts
  • Grant number:
    10315802
  • Fiscal year:
    2021
  • Amount:
    $46,800
  • Project category:
Neurodifferentiation/Stem Cell Unit
  • Grant number:
    10916077
  • Fiscal year:
  • Amount:
    $46,800
  • Project category:
Neurodifferentiation/Stem Cell Unit
  • Grant number:
    10708659
  • Fiscal year:
  • Amount:
    $46,800
  • Project category:

Similar grants from the National Natural Science Foundation of China

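Research on neuromorphic vision object recognition algorithms driven by spatiotemporal sequences
  • Grant number:
    61906126
  • Approval year:
    2019
  • Amount:
    CNY 240,000
  • Project category:
    Young Scientists Fund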
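Ontology-driven spatial semantic modeling of address data and address matching methods
  • Grant number:
    41901325
  • Approval year:
    2019
  • Amount:
    CNY 220,000
  • Project category:
    Young Scientists Fund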
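Optimized design of address mapping tables and memory access optimization for large-capacity solid-state drives
  • Grant number:
    61802133
  • Approval year:
    2018
  • Amount:
    CNY 230,000
  • Project category:
    Young Scientists Fund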
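Research on IP-address-driven multipath routing and traffic transmission control
  • Grant number:
    61872252
  • Approval year:
    2018
  • Amount:
    CNY 640,000
  • Project category:
    General Program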
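Research on memory safety defense techniques for objects targeted by memory attacks
  • Grant number:
    61802432
  • Approval year:
    2018
  • Amount:
    CNY 250,000
  • Project category:
    Young Scientists Fund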

Similar overseas grants

Alternatively spliced cell surface proteins as drivers of leukemogenesis and targets for immunotherapy
  • Grant number:
    10648346
  • Fiscal year:
    2023
  • Amount:
    $46,800
  • Project category:
Prognostic implications of mitochondrial inheritance in myelodysplastic syndromes after stem-cell transplantation
  • Grant number:
    10662946
  • Fiscal year:
    2023
  • Amount:
    $46,800
  • Project category:
Regulation and Manipulation of Innate Immunity During HIV Infection
  • Grant number:
    10874020
  • Fiscal year:
    2023
  • Amount:
    $46,800
  • Project category:
Multi-functional cellular therapies to overcome tumor heterogeneity and limit toxicity in acute myeloid leukemia
  • Grant number:
    10679763
  • Fiscal year:
    2023
  • Amount:
    $46,800
  • Project category:
Investigating the mechanism of SHP2 and BCL2 Inhibition in Acute Myeloid Leukemia (AML)
  • Grant number:
    10736325
  • Fiscal year:
    2023
  • Amount:
    $46,800
  • Project category: