Developing Machine Learning Models for the Analysis of Splicing Data in Large Heterogeneous Cohorts
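Basic Information
- Grant number: 10506326
- Principal investigator:
- Amount: $46,800
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2021
- Funding country: United States
- Project period: 2021-08-01 to 2024-07-31
- Project status: Completed
- Source: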
- Keywords: Acute Myelocytic Leukemia, Address, Affect, Aftercare, Algorithms, Alternative Splicing, B-Cell Acute Lymphoblastic Leukemia, Bayesian Modeling, Biological, Blast Cell, Cancer Patient, Caring, Catalogs, Cells, Characteristics, Clinic, Code, Complex, Computer software, Computing Methodologies, Data, Data Set, Detection, Disease, Event, Excision, Follow-Up Studies, Gene Expression, Genes, Genetic, Goals, Hematologic Neoplasms, Heterogeneity, Individual, Institution, Letters, Malignant Neoplasms, Masks, Measures, Methods, Minority, Missense Mutation, Modeling, Modification, Multiomic Data, Mutation, Patients, Pharmaceutical Preparations, Process, Prognostic Marker, Protocols documentation, Quality Control, RNA, RNA Degradation, RNA Splicing, RNA analysis, Relapse, Reproducibility, Resources, Reverse Transcriptase Polymerase Chain Reaction, Sampling, Signal Transduction, Source, Statistical Models, Structure, Techniques, The Cancer Genome Atlas, Therapeutic, Time, Tissue Procurements, Training, Validation, Variant, Xenograft procedure, acute care, base, biobank, bioinformatics tool, cell type, clinically relevant, cohort, computerized tools, data integration, disease phenotype, disorder subtype, drug sensitivity, experience, heterogenous data, improved, leukemia, leukemogenesis, machine learning model, multiple data sources, multiple omics, new therapeutic target, non-Gaussian model, novel, patient subsets, personalized medicine, precision medicine, prognostic tool, response, tool, transcriptome sequencing, transcriptomics, translational impact, unsupervised learning
Project Abstract
Analysis of RNA sequencing (RNASeq) data obtained from large patient cohorts can reveal transcriptomic
perturbations that are associated with complex disease and facilitate the identification of disease subtypes.
This is typically framed as an unsupervised learning task to discover latent structure in a matrix of
RNASeq-based quantifications of gene expression or local splicing variations (LSVs). However, several factors
make analysis of such heterogeneous data challenging. First, such datasets are composed of samples processed
at multiple institutions, which might employ different sequencing protocols and quality control steps. This
introduces confounding factors into the data, such as inconsistent sample quality or variable cell type
proportions, which can hinder detection of true biological signal. Second, in acute myeloid leukemia (AML),
mutations in splice factor genes occurring in a subset of patients may alter only a subset of coregulated
splicing events. Thus, instead of measuring global similarity between samples based on all transcriptomic
features, there is a need to efficiently identify "tiles", each defined by a subset of samples and splicing
events with abnormal signals. Although several algorithms have been proposed for this task, they fail to
overcome many of the computational challenges associated with modeling splicing data and are not well suited
to handling missing values.
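To make the notion of a tile concrete, the following is a minimal sketch and not the method proposed in this project. It assumes splicing is summarized as a samples-by-events matrix of inclusion values in [0, 1] with missing entries stored as NaN; the function names `make_psi_matrix` and `tile_score`, and all parameter values, are hypothetical choices for illustration. It plants one tile in synthetic data and measures how far a given block deviates from each event's cohort-wide median.

```python
import numpy as np

def make_psi_matrix(n_samples=60, n_events=200, tile_samples=12, tile_events=15,
                    shift=0.4, missing_rate=0.1, seed=0):
    """Simulate a samples x events matrix with one planted tile and missing values."""
    rng = np.random.default_rng(seed)
    # Each event has its own baseline level; samples vary around it with small noise.
    baseline = rng.uniform(0.2, 0.6, size=n_events)
    psi = baseline + rng.normal(0.0, 0.05, size=(n_samples, n_events))
    # Planted tile: the first `tile_samples` samples are shifted on the first `tile_events` events.
    psi[:tile_samples, :tile_events] += shift
    psi = np.clip(psi, 0.0, 1.0)
    # Randomly mask entries to mimic events that could not be quantified in some samples.
    psi[rng.random(psi.shape) < missing_rate] = np.nan
    return psi

def tile_score(psi, sample_idx, event_idx):
    """Mean absolute deviation of a block from each event's cohort median, ignoring NaN."""
    median = np.nanmedian(psi, axis=0)
    deviation = np.abs(psi - median)
    return np.nanmean(deviation[np.ix_(sample_idx, event_idx)])

psi = make_psi_matrix()
inside = tile_score(psi, np.arange(12), np.arange(15))       # planted tile
outside = tile_score(psi, np.arange(12, 60), np.arange(15))  # other samples, same events
print(f"tile score inside: {inside:.3f}, outside: {outside:.3f}")
```

In this toy setting the tile covers only a small fraction of the matrix, which is why a similarity measure computed over all events would dilute the signal that the block-restricted score picks up.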
To facilitate analysis of heterogeneous splicing datasets by reducing false positive discoveries and boosting
true biological signal, we will first develop a model to correct for the effects of RNA degradation and cell
type mixtures. Then, to efficiently identify AML subtypes characterized by splicing events and to account for
splicing-specific modeling challenges, we propose CHESSBOARD (Characterizing Heterogeneity of
Expression and Splicing by Search for Blocks of Abnormalities and Outliers in RNA Datasets), a
non-parametric Bayesian model for unsupervised discovery of tiles. We will apply our models to synthetic
datasets and show that CHESSBOARD outperforms several baseline approaches. Next, we will show that it recovers
tiles characterized by known and novel splicing aberrations that are reproducible in multiple AML patient
cohorts. Finally, we will show that the discovered tiles are correlated with response to therapeutics,
pointing to the translational impact of our findings.
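The confounder-correction step can be illustrated with a deliberately simple stand-in. The abstract does not specify the correction model, so the sketch below assumes an ordinary least-squares fit per splicing event against known per-sample covariates (for example, a hypothetical RNA degradation score), skipping missing entries. The function name `regress_out` and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def regress_out(values, covariates):
    """Remove linear covariate effects from each column of `values` (samples x events).

    Missing entries (NaN) are excluded from each per-event fit and stay NaN in the output.
    """
    n_samples, n_events = values.shape
    # Design matrix: intercept plus one column per covariate.
    design = np.column_stack([np.ones(n_samples), covariates])
    residuals = np.full_like(values, np.nan, dtype=float)
    for j in range(n_events):
        observed = ~np.isnan(values[:, j])
        if observed.sum() <= design.shape[1]:
            continue  # too few observations to fit this event; leave it as NaN
        coef, *_ = np.linalg.lstsq(design[observed], values[observed, j], rcond=None)
        residuals[observed, j] = values[observed, j] - design[observed] @ coef
    return residuals

# Toy data: every event is partly driven by a per-sample degradation score.
rng = np.random.default_rng(1)
degradation = rng.normal(size=50)
values = 0.8 * degradation[:, None] + rng.normal(scale=0.1, size=(50, 5))
values[rng.random(values.shape) < 0.1] = np.nan
corrected = regress_out(values, degradation)
print(corrected.shape)
```

A linear fit is used here only because it is easy to verify; bounded splicing measurements and mixtures of cell types would likely call for a more careful model.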
Project Outcomes
Journal articles: 0
Monographs: 0
Research awards: 0
Conference papers: 0
Patents: 0
Other grants by David Wang
Developing Machine Learning Models for the Analysis of Splicing Data in Large Heterogeneous Cohorts
- Grant number: 10672974
- Fiscal year: 2021
- Amount: $46,800
- Project category:
Developing Machine Learning Models for the Analysis of Splicing Data in Large Heterogeneous Cohorts
- Grant number: 10315802
- Fiscal year: 2021
- Amount: $46,800
- Project category:
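Similar National Natural Science Foundation of China (NSFC) grants
Ontology-driven spatial semantic modeling of address data and address matching methods
- Grant number: 41901325
- Approval year: 2019
- Amount: 220,000 CNY
- Project category: Young Scientists Fund
Research on neuromorphic vision object recognition algorithms driven by spatiotemporal sequences
- Grant number: 61906126
- Approval year: 2019
- Amount: 240,000 CNY
- Project category: Young Scientists Fund
Research on memory-safety defense techniques for targets of memory attacks
- Grant number: 61802432
- Approval year: 2018
- Amount: 250,000 CNY
- Project category: Young Scientists Fund
Research on optimized address mapping table design and memory access optimization for large-capacity solid-state drives
- Grant number: 61802133
- Approval year: 2018
- Amount: 230,000 CNY
- Project category: Young Scientists Fund
Research on IP address-driven multipath routing and traffic transmission control
- Grant number: 61872252
- Approval year: 2018
- Amount: 640,000 CNY
- Project category: General Program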
Similar overseas grants
Clonal hematopoiesis and inherited genetic variation in sickle cell disease
- Grant number: 10638404
- Fiscal year: 2023
- Amount: $46,800
- Project category:
Synthetic introns for selective targeting of RNA splicing factor-mutant leukemia
- Grant number: 10722782
- Fiscal year: 2023
- Amount: $46,800
- Project category:
Alternatively spliced cell surface proteins as drivers of leukemogenesis and targets for immunotherapy
- Grant number: 10648346
- Fiscal year: 2023
- Amount: $46,800
- Project category:
Prognostic implications of mitochondrial inheritance in myelodysplastic syndromes after stem-cell transplantation
- Grant number: 10662946
- Fiscal year: 2023
- Amount: $46,800
- Project category: