High-dimensional unsupervised learning, with applications to genomics
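Basic information
- Grant number: 8708556
- Principal investigator:
- Amount: $363,800
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2011
- Funding country: United States
- Project period: 2011-09-20 to 2016-08-31
- Project status: Completed
- Source: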
- Keywords: Address; Biological; Biological Assay; Cancer Patient; Categories; Classification; Computer software; Copy Number Polymorphism; DNA Sequence; DNA copy number; Data; Data Element; Data Set; Databases; Development; Disease; Drug Formulations; Elements; Gene Expression; Gene Expression Profile; Generations; Genes; Genomics; Glioblastoma; Goals; Graph; Individual; Joints; Lead; Learning; Left; Malignant Neoplasms; Malignant neoplasm of lung; Measurement; Medicine; Methodology; Methods; Methylation; Mining; Modeling; Network-based; Noise; Normal Statistical Distribution; Normal tissue morphology; Other Genetics; Outcome; Pathway interactions; Patients; Pattern; Performance; Principal Component Analysis; RNA Sequences; Research; Sampling; Signal Transduction; Single Nucleotide Polymorphism; Statistical Methods; Statistical Models; Stem cells; Structure; Techniques; Technology; Testing; Time; Tumor Subtype; Validation; Validity of Results; base; cancer therapy; cancer type; clinically relevant; disorder subtype; flexibility; gene discovery; leiomyosarcoma; malignant breast neoplasm; novel strategies; outcome forecast; response; stem cell biology; stem cell population; tool; tumor
Project summary
This project involves the development of statistical methodology for the analysis of large-scale
genomic data, such as gene expression, DNA copy number, and DNA sequencing data. In genomic studies,
the goal is often to identify signal in the data in an unsupervised way. For instance, given the gene expression
measurements for a set of patients with lung cancer, one might wish to discover previously unknown lung cancer
subtypes that are characterized by distinct gene expression signatures and that might differ with respect to prognosis
or response to therapy. However, the search for signal in genomic data is made difficult by the fact that the number
of variables (e.g. genes) is generally orders of magnitude greater than the number of observations (e.g. lung cancer
patients). As a result, principled methods must be developed to discover signal without overfitting. Furthermore,
there is a need for objective ways to assess the validity of results obtained.
This proposal has four specific aims, each of which involves the development of a new statistical method for
solving a problem that arises in the analysis of genomic data. Aim 1: A method to learn multiple related genomic
networks at once. For instance, one might expect that the gene expression networks for cancer and normal tissues
will look similar to each other, with certain specific differences. The current proposal will provide a way to learn both
networks simultaneously, in order to identify gene pathways that are perturbed in cancer. The proposed approach
involves applying shrinkage penalties to the Gaussian graphical model formulation for network estimation. Aim 2: A
principled approach for simultaneously clustering the rows and columns of a data matrix (e.g. patients and genes).
The standard approach for discovering signal in genomic data involves clustering rows and columns independently,
but the proposed approach will have increased power to discover biologically relevant clusters. The proposed
approach involves applying shrinkage penalties to the matrix-variate normal distribution. Aim 3: A tool for the
integrative analysis of multiple genomic data types collected on a single set of patient samples. For instance, if
gene expression data, copy number data, and methylation data are collected for a single set of samples, then this
will allow for the discovery of subsets of patients that are characterized by particular signatures of gene expression,
copy number variation, and methylation. This could lead to the discovery of clinically relevant subtypes of cancer
and other diseases. The proposed approach is an extension of the approach described in Aim 2. Aim 4: A flexible
framework for the validation of clusters discovered in structured genomic data, such as DNA copy number and single
nucleotide polymorphism data, in order to determine whether clusters discovered reflect signal or simply noise. The
proposed approach is related to cross-validation, and will be extended to develop a method for the validation of other
unsupervised statistical tools, such as those described in Aims 1-3 above.
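The idea in Aim 4 of checking clusters with a procedure related to cross-validation can be illustrated with a generic held-out agreement score. The sketch below is not the method the project proposes; it is a minimal illustration under assumed choices: a plain k-means clusterer, Euclidean distance, and the hypothetical helper names `kmeans` and `holdout_agreement`. It clusters a training half and a test half separately, then measures how often test points that the test clustering groups together are also grouped together by the training centroids. Scores near 1 suggest stable structure; low scores suggest the clusters may reflect noise.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


def holdout_agreement(train, test, k, seed=0):
    """Among test pairs placed in the same cluster by a clustering of the
    test set, the fraction also placed together by the training centroids."""
    train_centroids = kmeans(train, k, seed=seed)
    test_centroids = kmeans(test, k, seed=seed)
    by_train = [nearest(p, train_centroids) for p in test]
    by_test = [nearest(p, test_centroids) for p in test]
    together = agree = 0
    for i in range(len(test)):
        for j in range(i + 1, len(test)):
            if by_test[i] == by_test[j]:
                together += 1
                agree += by_train[i] == by_train[j]
    return agree / together if together else 0.0


# Toy data (made up for illustration): two well-separated groups.
train = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
test = [[0.5, 0.5], [1, 1], [10.5, 10.5], [11, 11]]
score = holdout_agreement(train, test, k=2)
```

In practice one would repeat the split several times and compare the score across candidate numbers of clusters, since a single split of a small sample is noisy.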
The statistical tools that result from the proposed research will be implemented in freely available software.
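To make Aim 1 concrete, one possible form of a penalized Gaussian graphical model for several related networks is sketched below. The summary does not specify the penalty, so the notation and the particular penalty terms are assumptions chosen for illustration: K classes (for example, cancer and normal tissue), n_k samples in class k, sample covariance matrices S^(k), and precision (inverse covariance) matrices Θ^(k) whose nonzero off-diagonal entries correspond to network edges.

```latex
% Assumed notation, not taken from the summary:
%   K classes, n_k samples in class k,
%   S^{(k)} = sample covariance, \Theta^{(k)} = precision matrix of class k.
\[
\max_{\Theta^{(1)},\dots,\Theta^{(K)} \succ 0}\;
\sum_{k=1}^{K} n_k \left[ \log\det \Theta^{(k)}
  - \operatorname{tr}\!\left( S^{(k)} \Theta^{(k)} \right) \right]
\;-\; \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} \left| \theta^{(k)}_{ij} \right|
\;-\; \lambda_2 \sum_{k < k'} \sum_{i,j}
  \left| \theta^{(k)}_{ij} - \theta^{(k')}_{ij} \right|
\]
```

The first sum is the Gaussian log-likelihood up to constants and scaling. In this sketch, the λ1 term shrinks individual entries toward zero to give sparse networks, and the λ2 term shrinks the networks toward one another, so that edges differ between classes only where the data support a difference.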
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Daniela Witten
CRCNS: Theory and Experiments to Elucidate Neural Coding in the Reward Circuit
- Grant number: 10225560
- Fiscal year: 2018
- Funding amount: $363,800
- Project category:
High-dimensional unsupervised learning, with applications to genomics
- Grant number: 8537224
- Fiscal year: 2011
- Funding amount: $363,800
- Project category:
High-dimensional unsupervised learning, with applications to genomics
- Grant number: 8212779
- Fiscal year: 2011
- Funding amount: $363,800
- Project category:
High-dimensional unsupervised learning, with applications to genomics
- Grant number: 8335437
- Fiscal year: 2011
- Funding amount: $363,800
- Project category:
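Similar grants from the National Natural Science Foundation of China
Biofouling effects on in situ DGT measurement of perfluorooctanoic acid and their underlying mechanisms
- Grant number:
- Year approved: 2022
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Biofouling effects on in situ DGT measurement of perfluorooctanoic acid and their underlying mechanisms
- Grant number: 42207312
- Year approved: 2022
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Integrated microfluidic chips for high-throughput, precise assay of biological specimens
- Grant number:
- Year approved: 2020
- Funding amount: CNY 600,000
- Project category: General Program
Rapid in situ determination of the biofilm activity of sulfate-reducing bacteria
- Grant number: 41876101
- Year approved: 2018
- Funding amount: CNY 620,000
- Project category: General Program
Sequencing and biological function of antimicrobial peptides from Cordyceps sinensis
- Grant number: 81803848
- Year approved: 2018
- Funding amount: CNY 210,000
- Project category: Young Scientists Fund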
Similar overseas grants
A HUMAN IPSC-BASED ORGANOID PLATFORM FOR STUDYING MATERNAL HYPERGLYCEMIA-INDUCED CONGENITAL HEART DEFECTS
- Grant number: 10752276
- Fiscal year: 2024
- Funding amount: $363,800
- Project category:
Strategies for next-generation flavivirus vaccine development
- Grant number: 10751480
- Fiscal year: 2024
- Funding amount: $363,800
- Project category:
Decoding AMPK-dependent regulation of DNA methylation in lung cancer
- Grant number: 10537799
- Fiscal year: 2023
- Funding amount: $363,800
- Project category:
Molecular basis of glycan recognition by T and B cells
- Grant number: 10549648
- Fiscal year: 2023
- Funding amount: $363,800
- Project category: