Significance Based Procedures for Mining and Prediction of Large Data Sets

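Basic information

  • Award number:
    0907177
  • Principal investigator:
  • Amount:
    $210,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2009
  • Funding country:
    United States
  • Duration:
    2009-09-01 to 2013-08-31
  • Project status:
    Completed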

Project abstract

Exploratory methods play a critical role in the understanding of large data sets, regardless of their origin, and are typically the first step in their analysis. The investigator is studying the development and use of exploratory, data-mining methods that identify patterns or regularities in high-dimensional data. The specific focus of his research is the problem of identifying sample-variable associations in large data sets that may arise from multiple measurement technologies. In the typical case where the data from an experiment are represented in the form of a rectangular matrix, sample-variable associations correspond to distinguished submatrices of the data matrix. The investigator is developing a statistically principled, significance-based approach to the problem of finding large average submatrices of a data matrix, using a simple iterative algorithm. The algorithm is applicable to real-valued and categorical data matrices. In addition to the basic method, the investigator is developing several extensions, including data-driven null models that incorporate dependence between variables, data arising from the simultaneous application of multiple measurement technologies, and application of the basic method to prediction problems such as classification, regression and survival analysis. In addition, the investigator is developing basic theory to support the use of the algorithm, and to assess the structure of data matrices under the different null models. The development and application of the methods is being carried out in close collaboration with several groups of biomedical researchers. In particular, the new data mining methodology is being incorporated into software that is used by collaborating scientists to identify and assess significant sample-variable associations in ongoing experiments involving breast, brain and lung cancer.

Large data sets are now common in many experimental areas of science, and in particular gene-level studies of human diseases such as cancer. In such studies it is not unusual to encounter experiments containing from hundreds to thousands of samples, and tens of thousands to millions of measurements on each sample. Large data sets are part of a trend away from traditional hypothesis-driven scientific research towards data-driven research, in which researchers explore large data sets for patterns or regularities that, in conjunction with subject matter expertise, yield hypotheses that can be tested by more traditional means. The investigator is studying an exploratory method that identifies statistically significant associations between samples and variables in large data sets, associations that can yield testable scientific hypotheses. The methods being developed by the investigator are computationally efficient, and are based on established statistical principles, in particular the notion of statistical significance. The investigator is also studying ways in which the basic exploratory method can be applied to data arising from multiple measurement technologies, and application of the basic method to statistical problems such as classification and survival analysis. These activities are being carried out as part of a collaborative research program involving the sustained interactions of faculty and students from the statistical, biological, and medical sciences. The exploratory method developed by the investigator is being integrated into the basic exploratory tools of the collaborating scientists, and is a component in the analysis of several new, previously unanalyzed, data sets.
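
The abstract does not spell out the iterative algorithm, so the following is only a minimal sketch of one plausible alternating search for a submatrix with a large average, written in Python with NumPy. The function names, the fixed submatrix size (k rows by l columns), the number of random restarts, and the i.i.d. standard normal null used in the score are all assumptions made for illustration; they are not details taken from the award.

```python
import math

import numpy as np


def large_average_submatrix(X, k, l, n_restarts=20, max_iter=100, seed=0):
    """Alternating search for a k-by-l submatrix of X with a large average.

    Hypothetical helper: starting from random columns, it repeatedly picks
    the k rows with the largest sums over the current columns, then the
    l columns with the largest sums over those rows, until the column set
    stops changing. The best result over several restarts is returned.
    """
    rng = np.random.default_rng(seed)
    n_cols = X.shape[1]
    best_avg, best_rows, best_cols = -np.inf, None, None
    for _restart in range(n_restarts):
        cols = np.sort(rng.choice(n_cols, size=l, replace=False))
        for _step in range(max_iter):
            # Best k rows given the current columns.
            rows = np.sort(np.argsort(X[:, cols].sum(axis=1))[-k:])
            # Best l columns given those rows.
            new_cols = np.sort(np.argsort(X[rows, :].sum(axis=0))[-l:])
            if np.array_equal(new_cols, cols):
                break  # the column set is stable, so the rows are too
            cols = new_cols
        avg = float(X[np.ix_(rows, cols)].mean())
        if avg > best_avg:
            best_avg, best_rows, best_cols = avg, rows, cols
    return best_avg, best_rows, best_cols


def bonferroni_log_score(avg, k, l, m, n):
    """Negative log of a Bonferroni-style bound on the chance of seeing a
    k-by-l submatrix with average >= avg in an m-by-n matrix.

    Assumption: entries are i.i.d. N(0, 1) under the null, so the average
    of k*l entries is N(0, 1/(k*l)). Larger scores mean stronger evidence.
    """
    z = avg * math.sqrt(k * l)
    tail = 0.5 * math.erfc(z / math.sqrt(2))  # upper normal tail P(Z >= z)
    if tail == 0.0:
        return math.inf  # tail probability underflowed to zero
    log_count = math.log(math.comb(m, k)) + math.log(math.comb(n, l))
    return -(log_count + math.log(tail))


if __name__ == "__main__":
    # Synthetic data: noise plus an elevated 10-by-8 block in one corner.
    demo_rng = np.random.default_rng(1)
    X = demo_rng.normal(size=(60, 40))
    X[:10, :8] += 3.0
    avg, rows, cols = large_average_submatrix(X, k=10, l=8)
    print("rows:", rows)
    print("cols:", cols)
    print("average:", avg)
    print("score:", bonferroni_log_score(avg, 10, 8, *X.shape))
```

Each alternating step cannot decrease the submatrix sum, so the search settles on a local optimum; the random restarts are a simple hedge against poor starting points. A fuller treatment would also search over the submatrix size and use a null model suited to the data, as the abstract indicates.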

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Andrew Nobel

Inference for Stationary Processes: Optimal Transport and Generalized Bayesian Approaches
  • Award number:
    2113676
  • Fiscal year:
    2021
  • Funding amount:
    $210,000
  • Project type:
    Standard Grant
Iterative testing procedures and high-dimensional scaling limits of extremal random structures
  • Award number:
    1613072
  • Fiscal year:
    2016
  • Funding amount:
    $210,000
  • Project type:
    Continuing Grant
Optimality Landscapes and Exploratory Data Analysis
  • Award number:
    1310002
  • Fiscal year:
    2013
  • Funding amount:
    $210,000
  • Project type:
    Standard Grant
Analysis of High Dimensional Data Using Subspace Clustering
  • Award number:
    0406361
  • Fiscal year:
    2004
  • Funding amount:
    $210,000
  • Project type:
    Continuing Grant
Estimation from Dynamical Systems and Individual Sequences
  • Award number:
    9971964
  • Fiscal year:
    1999
  • Funding amount:
    $210,000
  • Project type:
    Standard Grant
Mathematical Sciences: Greedy Growing and its Applications
  • Award number:
    9501926
  • Fiscal year:
    1995
  • Funding amount:
    $210,000
  • Project type:
    Continuing Grant

Similar grants from the National Natural Science Foundation of China

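Mechanism of arterial injury in selenium-deficient chickens, studied through RNA m6A methylation regulation of necroptosis
  • Award number:
    32372967
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project type:
    General Program
Intrinsic mechanism of "toxin damaging the kidney collaterals" in diabetic nephropathy, studied through lysosomal cathepsin leakage inducing different forms of programmed cell death
  • Award number:
    82374382
  • Approval year:
    2023
  • Funding amount:
    CNY 480,000
  • Project type:
    General Program
Pulse-width-modulated, rare-earth-based orthogonal-emission nanocarriers for programmed, controllable transmembrane gene delivery
  • Award number:
    52372148
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project type:
    General Program
Development of a quantum transport program based on a plane-wave basis set combined with the non-equilibrium Green's function method
  • Award number:
    22303098
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
An artificial signaling pathway strategy based on RNA catalytic hairpin assembly for controlling multiple responsive CRISPR/dCas9 transcriptional programs
  • Award number:
    22304052
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund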

Similar overseas grants

Multilevel investigation of uncertain and reclassified genomic variants in clinical oncology
  • Award number:
    10640387
  • Fiscal year:
    2021
  • Funding amount:
    $210,000
  • Project type:
Multilevel investigation of uncertain and reclassified genomic variants in clinical oncology
  • Award number:
    10705219
  • Fiscal year:
    2021
  • Funding amount:
    $210,000
  • Project type:
New quantitative approaches to interpret variant pathogenicity
  • Award number:
    10301093
  • Fiscal year:
    2021
  • Funding amount:
    $210,000
  • Project type:
Artificial Intelligence Enabled Multi-Spectral Autofluorescence Imaging for Real-time Determination of Muscle in Bladder Tumor During Resection
  • Award number:
    10325131
  • Fiscal year:
    2021
  • Funding amount:
    $210,000
  • Project type:
Global significance test based on quantile regression with applications to genomic studies of Alzheimer’s disease
  • Award number:
    10303743
  • Fiscal year:
    2021
  • Funding amount:
    $210,000
  • Project type: