Significance Based Procedures for Mining and Prediction of Large Data Sets

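Basic Information

Award Number:
0907177
Principal Investigator:
Andrew Nobel
Amount:
$210,000
Institution:
University of North Carolina at Chapel Hill
Institution Country:
United States
Award Type:
Standard Grant
Fiscal Year:
2009
Funding Country:
United States
Project Period:
2009-09-01 to 2013-08-31
Project Status:
Completed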

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=0907177&HistoricalAwards=false
Keywords:
Significance Based Procedures Mining Prediction

Project Abstract

Exploratory methods play a critical role in the understanding of large data sets, regardless of their origin, and are typically the first step in their analysis. The investigator is studying the development and use of exploratory, data-mining methods that identify patterns or regularities in high-dimensional data. The specific focus of his research is the problem of identifying sample-variable associations in large data sets that may arise from multiple measurement technologies. In the typical case where the data from an experiment are represented in the form of a rectangular matrix, sample-variable associations correspond to distinguished submatrices of the data matrix. The investigator is developing a statistically principled, significance-based approach to the problem of finding large average submatrices of a data matrix, using a simple iterative algorithm. The algorithm is applicable to real-valued and categorical data matrices. In addition to the basic method, the investigator is developing several extensions, including data-driven null models that incorporate dependence between variables, data arising from the simultaneous application of multiple measurement technologies, and application of the basic method to prediction problems such as classification, regression and survival analysis. In addition, the investigator is developing basic theory to support the use of the algorithm, and to assess the structure of data matrices under the different null models. The development and application of the methods is being carried out in close collaboration with several groups of biomedical researchers. In particular, the new data mining methodology is being incorporated into software that is used by collaborating scientists to identify and assess significant sample-variable associations in ongoing experiments involving breast, brain and lung cancer.

Large data sets are now common in many experimental areas of science, and in particular gene-level studies of human diseases such as cancer. In such studies it is not unusual to encounter experiments containing from hundreds to thousands of samples, and tens of thousands to millions of measurements on each sample. Large data sets are part of a trend away from traditional hypothesis-driven scientific research towards data-driven research, in which researchers explore large data sets for patterns or regularities that, in conjunction with subject matter expertise, yield hypotheses that can be tested by more traditional means. The investigator is studying an exploratory method that identifies statistically significant associations between samples and variables in large data sets, associations that can yield testable scientific hypotheses. The methods being developed by the investigator are computationally efficient, and are based on established statistical principles, in particular the notion of statistical significance. The investigator is also studying ways in which the basic exploratory method can be applied to data arising from multiple measurement technologies, and application of the basic method to statistical problems such as classification and survival analysis. These activities are being carried out as part of a collaborative research program involving the sustained interactions of faculty and students from the statistical, biological, and medical sciences. The exploratory method developed by the investigator is being integrated into the basic exploratory tools of the collaborating scientists, and is a component in the analysis of several new, previously unanalyzed, data sets.
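
The abstract mentions a simple iterative algorithm for finding large average submatrices but gives no details. As an illustration only, the sketch below shows one plausible iterative scheme: fix a submatrix size of k rows by l columns, then alternate between picking the best rows for the current columns and the best columns for the current rows until nothing changes. The function name, the fixed sizes `k` and `l`, the random start, and the test data are all assumptions made for this example; the significance scoring and null models described in the abstract are not implemented here.

```python
import numpy as np


def find_large_average_submatrix(X, k, l, seed=None, max_iter=100):
    """Alternating search for a k-by-l submatrix of X with a large average.

    Hypothetical sketch: returns a local optimum, not a guaranteed global one.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Assumed initialisation: a random set of l columns.
    cols = np.sort(rng.choice(n, size=l, replace=False))
    rows = None
    for _ in range(max_iter):
        # Keep the k rows with the largest sums over the current columns.
        row_sums = X[:, cols].sum(axis=1)
        new_rows = np.sort(np.argsort(row_sums)[-k:])
        # Keep the l columns with the largest sums over those rows.
        col_sums = X[new_rows, :].sum(axis=0)
        new_cols = np.sort(np.argsort(col_sums)[-l:])
        converged = (
            rows is not None
            and np.array_equal(new_rows, rows)
            and np.array_equal(new_cols, cols)
        )
        rows, cols = new_rows, new_cols
        if converged:
            break
    average = X[np.ix_(rows, cols)].mean()
    return rows, cols, average


if __name__ == "__main__":
    # Synthetic data (an assumption): Gaussian noise with an elevated block.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 100))
    X[:20, :10] += 2.0
    rows, cols, average = find_large_average_submatrix(X, k=20, l=10, seed=1)
    print("rows:", rows)
    print("cols:", cols)
    print("average:", round(float(average), 3))
```

Because the search only reaches a local optimum, a practical use would likely repeat it from many random starts and over several choices of `k` and `l`, then rank the candidates by a significance score of the kind the abstract describes.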