Collaborative Research: Efficient Parallel Iterative Monte Carlo Methods for Statistical Analysis of Big Data

Basic Information

Award number:
1317131
Principal investigator:
Faming Liang
Amount:
$220,000
Awardee institution:
Texas A&M University
Awardee country:
United States
Award type:
Standard Grant
Fiscal year:
2013
Funding country:
United States
Period:
2013-08-01 to 2015-07-31
Status:
Completed

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1317131&HistoricalAwards=false
Keywords:
Collaborative Research Efficient Parallel Iterative

Project Abstract

The integration of computer technology into science and daily life has enabled the collection of massive volumes of data. Analyzing these data may require parallel and distributed architectures. While such architectures offer new capabilities for storing and manipulating big data, it is unclear, from an inferential point of view, how current statistical methodology can be transported to the big data paradigm. Growing data size also typically comes with growing complexity of the data structures and of the models needed to account for them. Iterative Monte Carlo algorithms, such as Markov chain Monte Carlo (MCMC), stochastic approximation, and expectation-maximization (EM) algorithms, have proven to be very powerful, and often the only available, computational tools for analyzing data with complex structures. They are infeasible for big data, however, because they typically require a large number of iterations and a complete scan of the full dataset in each iteration. Big data thus pose a great challenge to current statistical methodology.

The investigators propose a general principle for developing Monte Carlo algorithms that are feasible for big data and workable on parallel and distributed architectures: use Monte Carlo averages calculated in parallel from subsamples to approximate the quantities that would otherwise need to be calculated from the full dataset. This principle avoids repeated scans of the full data across algorithm iterations while still enabling the algorithm to produce statistically sensible solutions to the problem under consideration. Under this principle, a general algorithm, the subsampling approximation-based parallel stochastic approximation algorithm, is proposed for parameter estimation in big data problems. Unlike existing algorithms, such as the bag of little bootstraps, aggregated estimation equation, and split-and-conquer algorithms, the proposed algorithm works for problems in which the observations are generally dependent. Under the same principle, a subsampling approximation-based parallel Metropolis-Hastings algorithm is proposed for Bayesian analysis of big data, and a subsampling approximation-based parallel Monte Carlo EM algorithm is proposed for parameter estimation in big data problems with missing observations. In addition to the subsampling approximation-based parallel iterative Monte Carlo algorithms, an embarrassingly parallel MCMC algorithm is proposed for Bayesian analysis of big data, based on the popular idea of divide-and-conquer. Various schemes for dataset partitioning and results aggregation are proposed. The validity of the proposed parallel iterative Monte Carlo algorithms, both the subsampling approximation-based and the embarrassingly parallel ones, will be rigorously studied. The proposed algorithms will be applied to spatio-temporal modeling of satellite climate data, genome-wide association studies, and stream data analysis.

The intellectual merit of this project is to propose a general principle for statistical analysis of big data: using Monte Carlo averages of subsamples to approximate the quantities that would otherwise need to be calculated from the full dataset. This principle provides a general strategy for transporting current statistical methodology to the big data paradigm. Under this principle, several subsampling approximation-based parallel iterative Monte Carlo algorithms are proposed. The proposed algorithms address the core problem of big data analysis: how to make a statistically sensible analysis of big data while avoiding repeated scans of the full dataset. This project will have broader impacts because big data are ubiquitous in almost all fields of science and technology. A successful research program in the theory and methods of parallel iterative Monte Carlo computation can be of immense benefit throughout science and technology. The research results will be disseminated to the communities of interest, such as atmospheric science, biomedical science, engineering, and social science, via direct collaboration with researchers in these disciplines, conference presentations, books, and papers published in academic journals. The project will also have a significant impact on education through the direct involvement of graduate students in the project and the incorporation of results into undergraduate and graduate courses. In addition, the package Distributed Iterative Statistical Computing (DISC), to be developed under this project, is designed to provide a platform for Ph.D. students and researchers who, like the investigators, have network-connected computers, so that they can experiment with new ideas for developing efficient iterative Monte Carlo algorithms in parallel or, more exactly, grid computing environments.
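
The subsampling principle stated in the abstract can be illustrated with a toy sketch. This is not the investigators' algorithm, which the abstract does not specify. It is a minimal Robbins-Monro stochastic approximation for the mean of a Gaussian sample, in which each iteration averages score values computed on a few small random subsamples instead of scanning the full dataset. The function names (`subsample_score`, `subsampled_sa`), the 1/t gain sequence, the subsample size, and the use of a thread pool as a stand-in for distributed workers are all assumptions made for illustration. The toy data are independent, whereas the proposed algorithm targets dependent observations.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def subsample_score(data, theta, size, seed):
    """Average Gaussian score (x - theta) over one random subsample.

    Hypothetical helper: stands in for the quantity one worker would compute.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, size)  # draw without replacement
    return sum(x - theta for x in sample) / size


def subsampled_sa(data, n_iter=200, n_workers=4, size=50, seed=0):
    """Robbins-Monro iteration driven by averaged subsample scores.

    Each iteration touches only n_workers * size observations,
    never the full dataset. All parameter values are illustrative.
    """
    master = random.Random(seed)
    theta = 0.0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for t in range(1, n_iter + 1):
            # One independent seed per worker so the subsamples differ.
            seeds = [master.randrange(2**32) for _ in range(n_workers)]
            scores = pool.map(
                lambda s: subsample_score(data, theta, size, s), seeds
            )
            # Monte Carlo average of the subsample scores.
            avg_score = sum(scores) / n_workers
            theta += avg_score / t  # gain sequence a_t = 1/t (assumed)
    return theta


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(3.0, 1.0) for _ in range(10_000)]
    print("subsampled estimate:", subsampled_sa(data))
    print("full-data mean:     ", sum(data) / len(data))
```

With these settings each iteration reads 200 of the 10,000 observations, and the estimate should land close to the full-data mean. In a real distributed setting the thread pool would be replaced by workers that each hold part of the data.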

Project Outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Faming Liang

Networks Involved in Coronary Collateral Formation

Impact factor:
0
Authors:
Jian Zhang; J. Regieli; M. Schipper; M. M. Entius; Faming Liang; J. Koerselman; H. J. Ruven; Yolanda van der Graaf; D. Grobbee; Pieter A. Doevendans
Corresponding author:
Pieter A. Doevendans