Collaborative Research: Efficient Parallel Iterative Monte Carlo Methods for Statistical Analysis of Big Data
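Basic information
- Award number: 1316922
- Principal investigator:
- Amount: $81,600
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2013
- Funding country: United States
- Period: 2013-08-01 to 2016-07-31
- Project status: Completed
- Source:
- Keywords: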
Project abstract
The integration of computer technology into science and daily life has enabled the collection of massive volumes of data. Analyzing these data may require parallel and distributed architectures. While such architectures offer new capabilities for storing and manipulating big data, it is unclear, from an inferential point of view, how current statistical methodology can be carried over to the big data paradigm. Growing data size also typically comes with growing complexity of the data structures and of the models needed to account for them. Iterative Monte Carlo algorithms, such as Markov chain Monte Carlo (MCMC), stochastic approximation, and expectation-maximization (EM) algorithms, have proven to be very powerful, and often the only, computational tools for analyzing data with complex structures. They are nevertheless infeasible for big data, because they typically require a large number of iterations and a complete scan of the full dataset in each iteration. Big data thus pose a great challenge to current statistical methodology. The investigators propose a general principle for developing Monte Carlo algorithms that are feasible for big data and workable on parallel and distributed architectures: use Monte Carlo averages calculated in parallel from subsamples to approximate the quantities that would otherwise have to be calculated from the full dataset. This principle avoids repeated scans of the full data across iterations, while still enabling the algorithm to produce statistically sensible solutions to the problem under consideration. Under this principle, a general algorithm, the subsampling approximation-based parallel stochastic approximation algorithm, is proposed for parameter estimation in big data problems.
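The subsampling principle described above can be illustrated with a minimal sketch. The abstract does not specify the algorithm, so everything here is an assumption made for illustration: the function names `subsample_score` and `subsampled_sa` are hypothetical, the model is a unit-variance Gaussian with unknown mean, the gain sequence is 1/t, and the observations are treated as independent (the proposed algorithm targets dependent data, which this toy does not cover). Each iteration averages score estimates computed by several workers on small random subsamples instead of scanning the full dataset.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat


def subsample_score(data, theta, size, seed):
    """Average Gaussian score (x - theta) over one random subsample."""
    rng = random.Random(seed)          # private generator per task, for reproducibility
    sample = rng.sample(data, size)    # draw `size` observations without replacement
    return sum(x - theta for x in sample) / size


def subsampled_sa(data, n_iter=200, n_workers=4, size=50, seed=0):
    """Robbins-Monro iteration driven by an average of subsample scores.

    Hypothetical illustration: no iteration touches more than
    n_workers * size observations.
    """
    theta = 0.0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for t in range(1, n_iter + 1):
            seeds = [seed + t * n_workers + k for k in range(n_workers)]
            scores = list(pool.map(subsample_score, repeat(data),
                                   repeat(theta), repeat(size), seeds))
            h = sum(scores) / n_workers   # Monte Carlo average across workers
            theta += h / t                # decreasing gain a_t = 1 / t
    return theta


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(3.0, 1.0) for _ in range(10000)]
    print(subsampled_sa(data), sum(data) / len(data))
```

With the 1/t gain, the estimate is simply the running average of the subsample means, so it should settle near the full-data mean. Threads are used only to keep the sketch short; on standard CPython builds a process pool or separate machines would be needed for an actual speedup.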
Unlike existing algorithms, such as the bag of little bootstraps, aggregated estimation equation, and split-and-conquer algorithms, the proposed algorithm works for problems in which the observations are generally dependent. Under the same principle, a subsampling approximation-based parallel Metropolis-Hastings algorithm is proposed for Bayesian analysis of big data, and a subsampling approximation-based parallel Monte Carlo EM algorithm is proposed for parameter estimation in big data problems with missing observations. In addition to the subsampling approximation-based parallel iterative Monte Carlo algorithms, an embarrassingly parallel MCMC algorithm is proposed for Bayesian analysis of big data, based on the popular idea of divide-and-conquer. Various schemes for partitioning the dataset and aggregating the results are proposed. The validity of the proposed parallel iterative Monte Carlo algorithms, both the subsampling approximation-based and the embarrassingly parallel ones, will be rigorously studied. The proposed algorithms will be applied to spatio-temporal modeling of satellite climate data, genome-wide association studies, and stream data analysis.

The intellectual merit of this project lies in proposing a general principle for statistical analysis of big data: using Monte Carlo averages of subsamples to approximate the quantities that would otherwise have to be calculated from the full dataset. This principle provides a general strategy for carrying current statistical methodology over to the big data paradigm. Under this principle, several subsampling approximation-based parallel iterative Monte Carlo algorithms are proposed. The proposed algorithms address the core problem of big data analysis: how to make a statistically sensible analysis of big data while avoiding repeated scans of the full dataset. This project will have broader impacts because big data are ubiquitous in almost all fields of science and technology.
A successful research program in the theory and methods of parallel iterative Monte Carlo computation can bring immense benefit across science and technology. The research results will be disseminated to the communities of interest, such as atmospheric science, biomedical science, engineering, and social science, via direct collaboration with researchers in these disciplines, conference presentations, books, and papers published in academic journals. The project will also have a significant impact on education through the direct involvement of graduate students and the incorporation of results into undergraduate and graduate courses. In addition, the package Distributed Iterative Statistical Computing (DISC), to be developed under this project, is designed to provide a platform on which Ph.D. students and researchers who, like the investigators, have network-connected computers can experiment with new ideas for developing efficient iterative Monte Carlo algorithms in parallel or, more precisely, grid computing environments.
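The divide-and-conquer idea behind the embarrassingly parallel MCMC algorithm mentioned in the abstract can be sketched as follows. The abstract does not say which partition or aggregation scheme is used, so this is only an assumed illustration: the names `log_subposterior`, `metropolis`, and `parallel_mcmc` are hypothetical, the model is a unit-variance Gaussian mean with a flat prior, shards are equal-sized, and draws are combined by equal-weight averaging across shards (the simple rule used in consensus-style Monte Carlo, which is exact only in this Gaussian setting).

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat


def log_subposterior(theta, shard):
    """Log posterior of the mean on one shard (flat prior, unit variance)."""
    return -0.5 * sum((x - theta) ** 2 for x in shard)


def metropolis(shard, n_iter, burn_in, step, seed):
    """Random-walk Metropolis chain that only ever sees its own shard."""
    rng = random.Random(seed)
    theta = shard[0]                       # start at an observed value
    logp = log_subposterior(theta, shard)
    draws = []
    for i in range(n_iter):
        prop = theta + rng.gauss(0.0, step)
        logp_prop = log_subposterior(prop, shard)
        # Accept with probability min(1, exp(logp_prop - logp)).
        if rng.random() < math.exp(min(0.0, logp_prop - logp)):
            theta, logp = prop, logp_prop
        if i >= burn_in:
            draws.append(theta)
    return draws


def parallel_mcmc(data, n_shards=4, n_iter=1500, burn_in=500, seed=0):
    """Run one chain per shard with no communication, then average draws."""
    shards = [data[k::n_shards] for k in range(n_shards)]   # equal-sized partition
    step = 2.4 / math.sqrt(len(shards[0]))                  # scaled to subposterior sd
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        chains = list(pool.map(metropolis, shards, repeat(n_iter),
                               repeat(burn_in), repeat(step),
                               range(seed, seed + n_shards)))
    # Equal-weight aggregation: the i-th combined draw averages the i-th draws.
    return [sum(d) / n_shards for d in zip(*chains)]


if __name__ == "__main__":
    rng = random.Random(2)
    data = [rng.gauss(3.0, 1.0) for _ in range(1000)]
    draws = parallel_mcmc(data)
    print(sum(draws) / len(draws), sum(data) / len(data))
```

The chains never exchange information until the final aggregation step, which is what makes the scheme embarrassingly parallel. For non-Gaussian posteriors or unequal shards, equal-weight averaging is only an approximation, and other aggregation rules would be needed.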
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Chuanhai Liu
Alternating Subspace-Spanning Resampling to Accelerate Markov Chain Monte Carlo Simulation
- DOI: 10.1198/016214503388619148
- Year published: 2003
- Journal:
- Impact factor: 0
- Authors: Chuanhai Liu
- Corresponding author: Chuanhai Liu
Reweighted Anderson-Darling Tests of Goodness-of-Fit
- DOI:
- Year published: 2022
- Journal:
- Impact factor: 0
- Authors: Chuanhai Liu
- Corresponding author: Chuanhai Liu
Not Asked and Not Answered: Multiple Imputation for Multiple Surveys: Rejoinder
- DOI:
- Year published: 1998
- Journal:
- Impact factor: 0
- Authors: A. Gelman; Gary King; Chuanhai Liu
- Corresponding author: Chuanhai Liu
Settle the unsettling: an inferential models perspective
- DOI:
- Year published: 2021
- Journal:
- Impact factor: 0
- Authors: Chuanhai Liu; Ryan Martin
- Corresponding author: Ryan Martin
Bartlett's decomposition of the posterior distribution of the covariance for normal monotone ignorable missing data
- DOI:
- Year published: 1993
- Journal:
- Impact factor: 0
- Authors: Chuanhai Liu
- Corresponding author: Chuanhai Liu
Other grants by Chuanhai Liu
Collaborative Research: Prior-free probabilistic inferential methods for "large-p-small-n" linear regression problems
- Award number: 1208841
- Fiscal year: 2012
- Award amount: $81,600
- Project type: Continuing Grant
Large-Scale Multinomial Inference and Its Applications in Genome-Wide Association Studies
- Award number: 1007678
- Fiscal year: 2010
- Award amount: $81,600
- Project type: Continuing Grant
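Similar grants from the National Natural Science Foundation of China
Research on efficient inference methods for spatio-temporal scene graphs based on tensor computation under edge intelligence
- Award number: 62302131
- Year approved: 2023
- Award amount: CNY 300,000
- Project type: Young Scientists Fund
Research on efficient optoelectronic heterojunction integration and energy regulation mechanisms based on microstructured optical fibers
- Award number: 62305029
- Year approved: 2023
- Award amount: CNY 300,000
- Project type: Young Scientists Fund
Research on efficient surveillance video coding and analysis methods based on semantic decoupling and prompting
- Award number: 62302246
- Year approved: 2023
- Award amount: CNY 300,000
- Project type: Young Scientists Fund
Research on efficient similarity retrieval methods for large-scale high-dimensional data
- Award number: 62302110
- Year approved: 2023
- Award amount: CNY 300,000
- Project type: Young Scientists Fund
Research on efficient integrated pattern synthesis methods for phased array-radome systems based on convex optimization
- Award number: 62301379
- Year approved: 2023
- Award amount: CNY 300,000
- Project type: Young Scientists Fund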
Similar overseas grants
Collaborative Research: Beyond the Single-Atom Paradigm: A Priori Design of Dual-Atom Alloy Active Sites for Efficient and Selective Chemical Conversions
- Award number: 2334970
- Fiscal year: 2024
- Award amount: $81,600
- Project type: Standard Grant
Collaborative Research: SHF: Small: Efficient and Scalable Privacy-Preserving Neural Network Inference based on Ciphertext-Ciphertext Fully Homomorphic Encryption
- Award number: 2412357
- Fiscal year: 2024
- Award amount: $81,600
- Project type: Standard Grant
Collaborative Research: Reversible Computing and Reservoir Computing with Magnetic Skyrmions for Energy-Efficient Boolean Logic and Artificial Intelligence Hardware
- Award number: 2343606
- Fiscal year: 2024
- Award amount: $81,600
- Project type: Standard Grant
Collaborative Research: Beyond the Single-Atom Paradigm: A Priori Design of Dual-Atom Alloy Active Sites for Efficient and Selective Chemical Conversions
- Award number: 2334969
- Fiscal year: 2024
- Award amount: $81,600
- Project type: Standard Grant
Collaborative Research: Integrated Materials-Manufacturing-Controls Framework for Efficient and Resilient Manufacturing Systems
- Award number: 2346650
- Fiscal year: 2024
- Award amount: $81,600
- Project type: Standard Grant