III: Small: RUI: Scalable and Iterative Statistical Testing of Multiple Hypotheses on Massive Datasets

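Basic Information

Award number:
2006765
Principal investigator:
Matteo Riondato
Amount:
$373,400
Host institution:
Amherst College
Host institution country:
United States
Project type:
Standard Grant
Fiscal year:
2020
Funding country:
United States
Start and end dates:
2020-10-01 to 2024-09-30
Project status:
Completed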

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2006765&HistoricalAwards=false
Keywords:
III Small RUI Scalable Iterative

Project Abstract

Modern scientific practice is rooted in the statistical testing of hypotheses on data. To limit the risk of false discoveries, the tests must offer strict statistical guarantees. This task is very challenging due to the sheer amount of rich data available today, and to the ever-increasing number of complex hypotheses that scientists want to test on the same data. For science to advance, and thereby advance society and human well-being, it is of foremost importance that scientists are given tools that overcome these challenges. This project will design novel computational methods for statistical hypothesis testing that tackle all the above challenges by combining modern statistical results with recent approaches from knowledge discovery and data mining, a field of computer science dealing with the efficient analysis of data. As part of the educational activities, this project will develop materials for college-level courses to ensure that the next generation of scientists and computer scientists possess the intellectual and practical knowledge needed for statistically sound analysis of data and testing of hypotheses, using and extending the methods developed in the project. A diverse cohort of undergraduate students will be involved in the research and educational components of the project.
The team of researchers in this project will design and mathematically analyze algorithms to make statistical hypothesis testing iterative and scalable along multiple dimensions. Many existing statistical procedures are already computationally expensive when testing a single hypothesis on moderate-size datasets, and become even more inefficient as the amount of data or the number of hypotheses grows. Along the dimension of data complexity, available tests often lack scalability because they are limited to simple types of data (e.g., binary tables), while fewer methods are available for rich data such as attributed graphs or panel time-series. The lack of scalable methods may be due in part to the requirement that hypothesis tests satisfy stringent statistical guarantees (e.g., control of the Family-Wise Error Rate (FWER) and of the False Discovery Rate (FDR)) to ensure that the subsequent inference is sound. Additionally, the iterative aspect of the practice of data analysis has been ignored for statistical tests, but considering it is crucial to ensure that these guarantees are satisfied. This project will develop algorithms for the scalable and iterative statistical testing of multiple complex hypotheses on massive rich datasets, while imposing only weak assumptions on the data generation process, and controlling the FWER and the FDR. These results will be achieved by bringing together two areas of computer science research that have, until now, had only very limited points of contact: statistical learning theory and data mining. The novel methods developed in this project will use concepts from the former, such as (local) Rademacher averages, covering numbers, and pseudodimension, to exploit the structure of the class of hypotheses being tested and achieve better sample complexity bounds, which translate to higher statistical power and improved control of the FWER/FDR, even in an iterative data analysis setting. These concepts will be adapted to statistical hypothesis testing and strengthened to fully exploit their practical usefulness, especially on rich datasets and in the presence of dependencies between the data points. The project team will use techniques from the knowledge discovery task of pattern mining to efficiently explore the space of hypotheses and filter out those that are definitely not significant. To reach this goal, the project team will develop novel bounds for the p-value functions of different tests and adapt these techniques to rich datasets such as attributed graphs.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
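For readers unfamiliar with the two error rates named in the abstract, the following is a minimal sketch of two classical textbook corrections: Bonferroni (which controls the FWER) and the Benjamini-Hochberg step-up procedure (which controls the FDR for independent p-values). These are standard baselines, not the methods developed in the project; the function names and the example p-values are made up for illustration.

```python
from typing import List


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject hypothesis i when p_i <= alpha / m (controls the FWER at alpha)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Step-up procedure (controls the FDR at alpha for independent p-values)."""
    m = len(p_values)
    # Indices of the hypotheses, sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(bonferroni(example_p))
print(benjamini_hochberg(example_p))
```

On these example values Bonferroni rejects only the first hypothesis, while Benjamini-Hochberg rejects the first two, which reflects the usual trade-off: FDR control is a weaker guarantee than FWER control and typically allows more rejections.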

Project Outputs

Journal papers (13)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Sharp uniform convergence bounds through empirical centralization

DOI:
Publication date:
2020
Journal:
Impact factor:
0
Authors:
Cyrus Cousins;Matteo Riondato
Corresponding authors:
Cyrus Cousins;Matteo Riondato

Alice and the Caterpillar: A more descriptive null model for assessing data mining results

DOI:
10.1109/icdm54844.2022.00052
Publication date:
2022-11
Journal:
Knowledge and Information Systems
Impact factor:
2.7
Authors:
Giulia Preti;G. D. F. Morales;Matteo Riondato
Corresponding authors:
Giulia Preti;G. D. F. Morales;Matteo Riondato

MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

DOI:
10.1145/3532187
Publication date:
2022
Journal:
ACM Transactions on Knowledge Discovery from Data
Impact factor:
3.6
Authors:
Pellegrina, Leonardo;Cousins, Cyrus;Vandin, Fabio;Riondato, Matteo
Corresponding author:
Riondato, Matteo

Bavarian: Betweenness Centrality Approximation with Variance-Aware Rademacher Averages

DOI:
10.1145/3447548.3467354
Publication date:
2021-08
Journal:
Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
Impact factor:
0
Authors:
Cyrus Cousins;Chloe Wohlgemuth;Matteo Riondato
Corresponding authors:
Cyrus Cousins;Chloe Wohlgemuth;Matteo Riondato

Statistically-sound Knowledge Discovery from Data
