III: Small: RUI: Scalable and Iterative Statistical Testing of Multiple Hypotheses on Massive Datasets

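Basic information

  • Award number:
    2006765
  • Principal investigator:
  • Amount:
    $373,400
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2020
  • Funding country:
    United States
  • Start and end dates:
    2020-10-01 to 2024-09-30
  • Project status:
    Completed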

Project abstract

Modern scientific practice is rooted in statistical testing of hypotheses on data. To limit the risk of false discoveries, the tests must offer strict statistical guarantees. The task is very challenging due to the sheer amount of rich data available today, and to the ever-increasing number of complex hypotheses that scientists want to test on the same data. For science to advance, and with it society and human well-being, it is of the utmost importance that scientists are given tools that overcome these challenges. This project will design novel computational methods for statistical hypothesis testing that tackle all the above challenges by combining modern statistical results with recent approaches from the area of knowledge discovery and data mining, a field of computer science dealing with the efficient analysis of data. As part of the educational activities, this project will develop materials for college-level courses to ensure that the next generation of scientists and computer scientists possess the intellectual and practical knowledge needed for a statistically sound analysis of data and testing of hypotheses, by using and extending the methods developed in the project. A diverse cohort of undergraduate students will be involved in the research and educational components of the project.

The team of researchers in this project will design and mathematically analyze algorithms to make statistical hypothesis testing iterative and scalable along multiple dimensions. Many existing statistical procedures are already computationally expensive when testing a single hypothesis on moderate-size datasets, and become even more inefficient as the amount of data or the number of hypotheses grows. Along the dimension of data complexity, available tests often lack scalability because they are limited to simple types of data (e.g., binary tables), while fewer methods are available for rich data such as attributed graphs or panel time-series. The lack of scalable methods may be due in part to the requirement that hypothesis tests satisfy stringent statistical guarantees (e.g., control of the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR)) to ensure that the subsequent inference is sound. Additionally, the iterative aspect of the practice of data analysis has been ignored for statistical tests, but considering it is crucial to ensure that these guarantees are satisfied. This project will develop algorithms for the scalable and iterative statistical testing of multiple complex hypotheses on massive rich datasets, while imposing only weak assumptions on the data generation process, and controlling the FWER and the FDR. These results will be achieved by bringing together two areas of computer science research that had, until now, only very limited points of contact: statistical learning theory and data mining. The novel methods developed in this project will use concepts from the former, such as (local) Rademacher averages, covering numbers, and pseudodimension, to exploit the structure of the class of hypotheses being tested and achieve better sample complexity bounds, which translate to higher statistical power and improved control of the FWER/FDR, even in an iterative data analysis setting. These concepts will be adapted to statistical hypothesis testing and strengthened to fully exploit their practical usefulness, especially on rich datasets and in the presence of dependencies between the data points. The project team will use techniques from the knowledge discovery task of pattern mining to efficiently explore the space of hypotheses and filter out those that are definitively not significant. To reach this goal, the project team will develop novel bounds for the p-value functions of different tests and adapt these techniques to rich datasets such as attributed graphs.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
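
The abstract names FWER and FDR control as the guarantees a multiple-testing procedure must provide, without stating which procedures the project uses. As a point of reference only, the following minimal Python sketch shows the two textbook baselines: the Bonferroni correction (FWER) and the Benjamini-Hochberg step-up procedure (FDR, under independence or positive dependence of the p-values). The function names and the example p-values are hypothetical and are not taken from the project.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject H_i when p_i <= alpha / m; this controls the FWER at level alpha."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up procedure.

    Controls the FDR at level alpha when the p-values are independent
    (or positively dependent); this is an assumption of the baseline.
    """
    m = len(p_values)
    # Indices of the hypotheses, sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    # Reject the cutoff hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print("Bonferroni:", bonferroni(example_p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(example_p_values))
```

Both baselines look only at the p-values and ignore any structure in the class of hypotheses. That is the gap the abstract says the project aims to close, using tools such as Rademacher averages.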

Project outcomes

Journal articles (13)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Sharp uniform convergence bounds through empirical centralization
  • DOI:
  • Publication date:
    2020
  • Journal:
  • Impact factor:
    0
  • Authors:
    Cyrus Cousins;Matteo Riondato
  • Corresponding authors:
    Cyrus Cousins;Matteo Riondato
Alice and the Caterpillar: A more descriptive null model for assessing data mining results
  • DOI:
    10.1109/icdm54844.2022.00052
  • Publication date:
    2022-11
  • Journal:
  • Impact factor:
    2.7
  • Authors:
    Giulia Preti;G. D. F. Morales;Matteo Riondato
  • Corresponding authors:
    Giulia Preti;G. D. F. Morales;Matteo Riondato
MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining
Bavarian: Betweenness Centrality Approximation with Variance-Aware Rademacher Averages
Statistically-sound Knowledge Discovery from Data

Other publications by Matteo Riondato

The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling
Sampling-Based Data Mining Algorithms: Modern Techniques and Case Studies
  • DOI:
  • Publication date:
    2014
  • Journal:
  • Impact factor:
    0
  • Authors:
    Matteo Riondato
  • Corresponding authors:
    Matteo Riondato
MiSoSouP
Sharpe Ratio: Estimation, Confidence Intervals, and Hypothesis Testing
  • DOI:
  • Publication date:
    2018
  • Journal:
  • Impact factor:
    0
  • Authors:
    Matteo Riondato
  • Corresponding authors:
    Matteo Riondato
Statistically-Sound Knowledge Discovery from Data: Challenges and Directions

Other grants awarded to Matteo Riondato

CAREER: Statistically-Sound Knowledge Discovery from Data
  • Award number:
    2238693
  • Fiscal year:
    2023
  • Funding amount:
    $373,400
  • Project type:
    Continuing Grant
NSF Student Travel Grant for 2019 SIAM International Conference on Data Mining (SDM)
  • Award number:
    1918446
  • Fiscal year:
    2019
  • Funding amount:
    $373,400
  • Project type:
    Standard Grant

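Similar grants from the National Natural Science Foundation of China

Robust stability analysis of aggregate computing in the Internet of Things based on small-gain theory
  • Award number:
    62303112
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Data-driven stochastic small-signal stability analysis of power systems with a high share of renewable energy, based on the robust generalized short-circuit ratio
  • Award number:
  • Approval year:
    2020
  • Funding amount:
    CNY 240,000
  • Project type:
    Young Scientists Fund
Investigating the mechanism by which ibrutinib downregulates MDSCs to reverse resistance to PD-1 antibody therapy in advanced non-small cell lung cancer
  • Award number:
    81702268
  • Approval year:
    2017
  • Funding amount:
    CNY 200,000
  • Project type:
    Young Scientists Fund
Robust H∞ control of two-dimensional discrete stochastic systems based on wavelet-Kalman filtering
  • Award number:
    61603034
  • Approval year:
    2016
  • Funding amount:
    CNY 200,000
  • Project type:
    Young Scientists Fund
Theory and methods for distributed and robust transmission in dense wireless networks
  • Award number:
    61571107
  • Approval year:
    2015
  • Funding amount:
    CNY 570,000
  • Project type:
    General Program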

Similar overseas grants

III: Small: RUI: Designing Structure-Phenotype Query-Retrieval and Analysis Systems for Microscopy-Based Whole Organism Studies
  • Award number:
    2401096
  • Fiscal year:
    2023
  • Funding amount:
    $373,400
  • Project type:
    Standard Grant
III: Small: RUI: A Fairness Auditing Framework for Predictive Mobility Models
  • Award number:
    2304213
  • Fiscal year:
    2023
  • Funding amount:
    $373,400
  • Project type:
    Standard Grant
III: Small: RUI: Finding Best Representative Phylogenetic Tree Reconciliations
  • Award number:
    2231150
  • Fiscal year:
    2022
  • Funding amount:
    $373,400
  • Project type:
    Standard Grant
III: Small: RUI: Collaborative Research: Modeling Pre- and Post- Conditions for Understanding Events
  • Award number:
    2007128
  • Fiscal year:
    2020
  • Funding amount:
    $373,400
  • Project type:
    Interagency Agreement
III: Small: RUI: Investigating Fragmentation Rules and Improving Metabolite Identification Using Graph Grammar and Statistical Methods
  • Award number:
    2053286
  • Fiscal year:
    2020
  • Funding amount:
    $373,400
  • Project type:
    Standard Grant