III: Small: RUI: Scalable and Iterative Statistical Testing of Multiple Hypotheses on Massive Datasets

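Basic information

  • Award number:
    2006765
  • Principal investigator:
  • Amount:
    $373,400
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2020
  • Funding country:
    United States
  • Start and end dates:
    2020-10-01 to 2024-09-30
  • Project status:
    Completed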

Project abstract

Modern scientific practice is rooted in statistical testing of hypotheses on data. To limit the risk of false discoveries, the tests must offer strict statistical guarantees. The task is very challenging due to the sheer amount of rich data available today, and to the ever-increasing number of complex hypotheses that scientists want to test on the same data. In order for science to advance, and therefore advance society and human well-being, it is of the foremost importance that scientists are given tools that overcome these challenges. This project will design novel computational methods for statistical hypothesis testing that tackle all the above challenges by combining modern statistical results with recent approaches from the area of knowledge discovery and data mining, a field of computer science dealing with the efficient analysis of data. As part of the educational activities, this project will develop materials for college-level courses so that the next generation of scientists and computer scientists possesses the intellectual and practical knowledge needed for statistically sound analysis of data and testing of hypotheses, by using and extending the methods developed in the project. A diverse cohort of undergraduate students will be involved in the research and educational components of the project.

The team of researchers in this project will design and mathematically analyze algorithms to make statistical hypothesis testing iterative and scalable along multiple dimensions. Many existing statistical procedures are already computationally expensive when testing a single hypothesis on moderate-size datasets, and become even more inefficient as the amount of data or the number of hypotheses grows. Along the dimension of data complexity, available tests often lack scalability because they are limited to simple types of data (e.g., binary tables), while fewer methods are available for rich data such as attributed graphs or panel time-series. The lack of scalable methods may be due in part to the requirement that hypothesis tests satisfy stringent statistical guarantees (e.g., control of the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR)) to ensure that the subsequent inference is sound. Additionally, the iterative aspect of the practice of data analysis has been ignored for statistical tests, but considering it is crucial in order to ensure that these guarantees are satisfied. This project will develop algorithms for the scalable and iterative statistical testing of multiple complex hypotheses on massive rich datasets, while imposing only weak assumptions on the data generation process, and controlling the FWER and the FDR. These results will be achieved by bringing together two areas of computer science research that had, until now, only very limited points of contact: statistical learning theory and data mining. The novel methods developed in this project will use concepts from the former, such as (local) Rademacher averages, covering numbers, and pseudodimension, to exploit the structure of the class of hypotheses being tested and achieve better sample complexity bounds, which translate to higher statistical power and improved control of the FWER/FDR, even in an iterative data analysis setting. These concepts will be adapted to statistical hypothesis testing and strengthened to fully exploit their practical usefulness, especially on rich datasets and in the presence of dependencies between the data points. The project team will use techniques from the knowledge discovery task of pattern mining to efficiently explore the space of hypotheses and filter out those that are definitively not significant. To reach this goal, the project team will develop novel bounds for the p-value functions of different tests and adapt these techniques to rich datasets such as attributed graphs.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal articles (13)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Sharp uniform convergence bounds through empirical centralization
  • DOI:
  • Publication date:
    2020
  • Journal:
  • Impact factor:
    0
  • Authors:
    Cyrus Cousins;Matteo Riondato
  • Corresponding authors:
    Cyrus Cousins;Matteo Riondato
Alice and the Caterpillar: A more descriptive null model for assessing data mining results
  • DOI:
    10.1109/icdm54844.2022.00052
  • Publication date:
    2022-11
  • Journal:
  • Impact factor:
    2.7
  • Authors:
    Giulia Preti;G. D. F. Morales;Matteo Riondato
  • Corresponding authors:
    Giulia Preti;G. D. F. Morales;Matteo Riondato
MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining
Bavarian: Betweenness Centrality Approximation with Variance-Aware Rademacher Averages
Statistically-sound Knowledge Discovery from Data

Other publications by Matteo Riondato

The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling
Sampling-Based Data Mining Algorithms: Modern Techniques and Case Studies
  • DOI:
  • Publication date:
    2014
  • Journal:
  • Impact factor:
    0
  • Authors:
    Matteo Riondato
  • Corresponding authors:
    Matteo Riondato
MiSoSouP
Sharpe Ratio: Estimation, Confidence Intervals, and Hypothesis Testing
  • DOI:
  • Publication date:
    2018
  • Journal:
  • Impact factor:
    0
  • Authors:
    Matteo Riondato
  • Corresponding authors:
    Matteo Riondato
Statistically-Sound Knowledge Discovery from Data: Challenges and Directions

Other grants held by Matteo Riondato

CAREER: Statistically-Sound Knowledge Discovery from Data
  • Award number:
    2238693
  • Fiscal year:
    2023
  • Funding amount:
    $373,400
  • Project type:
    Continuing Grant
NSF Student Travel Grant for 2019 SIAM International Conference on Data Mining (SDM)
  • Award number:
    1918446
  • Fiscal year:
    2019
  • Funding amount:
    $373,400
  • Project type:
    Standard Grant

Similar National Natural Science Foundation of China (NSFC) grants

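Mechanistic study of huperzine A-mediated microglial polarization phenotypes against ischemic stroke at single-cell resolution
  • Award number:
    82304883
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund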
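Role and mechanism of small cysteine-free proteins in regulating the insecticidal activity of biocontrol fungi
  • Award number:
    32372613
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project type:
    General Program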
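Role and mechanism of theranostic PS-Hc@MB combined with training in mediating rehabilitation from cerebral small vessel disease
  • Award number:
    82372561
  • Approval year:
    2023
  • Funding amount:
    CNY 490,000
  • Project type:
    General Program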
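Mechanism by which the MECOM/HBB pathway in non-small cell lung cancer mediates abnormal heme metabolism and inhibits ferroptosis of tumor-initiating cells
  • Award number:
    82373082
  • Approval year:
    2023
  • Funding amount:
    CNY 490,000
  • Project type:
    General Program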
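Role and mechanism of FATP2/HILPDA/SLC7A11 axis-mediated lipid metabolic reprogramming of tumor-associated neutrophils in affecting radiotherapy immunity in non-small cell lung cancer
  • Award number:
    82373304
  • Approval year:
    2023
  • Funding amount:
    CNY 490,000
  • Project type:
    General Program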

Similar overseas grants

III: Small: RUI: Designing Structure-Phenotype Query-Retrieval and Analysis Systems for Microscopy-Based Whole Organism Studies
  • Award number:
    2401096
  • Fiscal year:
    2023
  • Funding amount:
    $373,400
  • Project type:
    Standard Grant
III: Small: RUI: A Fairness Auditing Framework for Predictive Mobility Models
  • Award number:
    2304213
  • Fiscal year:
    2023
  • Funding amount:
    $373,400
  • Project type:
    Standard Grant
III: Small: RUI: Finding Best Representative Phylogenetic Tree Reconciliations
  • Award number:
    2231150
  • Fiscal year:
    2022
  • Funding amount:
    $373,400
  • Project type:
    Standard Grant
III: Small: RUI: Collaborative Research: Modeling Pre- and Post- Conditions for Understanding Events
  • Award number:
    2007128
  • Fiscal year:
    2020
  • Funding amount:
    $373,400
  • Project type:
    Interagency Agreement
III: Small: RUI: Investigating Fragmentation Rules and Improving Metabolite Identification Using Graph Grammar and Statistical Methods
  • Award number:
    2053286
  • Fiscal year:
    2020
  • Funding amount:
    $373,400
  • Project type:
    Standard Grant