CAREER: Statistical Learning with Recursive Partitioning: Algorithms, Accuracy, and Applications
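Basic information
- Award number: 2239448
- Principal investigator:
- Amount: $450,000
- Host institution:
- Host institution country: United States
- Project category: Continuing Grant
- Fiscal year: 2023
- Funding country: United States
- Project period: 2023-06-01 to 2028-05-31
- Project status: Ongoing
- Source:
- Keywords:
Project abstract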
As data-driven technologies continue to be adopted and deployed in high-stakes decision-making environments, the need for fast, interpretable algorithms has never been greater. Decision trees, a hierarchically organized data structure, are one such candidate, and they are increasingly used to build predictive and causal models. This trend is spurred by the appealing connection between decision trees and rule-based decision-making, particularly in clinical, legal, or business contexts: the tree structure mimics the sequential way a human user may think and reason, thereby facilitating human-machine interaction. To make them fast to compute, decision trees are commonly constructed with an algorithm called recursive partitioning, in which the decision nodes of the tree are learned from the data in a greedy, top-down manner. The overarching goal of this project is to develop a precise understanding of the strengths and limitations of decision trees based on recursive partitioning and, in doing so, to gain insight into how to improve their performance in practice. Beyond this research impact, high-school, undergraduate, and graduate research assistants will be vertically integrated into the project and will benefit both academically and professionally. Innovative curricula, workshops, and data and methods competitions involving students, academics, and industry professionals will facilitate outreach and encourage participation from a broad audience.
This proposal aims to provide a comprehensive study of the statistical properties of greedy recursive partitioning algorithms for training decision trees, developed in two fundamental contexts. The first thrust of the project will develop a theoretical framework for the analysis of oblique decision trees, in which the splits at each decision node occur at linear combinations of the covariates, in contrast to conventional axis-aligned splits involving only a single covariate. While oblique trees have garnered significant attention from the computer science and optimization communities since the mid-1980s, the advantages they offer over their axis-aligned counterparts remain only empirically justified, and explanations for their success are largely based on heuristics. To fill this long-standing gap between theory and practice, the PI will investigate how oblique regression trees, constructed by recursively minimizing squared error, can adapt to a rich class of regression models consisting of linear combinations of ridge functions. This provides a quantitative baseline for statisticians to compare and contrast decision trees with other, less interpretable methods that target similar model forms, such as projection pursuit regression and neural networks. Crucially, to address the combinatorial complexity of finding the optimal splitting hyperplane at each decision node, the PI's framework can accommodate many existing computational tools in the literature. A major component of the research is derived from connections between recursive partitioning and sequential greedy approximation algorithms for convex optimization problems (e.g., orthogonal greedy algorithms). The second thrust focuses on the delicate pointwise properties of axis-aligned recursive partitioning, with implications for heterogeneous causal effect estimation, where accurate pointwise estimates over the entire support of the covariates are essential for valid inference (e.g., testing hypotheses and constructing confidence intervals). Motivated by simple settings where decision trees provably fail to achieve optimal performance, the PI will investigate how the signal-to-noise ratio affects the quality of pointwise estimation. While the focus is on causal effect estimation directly using decision trees, the PI will also investigate implications for conditional quantile regression and for multi-step semi-parametric settings, in which preliminary unknown functions (e.g., propensity scores) are estimated with machine learning tools; both require estimators with high pointwise accuracy.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
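To make the abstract's description of recursive partitioning concrete, the following is a minimal sketch of a greedy, top-down regression tree with axis-aligned splits chosen to minimize squared error. It is an illustration only, not the PI's method: the names `Node`, `sse`, `best_split`, `grow`, and `predict`, the stopping rules (`max_depth`, `min_samples`), and the toy data are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    prediction: float                  # mean response of the samples in this node
    feature: Optional[int] = None      # splitting covariate index (None for a leaf)
    threshold: Optional[float] = None  # split point on that covariate
    left: Optional["Node"] = None      # samples with x[feature] <= threshold
    right: Optional["Node"] = None     # samples with x[feature] > threshold


def sse(ys: List[float]) -> float:
    """Sum of squared errors of ys around their mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)


def best_split(X: List[List[float]], y: List[float]) -> Optional[Tuple[int, float]]:
    """Exhaustively search single-covariate splits; return the one that most
    reduces total squared error, or None if no split improves on the parent."""
    best, best_err = None, sse(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold: midpoint of adjacent values
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            err = sse(left) + sse(right)
            if err < best_err - 1e-12:
                best, best_err = (j, t), err
    return best


def grow(X, y, depth=0, max_depth=3, min_samples=2) -> Node:
    """Greedy top-down construction: split this node, then recurse on each child."""
    node = Node(prediction=sum(y) / len(y))
    if depth >= max_depth or len(y) < min_samples:
        return node
    split = best_split(X, y)
    if split is None:
        return node
    j, t = split
    left = [i for i, row in enumerate(X) if row[j] <= t]
    right = [i for i, row in enumerate(X) if row[j] > t]
    node.feature, node.threshold = j, t
    node.left = grow([X[i] for i in left], [y[i] for i in left],
                     depth + 1, max_depth, min_samples)
    node.right = grow([X[i] for i in right], [y[i] for i in right],
                      depth + 1, max_depth, min_samples)
    return node


def predict(node: Node, x: List[float]) -> float:
    """Follow the sequence of decisions from the root down to a leaf."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Toy data (hypothetical): the response depends only on the first covariate.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [7.0, 2.0], [8.0, 3.0], [9.0, 0.0]]
y = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
tree = grow(X, y)
print(tree.feature, tree.threshold)  # 0 5.0
print(predict(tree, [2.5, 0.0]))     # 1.0
print(predict(tree, [8.5, 9.0]))     # 3.0
```

In this sketch each decision tests a single covariate, `x[j] <= t`. An oblique tree, as described in the first thrust, would instead test a linear combination of the covariates against a threshold; searching over the direction of that combination is the source of the combinatorial complexity the abstract mentions, and the sketch does not attempt it.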
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Jason Klusowski
Deep Learning and Random Forests for High-Dimensional Regression
- Award number: 2054808
- Fiscal year: 2020
- Amount: $450,000
- Project category: Continuing Grant
Deep Learning and Random Forests for High-Dimensional Regression
- Award number: 1915932
- Fiscal year: 2019
- Amount: $450,000
- Project category: Continuing Grant
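Similar National Natural Science Foundation of China (NSFC) grants
Intelligent fault diagnosis of key components in electric vehicle drivetrain systems based on active statistical transfer learning
- Award number: 52305109
- Approval year: 2023
- Amount: CNY 300,000
- Project category: Young Scientists Fund
Deep statistical learning: theoretical foundations and model design
- Award number: 62376028
- Approval year: 2023
- Amount: CNY 500,000
- Project category: General Program
Zero-empirical-risk memorization learning under the complete statistical learning principle
- Award number: 62366035
- Approval year: 2023
- Amount: CNY 310,000
- Project category: Regional Science Fund
Privacy-preserving statistical analysis and machine learning methods for healthcare data
- Award number: 62372425
- Approval year: 2023
- Amount: CNY 500,000
- Project category: General Program
Evolutionary statistical learning methods with interpretable semantic coupling
- Award number:
- Approval year: 2022
- Amount: CNY 530,000
- Project category: General Program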
Similar overseas grants
CAREER: New Frameworks for Ethical Statistical Learning: Algorithmic Fairness and Privacy
- Award number: 2340241
- Fiscal year: 2024
- Amount: $450,000
- Project category: Continuing Grant
Identifying and Addressing the Effects of Social Media Use on Young Adults' E-Cigarette Use: A Solutions-Oriented Approach
- Award number: 10525098
- Fiscal year: 2023
- Amount: $450,000
- Project category:
Characterizing the genetic etiology of delayed puberty with integrative genomic techniques
- Award number: 10663605
- Fiscal year: 2023
- Amount: $450,000
- Project category:
Toward measures and behavioral trials for effective online AUD recovery support
- Award number: 10643056
- Fiscal year: 2023
- Amount: $450,000
- Project category:
PUFA metabolism for prevention and treatment of TMD pain: an interdisciplinary, translational approach.
- Award number: 10820840
- Fiscal year: 2023
- Amount: $450,000
- Project category: