CAREER: Statistical Learning with Recursive Partitioning: Algorithms, Accuracy, and Applications

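Basic Information

Award number:
2239448
Principal investigator:
Jason Klusowski
Amount:
$450,000
Institution:
Princeton University
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2023
Funding country:
United States
Start and end dates:
2023-06-01 to 2028-05-31
Project status:
Ongoing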

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2239448&HistoricalAwards=false
Keywords:
CAREER Statistical Learning Recursive Partitioning

Abstract

As data-driven technologies continue to be adopted and deployed in high-stakes decision-making environments, the need for fast, interpretable algorithms has never been greater. Decision trees, a hierarchically organized data structure, are one such candidate, and they are increasingly used to build predictive or causal models. This trend is spurred by the appealing connection between decision trees and rule-based decision-making, particularly in clinical, legal, or business contexts: the tree structure mimics the sequential way a human user may think and reason, thereby facilitating human-machine interaction. To make them fast to compute, decision trees are commonly constructed with an algorithm called recursive partitioning, in which the decision nodes of the tree are learned from the data in a greedy, top-down manner. The overarching goal of this project is to develop a precise understanding of the strengths and limitations of decision trees based on recursive partitioning and, in doing so, to gain insight into how to improve their performance in practice. In addition, high-school, undergraduate, and graduate research assistants will be vertically integrated into the project and will benefit both academically and professionally. Innovative curricula, workshops, and data and methods competitions involving students, academics, and industry professionals will facilitate outreach and encourage participation from a broad audience.

This proposal aims to provide a comprehensive study of the statistical properties of greedy recursive partitioning algorithms for training decision trees, organized around two fundamental contexts. The first thrust will develop a theoretical framework for the analysis of oblique decision trees, in which, in contrast to conventional axis-aligned splits involving only a single covariate, the splits at each decision node occur along linear combinations of the covariates. While this methodology has garnered significant attention from the computer science and optimization communities since the mid-1980s, the advantages oblique trees offer over their axis-aligned counterparts remain justified only empirically, and explanations for their success are largely based on heuristics. To fill this long-standing gap between theory and practice, the PI will investigate how oblique regression trees, constructed by recursively minimizing squared error, can adapt to a rich class of regression models consisting of linear combinations of ridge functions. This provides a quantitative baseline for a statistician to compare and contrast decision trees with other, less interpretable methods, such as projection pursuit regression and neural networks, that target similar model forms. Crucially, to address the combinatorial complexity of finding the optimal splitting hyperplane at each decision node, the PI's framework can accommodate many existing computational tools in the literature. A major component of the research is derived from connections between recursive partitioning and sequential greedy approximation algorithms for convex optimization problems (e.g., orthogonal greedy algorithms). The second thrust focuses on the delicate pointwise properties of axis-aligned recursive partitioning, with implications for heterogeneous causal effect estimation, where accurate pointwise estimates over the entire support of the covariates are essential for valid inference (e.g., testing hypotheses and constructing confidence intervals). Motivated by simple settings in which decision trees provably fail to achieve optimal performance, the PI will investigate how the signal-to-noise ratio affects the quality of pointwise estimation. While the focus is on causal effect estimation directly using decision trees, the PI will also investigate implications for multi-step semi-parametric settings, where preliminary unknown functions (e.g., propensity scores) are estimated with machine learning tools, as well as for conditional quantile regression, both of which require estimators with high pointwise accuracy.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
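
For readers unfamiliar with the greedy, top-down procedure the abstract refers to, the following is a minimal sketch of axis-aligned recursive partitioning for regression, in which each node picks the single-covariate threshold that minimizes the squared error of its two children. It is an illustration under simple assumptions, not the PI's method: the function names (sse, best_split, grow_tree, predict), the stopping parameters, and the toy data are all hypothetical. An oblique variant would instead threshold a linear combination of the covariates at each node.

```python
# Illustrative sketch only. All names and parameters here are hypothetical
# and are not taken from the award abstract.

def sse(ys):
    """Sum of squared errors of ys around their mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)


def best_split(X, y):
    """Greedy search over axis-aligned splits: return (error, feature, threshold)
    minimizing the children's total squared error, or None if no split exists."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [v for row, v in zip(X, y) if row[j] <= t]
            right = [v for row, v in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # guard against a degenerate midpoint
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, j, t)
    return best


def grow_tree(X, y, depth=0, max_depth=2, min_samples=2):
    """Top-down recursive partitioning: split the node, then recurse on each child."""
    split = None
    if depth < max_depth and len(y) >= min_samples:
        split = best_split(X, y)
    if split is None:
        return {"leaf": True, "value": sum(y) / len(y)}
    _, j, t = split
    left = [(row, v) for row, v in zip(X, y) if row[j] <= t]
    right = [(row, v) for row, v in zip(X, y) if row[j] > t]
    return {
        "leaf": False,
        "feature": j,
        "threshold": t,
        "left": grow_tree([r for r, _ in left], [v for _, v in left],
                          depth + 1, max_depth, min_samples),
        "right": grow_tree([r for r, _ in right], [v for _, v in right],
                           depth + 1, max_depth, min_samples),
    }


def predict(tree, row):
    """Follow the decision nodes from the root down to a leaf."""
    while not tree["leaf"]:
        if row[tree["feature"]] <= tree["threshold"]:
            tree = tree["left"]
        else:
            tree = tree["right"]
    return tree["value"]


# Toy data (made up): the response depends only on the first covariate.
X_demo = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y_demo = [0.0, 0.0, 1.0, 1.0]
tree_demo = grow_tree(X_demo, y_demo, max_depth=1)
```

On this toy data the root split lands on the first covariate, since splitting on the second leaves squared error in both children. The exhaustive threshold search is chosen for clarity rather than speed; practical implementations typically sort each covariate once and update the error incrementally.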
