Deep Learning and Random Forests for High-Dimensional Regression

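Basic Information

Award number:
2054808
Principal investigator:
Jason Klusowski
Amount:
$158,000
Institution:
Princeton University
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2020
Funding country:
United States
Period:
2020-09-01 to 2023-07-31
Project status:
Completed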

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2054808&HistoricalAwards=false
Keywords:
Deep Learning Random Forests Dimensional

Project Abstract

This project aims to investigate two of the most widely used and state-of-the-art methods for high-dimensional regression: deep neural networks and random forests. Despite their widespread implementation, pinning down their theoretical properties has eluded researchers until recently. The proposed research aims to add to the growing body of literature on their analysis, by both developing tools of theoretical value and providing guarantees and guidance for practitioners and applied scientists who use these popular methods frequently in their work.

The success of multi-layer networks has largely been buoyed by their ability to generalize well despite being able to fit most datasets, given enough parameters. This phenomenon is particularly striking when the input dimension is far greater than the available sample size, as is the case with many modern applications in molecular biology, medical imaging, and astrophysics, to name a few. A major component of the proposed work will be to obtain complexity bounds for classes of deep neural networks with controls on the size of their weights, which can then be used to bound generalization error and statistical risk. These complexity bounds reveal the role of complexity penalization, which is based on certain norms of the weights of the network. Motivated by these observations, another stream of the proposed research seeks to provide statistical guarantees of certain complexity-penalized estimators and their adaptive properties.

Current theoretical results for random forests are either for stylized versions of those that are used in practice or are asymptotic in nature, and it is therefore difficult to determine the quality of convergence as a function of the parameters of the random forest. Furthermore, the setting for the analysis of more practical implementations of random forests is limited to structured, fixed-dimensional regression function classes. Given these restrictions, the first component of the proposal aims to investigate how random forests behave in the high-dimensional regime when the number of predictors grows with the sample size. Another research objective is to isolate and study families of flexible high-dimensional regression functions for which finite sample convergence rates can be established. The final endeavor of this project is to connect popular measures of variable importance to the bias of random forests. Since variable importance measures are used for assessing the role each predictor variable plays in influencing the output, this connection will partially explain why random forests are adaptive to sparsity. The relationship will also help to theoretically motivate variable importance measures as useful tools for model interpretability.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
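As a rough illustration of the abstract's point about variable importance and sparsity, the sketch below computes an impurity-based importance for each predictor: the largest reduction in response variance that a single threshold split on that predictor can achieve. This is not the project's method. The choice of variance as the impurity, the function names (`variance`, `best_impurity_decrease`, `stump_importances`), and the synthetic data are all assumptions made for illustration.

```python
import random


def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_impurity_decrease(xs, ys):
    """Largest variance reduction achievable by one threshold split on xs."""
    n = len(ys)
    parent = variance(ys)
    pairs = sorted(zip(xs, ys))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal predictor values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weighted average of the child variances after the split.
        child = (len(left) * variance(left) + len(right) * variance(right)) / n
        best = max(best, parent - child)
    return best


def stump_importances(X, y):
    """Best single-split impurity decrease for each predictor (column of X)."""
    n_features = len(X[0])
    return [
        best_impurity_decrease([row[j] for row in X], y)
        for j in range(n_features)
    ]


# Hypothetical sparse setting: 5 predictors, only predictor 0 affects y.
rng = random.Random(0)
n, d = 200, 5
X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
y = [2.0 * row[0] + 0.1 * rng.gauss(0, 1) for row in X]

importances = stump_importances(X, y)
print(importances)
```

In this toy setting the relevant predictor receives by far the largest impurity decrease, so a tree that splits greedily on this criterion would tend to split on it and ignore the irrelevant predictors. That is the intuition behind adaptivity to sparsity, shown here on a single split rather than a full forest.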
