High-Dimensional Random Forests Learning, Inference, and Beyond

Basic Information

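Award number:
2310981
Principal investigator:
Yingying Fan
Amount:
$250,000
Institution:
University of Southern California
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2023
Funding country:
United States
Start and end dates:
2023-08-15 to 2026-07-31
Project status:
Ongoing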

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2310981&HistoricalAwards=false
Keywords:
Dimensional Random Forests Learning Inference

Project Abstract

Random Forests are among the most widely used computational methods for making predictions. The approach works by creating a group of decision-makers, like a team of experts, and then aggregating the individual predictions of these experts to form the final prediction. The great success of Random Forests has been verified by their superior performance on many different types of data. Despite this success, Random Forests are still largely regarded as a black-box method because theoretical understanding of them is limited. The complicated nature of the algorithm and the lack of theoretical understanding also make its results less reproducible and harder to interpret. The project will theoretically study the properties of Random Forests to understand when the algorithm works and, more importantly, when it fails. Such studies can give practitioners more confidence and better guidance in applying Random Forests. The project will also investigate how to improve the interpretability of Random Forests. Finally, with the understanding gained from these studies, the project will study how to improve the performance of the algorithm to make it even more useful for big data analysis. These research activities will offer numerous training initiatives for the professional development of the next generation of statisticians and data scientists.

Recently, important progress has been made in the analysis of random forest algorithms, for instance, a proof of the polynomial consistency rate of the original version of Random Forests in the high-dimensional setting, without specific assumptions on the regression function and feature distribution. Yet many fundamentally important questions remain unanswered. The overall objective of this project is to provide an in-depth understanding of complicated ensemble methods such as Random Forests, and to provide improved, interpretable, and reproducible statistical estimation and inference results. The project will first study some important open questions about Random Forests, and then move to statistical inference. In particular, recent studies have confirmed that Random Forests can adapt to sparse models. A natural question is how to uncover the underlying true sparsity structure. Furthermore, some preliminary results suggest that popular existing methods are biased in the presence of feature collinearity. The project will develop valid feature importance measures and further investigate the calculation of p-values for evaluating conditional feature importance in the presence of feature collinearity. The project will also move beyond Random Forests and study the larger problem of conditional independence testing. Utilizing the insights gained from these theoretical studies, the project will further develop an improved ensemble learning method for better prediction, interpretability, and reproducibility in big data analysis.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Yingying Fan

A Practical Li-Ion Full Cell with a High-Capacity Cathode and Electrochemically Exfoliated Graphene Anode: Superior Electrochemical and Low-Temperature Performance

DOI:
10.1021/acsaem.8b01524
Published:
2019-01
Journal:
ACS Applied Energy Materials
Impact factor:
6.4
Authors:
Zhonghui Sun;Zheng Li;Xing-Long Wu;Mingqiang Zou;Dandan Wang;Zhenyi Gu;Jianan Xu;Yingying Fan;Shiyu Gan;Dongxue Han;Li Niu
Corresponding author:
Li Niu