CAREER: Learning and Selecting Low-Dimensional Models from Incomplete Data

Basic Information

Award number:
2239479
Principal investigator:
Daniel Pimentel-Alarcon
Amount:
$600,000
Institution:
University of Wisconsin-Madison
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2023
Funding country:
United States
Project period:
2023-02-01 to 2028-01-31
Project status:
Ongoing

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2239479&HistoricalAwards=false
Keywords:
CAREER Learning Selecting Low Dimensional

Project Abstract

Big datasets often have an underlying structure. Identifying such a structure allows predicting outcomes of interest based on a few variables, for example, predicting the effectiveness of a drug or vaccine based on the drug’s molecular structure. There exists a wide variety of methods to learn the underlying structure of a dataset and make accurate predictions. However, when data is severely incomplete, as is the case in many modern datasets, existing methods consistently fail to identify the correct structure of the data. More alarmingly, the existing methodology has no means to verify whether the structure found is correct or not. In other words, whenever data is incomplete, the structure learned by any existing method cannot be trusted and may result in undetectable, arbitrarily wrong predictions. This project will (i) develop methods to learn structures specifically tailored to handle missing data and (ii) develop a theory to verify whether the structure learned by any method (including existing ones) is correct or not. In turn, this research will enable scientists to learn the structures governing their incomplete datasets in a plethora of applications to the benefit of society, including drug discovery, metagenomics, and opportunistic screening. Furthermore, this project will support outreach activities to engage underrepresented minorities in machine learning, both locally and nationally, through hands-on activities, social media campaigns, symposia, courses, and mentoring.

The technical aims of the project are divided into three main thrusts. The first thrust will investigate a new approach that maps incomplete data to the Grassmann manifold of subspaces, wherein the data’s underlying structure can be revealed by solving a constrained optimization over the Schubert varieties defined by the observed data. The second thrust will develop model-selection criteria to determine the structure that best fits an incomplete dataset, among a collection of candidate structures. These criteria will be generalizations of the Akaike and Bayes information criteria and the minimum effective dimension, adapted to account for missing data. These criteria will be complemented with a goodness-of-fit test to determine if the winning structure is, indeed, a good fit for the data. These are non-trivial tasks that require special considerations in light of missing data, which can consistently cause spurious structures to fit arbitrarily large datasets with the same degree of error as the correct structures. Ultimately, the results from this thrust will allow determining whether the predictions stemming from a specific structure can be trusted or not. The third thrust will implement our methodology in open-source, easy-to-use software to the benefit of the broader scientific community and test it on datasets related to our ongoing interdisciplinary collaborations in metagenomics, single-cell sequencing, sonotype classification, bacteria classification, drug discovery, and clinical opportunistic screening.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
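
The abstract's claim that missing data can let spurious structures fit a dataset as well as the correct one can be illustrated with a small numerical sketch. This is not the project's method. It is a toy example that assumes the "structure" is a linear subspace and that each column reveals a fixed number of randomly chosen entries. The names `observed_residual`, `total_residual`, `U_true`, and `U_wrong` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 6, 2, 50                        # ambient dim, subspace dim, number of columns

U_true = rng.standard_normal((d, r))      # basis of the subspace that generated the data
U_wrong = rng.standard_normal((d, r))     # basis of an unrelated (spurious) subspace
X = U_true @ rng.standard_normal((r, n))  # complete data lying exactly in span(U_true)


def observed_residual(U, x, observed):
    """Residual of the best least-squares fit to the observed entries of x,
    using span(U) restricted to the observed coordinates."""
    U_obs, x_obs = U[observed], x[observed]
    coeffs, *_ = np.linalg.lstsq(U_obs, x_obs, rcond=None)
    return float(np.linalg.norm(U_obs @ coeffs - x_obs))


def total_residual(U, X, k, seed=1):
    """Sum of per-column residuals when each column reveals k random entries.
    A fixed seed gives every candidate subspace the same observation pattern."""
    sampler = np.random.default_rng(seed)
    total = 0.0
    for j in range(X.shape[1]):
        observed = sampler.choice(X.shape[0], size=k, replace=False)
        total += observed_residual(U, X[:, j], observed)
    return total


# Compare the correct and the spurious subspace under two sampling rates.
for k in (r, r + 2):
    print(f"{k} observed entries per column: "
          f"true = {total_residual(U_true, X, k):.2e}, "
          f"spurious = {total_residual(U_wrong, X, k):.2e}")
```

With only `r` observed entries per column, the restricted basis is a square `r` by `r` matrix that is generically invertible, so any subspace of that dimension reproduces the observed entries up to numerical precision and the fit error cannot separate the correct subspace from the spurious one. With more than `r` observed entries per column, only the correct subspace keeps a near-zero residual in this toy setting.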
