CAREER: Learning and Selecting Low-Dimensional Models from Incomplete Data

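Basic information

  • Grant number:
    2239479
  • Principal investigator:
  • Amount:
    $600,000
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Continuing Grant
  • Fiscal year:
    2023
  • Funding country:
    United States
  • Project period:
    2023-02-01 to 2028-01-31
  • Project status:
    Ongoing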

Project abstract

Big datasets often have an underlying structure. Identifying such a structure allows predicting outcomes of interest based on a few variables, for example, predicting the effectiveness of a drug or vaccine based on the drug’s molecular structure. There exists a wide variety of methods to learn the underlying structure of a dataset and make accurate predictions. However, when data is severely incomplete, as is the case in many modern datasets, existing methods consistently fail to identify the correct structure of the data. More alarmingly, the existing methodology has no means to verify whether the structure found is correct or not. In other words, whenever data is incomplete, the structure learned by any existing method cannot be trusted and may result in undetectable, arbitrarily wrong predictions. This project will (i) develop methods to learn structures specifically tailored to handle missing data and (ii) develop a theory to verify whether the structure learned by any method (including existing ones) is correct or not. In turn, this research will enable scientists to learn the structures governing their incomplete datasets in a plethora of applications to the benefit of society, including drug discovery, metagenomics, and opportunistic screening. Furthermore, this project will support outreach activities to engage underrepresented minorities in machine learning, both locally and nationally, through hands-on activities, social media campaigns, symposia, courses, and mentoring.

The technical aims of the project are divided into three main thrusts. The first thrust will investigate a new approach that maps incomplete data to the Grassmann manifold of subspaces, wherein the data’s underlying structure can be revealed by solving a constrained optimization over the Schubert varieties defined by the observed data. The second thrust will develop model-selection criteria to determine the structure that best fits an incomplete dataset, among a collection of candidate structures. These criteria will be generalizations of the Akaike and Bayes information criteria and the minimum effective dimension, adapted to account for missing data. These criteria will be complemented with a goodness-of-fit test to determine if the winning structure is, indeed, a good fit for the data. These are non-trivial tasks that require special considerations in light of missing data, which can consistently cause spurious structures to fit arbitrarily large datasets with the same degree of error as the correct structures. Ultimately, the results from this thrust will allow determining whether the predictions stemming from a specific structure can be trusted or not. The third thrust will implement our methodology in open-source, easy-to-use software to benefit the broader scientific community and test it on datasets related to our ongoing interdisciplinary collaborations in metagenomics, single-cell sequencing, sonotypes classification, bacteria classification, drug discovery, and clinical opportunistic screening.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
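
The abstract's remark that missing data can let spurious structures fit a dataset as well as the correct one can be illustrated with a small numerical sketch. The code below is not the project's method; it is a minimal illustration under assumed settings (a 2-dimensional subspace in 6 dimensions, Gaussian data, and the hypothetical helper names `observed_residual` and `total_residual`). It measures how well a candidate subspace fits only the observed entries of each data column.

```python
import numpy as np

def observed_residual(U, x, mask):
    """Residual of fitting x on span(U), using only the entries where mask is True."""
    U_obs, x_obs = U[mask], x[mask]
    coef, *_ = np.linalg.lstsq(U_obs, x_obs, rcond=None)
    return float(np.linalg.norm(x_obs - U_obs @ coef))

def total_residual(U, X, masks):
    """Sum of per-column residuals; masks[j] marks the observed entries of column j."""
    return sum(observed_residual(U, X[:, j], masks[j]) for j in range(X.shape[1]))

# Assumed illustrative sizes: ambient dimension d, subspace dimension r, n columns.
rng = np.random.default_rng(0)
d, r, n = 6, 2, 50
U_true = rng.standard_normal((d, r))    # basis of the subspace that generated the data
U_wrong = rng.standard_normal((d, r))   # an unrelated subspace of the same dimension
X = U_true @ rng.standard_normal((r, n))

# Fully observed data versus data with only r observed entries per column.
full_masks = np.ones((n, d), dtype=bool)
sparse_masks = np.zeros((n, d), dtype=bool)
for j in range(n):
    sparse_masks[j, rng.choice(d, size=r, replace=False)] = True

full_true = total_residual(U_true, X, full_masks)
full_wrong = total_residual(U_wrong, X, full_masks)
sparse_true = total_residual(U_true, X, sparse_masks)
sparse_wrong = total_residual(U_wrong, X, sparse_masks)

print(f"full data:   true={full_true:.2e}  wrong={full_wrong:.2e}")
print(f"sparse data: true={sparse_true:.2e}  wrong={sparse_wrong:.2e}")
```

With complete columns, the wrong subspace leaves a large residual and is easy to reject. With only r observed entries per column, both subspaces fit the observed entries essentially exactly, so fit error alone cannot tell them apart no matter how many columns are added. This is the kind of situation that motivates criteria adapted to missing data.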

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Daniel Pimentel-Alarcon


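Similar NSFC (National Natural Science Foundation of China) grants

Macro-level innovation catch-up strategy, adaptive learning, and firms' choice of innovation behavior
  • Grant number:
    72303033
  • Approval year:
    2023
  • Amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund
Research on feature selection methods integrating fuzzy rough sets and sparse graph learning
  • Grant number:
    62376230
  • Approval year:
    2023
  • Amount:
    CNY 500,000
  • Award type:
    General Program
Interpretable deep learning models, methods, and applications based on self-sparsity and deep feature selection
  • Grant number:
    12271429
  • Approval year:
    2022
  • Amount:
    CNY 450,000
  • Award type:
    General Program
Research on rational selection of antimicrobial drugs based on machine learning and routine electronic health data
  • Grant number:
  • Approval year:
    2022
  • Amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund
Link selection and interference management for underwater acoustic cooperative information transmission based on online machine learning
  • Grant number:
  • Approval year:
    2022
  • Amount:
    CNY 540,000
  • Award type:
    General Program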

Similar overseas grants

Unsupervised Learning for Selecting and Creating Features for Time Series Prediction
  • Grant number:
    545205-2019
  • Fiscal year:
    2019
  • Amount:
    $600,000
  • Award type:
    University Undergraduate Student Research Awards
Improving the Implementation and Sustainment of EBPs in Mental Health: Developing and Piloting the Collaborative Organizational Approach to Selecting and Tailoring Implementation Strategies (COAST-IS)
  • Grant number:
    9371116
  • Fiscal year:
    2017
  • Amount:
    $600,000
  • Award type:
Efficient transfer learning method by selecting feature extraction processing
  • Grant number:
    17K00334
  • Fiscal year:
    2017
  • Amount:
    $600,000
  • Award type:
    Grant-in-Aid for Scientific Research (C)
IMPROVING THE IMPLEMENTATION AND SUSTAINMENT OF EBPS IN MENTAL HEALTH: DEVELOPING AND PILOTING THE COLLABORATIVE ORGANIZATIONAL APPROACH TO SELECTING AND TAILORING IMPLEMENTATION STRATEGIES (COAST-IS
  • Grant number:
    9982124
  • Fiscal year:
    2017
  • Amount:
    $600,000
  • Award type:
Signal Recognition Mechanisms by Selecting Higher-Order Spectral Features Through Learning
  • Grant number:
    16K00322
  • Fiscal year:
    2016
  • Amount:
    $600,000
  • Award type:
    Grant-in-Aid for Scientific Research (C)