CAREER: Mitigating the Lack of Labeled Training Data in Machine Learning Based on Multi-level Optimization


Basic Information

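  • Award number:
    2339216
  • Principal investigator:
  • Amount:
    $500,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2024
  • Funding country:
    United States
  • Period:
    2024-09-01 to 2029-08-31
  • Project status:
    Ongoing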

Project Abstract

Machine learning has demonstrated great success in numerous applications such as autonomous driving, early detection of diseases, and drug design. The accuracy of machine learning models depends heavily on access to large-scale, human-labeled training data. However, such data is often very challenging to acquire in specialized domains such as healthcare, legislation, and environmental sciences, due to the high cost of obtaining high-quality human labels and to data privacy concerns. This project will advance science by providing algorithms, software, and systems that can automatically generate high-quality labeled data to mitigate the lack of labeled training data in specific domains and allow training of highly accurate machine learning models. The project will significantly broaden the applicability of machine learning across various application areas by lowering data barriers and will substantially reduce the labor costs of manual data annotation. For example, it will promote scientific discovery in structural biology and high-energy physics and streamline engineering design in wireless communication. It will facilitate the early detection of sepsis, lung cancer, Parkinson's disease, and sleep apnea, improving patient outcomes and quality of life. Applied to compound design and cement production, the developed technologies have the potential to expedite drug discovery and reduce energy consumption. To achieve the goal of creating high-quality labeled training data, this project will develop three complementary paradigms of novel approaches based on multi-level optimization and large language models, for: 1) end-to-end generation of labeled data; 2) annotation of unlabeled data; and 3) example-specific adaptation/selection of labeled source data.
First, the proposed data generation methods will leverage the worst-case and class-specific performance of downstream models to provide end-to-end and fine-grained guidance for generating data (with complex labels) that is tailored to improve the accuracy and robustness of downstream models and to promote balanced performance across different classes. Second, the proposed data annotation methods will leverage an end-to-end mechanism that capitalizes on large language models, a sequence of verification procedures, and available side information to maximize the accuracy of generated labels. Third, the proposed adaptation/selection methods will distinguish between source examples that are inside or outside of a target domain and subsequently determine an example-specific adaptation/selection action end-to-end to ensure optimal use of source data. In addition, the proposed novel optimization algorithms and distributed systems will tackle new challenges related to multi-level optimization, including non-differentiability, incompatibility with the optimizers of large language models, and scalability. This project is the first to systematically leverage multi-level optimization to create labeled data. It addresses a fundamental gap: existing methods often cannot execute multiple learning stages end to end and therefore fall short in tailoring generated data to improve downstream models' performance. Another significant innovation of this project is its effective harnessing of large language models for data annotation, which will substantially reduce the costs of manual labeling. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
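As a rough illustration of the third paradigm (example-specific selection of source data), the following toy sketch shows the general shape of a two-level optimization. It is not the project's method: the task (estimating a mean), the data, and all names (`lower_level_solution`, `learn_example_weights`, `source`, `validation`) are hypothetical and chosen only so that the lower level has a closed-form solution. The lower level fits a parameter on weighted source examples; the upper level adjusts the per-example weights so that the fitted parameter performs well on a small target-domain validation set.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lower_level_solution(source, weights):
    # Lower level: closed-form minimizer of sum_i w_i * (theta - x_i)^2,
    # i.e. the weighted mean of the source examples.
    return np.sum(weights * source) / np.sum(weights)


def learn_example_weights(source, validation, steps=300, lr=0.5):
    # Upper level: gradient descent on per-example logits so that the
    # lower-level solution minimizes mean_j (theta - v_j)^2 on validation data.
    logits = np.zeros_like(source)
    for _ in range(steps):
        weights = sigmoid(logits)
        theta = lower_level_solution(source, weights)
        # Chain rule through the closed-form lower-level solution.
        dloss_dtheta = 2.0 * (theta - validation.mean())
        dtheta_dweight = (source - theta) / weights.sum()
        dweight_dlogit = weights * (1.0 - weights)
        logits -= lr * dloss_dtheta * dtheta_dweight * dweight_dlogit
    return sigmoid(logits)


# Hypothetical data: the first four source examples resemble the target
# domain (values near 0); the last two are out-of-domain (values near 5).
source = np.array([0.1, -0.2, 0.0, 0.3, 5.0, 5.2])
validation = np.array([0.0, 0.1, -0.1])

theta_before = lower_level_solution(source, np.full_like(source, 0.5))
weights = learn_example_weights(source, validation)
theta_after = lower_level_solution(source, weights)

print("learned weights:", np.round(weights, 3))
print("theta before/after:", theta_before, theta_after)
```

In this sketch the out-of-domain examples end up with lower weights than the in-domain ones, which pulls the fitted parameter toward the validation data. Real multi-level formulations typically have no closed-form lower level and rely on approximate hypergradients instead.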

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Pengtao Xie

Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization
  • DOI:
    10.48550/arxiv.2402.18128
  • Published:
    2024-02-28
  • Journal:
  • Impact factor:
    0
  • Authors:
    Han Guo; Ramtin Hosseini; Ruiyi Zhang; Sai Ashish Somayajula; Ranak Roy Chowdhury; Rajesh K. Gupta; Pengtao Xie
  • Corresponding author:
    Pengtao Xie
Simultaneous Selection and Adaptation of Source Data via Four-Level Optimization
Interpretable unsupervised learning enables accurate clustering with high-throughput imaging flow cytometry
  • DOI:
    10.1038/s41598-023-46782-w
  • Published:
    2023-11-23
  • Journal:
  • Impact factor:
    4.6
  • Authors:
    Zunming Zhang; Xinyu Chen; Rui Tang; Yuxuan Zhu; Han Guo; Yunjia Qu; Pengtao Xie; Ian Y Lian; Yingxiao Wang; Yu
  • Corresponding author:
    Yu
Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models
  • DOI:
    10.48550/arxiv.2402.18059
  • Published:
    2024-02-28
  • Journal:
  • Impact factor:
    0
  • Authors:
    Mingjia Huo; Sai Ashish Somayajula; Youwei Liang; Ruisi Zhang; F. Koushanfar; Pengtao Xie
  • Corresponding author:
    Pengtao Xie
BLO-SAM: Bi-level Optimization Based Overfitting-Preventing Finetuning of SAM
  • DOI:
    10.1007/978-1-4614-0650-1_8
  • Published:
    2024-02-26
  • Journal:
  • Impact factor:
    0
  • Authors:
    Li Zhang; Youwei Liang; Ruiyi Zhang; Amirhosein Javadi; Pengtao Xie
  • Corresponding author:
    Pengtao Xie

Similar NSFC Grants

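Effects and mechanisms of magnolol in alleviating intestinal epithelial barrier dysfunction in piglets via the endoplasmic reticulum stress-NLRP3 pathway
  • Award number:
    32372928
  • Year approved:
    2023
  • Amount:
    CNY 500,000
  • Project type:
    General Program
Role of the amino acid sensor GCN2 in alleviating autoimmune thyroiditis by regulating Beclin-1-mediated autophagy
  • Award number:
    82370792
  • Year approved:
    2023
  • Amount:
    CNY 490,000
  • Project type:
    General Program
Mechanism by which the deubiquitinase JOSD2 alleviates cardiac hypertrophy by increasing SERCA2a stability in cardiomyocytes
  • Award number:
    82370244
  • Year approved:
    2023
  • Amount:
    CNY 490,000
  • Project type:
    General Program
Mechanism by which the gut microbial metabolite indolepropionic acid alleviates age-related lumbar endplate degeneration through m6A modification-mediated reprogramming of chondrocyte lipid metabolism
  • Award number:
    82372436
  • Year approved:
    2023
  • Amount:
    CNY 490,000
  • Project type:
    General Program
Mechanism by which Chaihu Shugan San inhibits ovarian cancer progression by alleviating emotional stress-mediated phospholipid peroxidation in TAMs
  • Award number:
    82374327
  • Year approved:
    2023
  • Amount:
    CNY 490,000
  • Project type:
    General Program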

Similar Overseas Grants

Domino - Computational Fluid Dynamics Modelling of Ink Droplet Breakup for Mitigating Mist Formation during inkjet printing
  • Award number:
    10090067
  • Fiscal year:
    2024
  • Amount:
    $500,000
  • Project type:
    Collaborative R&D
Mitigating salmon gill disease by integrating genotype-environment studies with host-gill microbiome associations
  • Award number:
    BB/Y005295/1
  • Fiscal year:
    2024
  • Amount:
    $500,000
  • Project type:
    Research Grant
Collaborative Research: AF: Medium: Algorithms Meet Machine Learning: Mitigating Uncertainty in Optimization
  • Award number:
    2422926
  • Fiscal year:
    2024
  • Amount:
    $500,000
  • Project type:
    Continuing Grant
Collaborative Research: Leveraging the interactions between carbon nanomaterials and DNA molecules for mitigating antibiotic resistance
  • Award number:
    2307222
  • Fiscal year:
    2024
  • Amount:
    $500,000
  • Project type:
    Standard Grant
CAREER: Strengthening the Theoretical Foundations of Federated Learning: Utilizing Underlying Data Statistics in Mitigating Heterogeneity and Client Faults
  • Award number:
    2340482
  • Fiscal year:
    2024
  • Amount:
    $500,000
  • Project type:
    Continuing Grant