CAREER: Data Valuation in the Wild: Theories, Algorithms, and Applications

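Basic Information

Award number:
2239622
Principal investigator:
Ruoxi Jia
Amount:
$500,000
Awardee institution:
Virginia Polytechnic Institute and State University
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2023
Funding country:
United States
Project period:
2023-02-01 to 2028-01-31
Project status:
Ongoing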

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2239622&HistoricalAwards=false
Keywords:
CAREER Data Valuation Wild Theories

Project Abstract

Data are essential ingredients for building machine learning (ML) applications. The ability to quantify and measure the value of data is critical to the entire ML lifecycle: from identifying useful data sources, to setting priorities over samples during training, to interpreting why certain model behaviors emerge during deployment. The potential of data valuation has been observed in many applications over the past few years. However, intermixed with these positive results is a vast array of applications for which existing data valuation techniques are not yet applicable, are too expensive to execute, or produce valuation results with substantial uncertainty. This project aims to enable data valuation to overcome applicability, scalability, and reproducibility challenges and transition to a practical and reliable tool for a data-centric future. This work will have a broad impact on society by facilitating automated data quality management, designing incentives for data sharing, and improving the robustness of ML applications. This project will train undergraduate students to solve ML problems from both an algorithmic and a data quality perspective, while also creating useful school-age learning modules implemented at local, regional, and global scales.

The project consists of four research tasks to advance data valuation along different dimensions: 1) designing data valuation techniques that are robust to the randomness in modern ML training algorithms; 2) developing new frameworks to determine the value of data samples given limited information about downstream learning tasks; 3) investigating principled methods to value heterogeneous and streaming data; and 4) creating and open-sourcing a unified multi-faceted evaluation platform to spur future advances in more complex data valuation. The proposed techniques will be implemented and validated on a variety of high-impact real-world applications, including autonomous driving, energy-efficient buildings, and conversational artificial intelligence.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outcomes

Journal papers (3)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

2D-Shapley: A Framework for Fragmented Data Valuation

DOI:
10.48550/arxiv.2306.10473
Publication date:
2023-06
Journal:
Impact factor:
0
Authors:
Zhihong Liu; H. Just; Xiangyu Chang; X. Chen; R. Jia
Corresponding authors:
Zhihong Liu; H. Just; Xiangyu Chang; X. Chen; R. Jia

LAVA: Data Valuation without Pre-Specified Learning Algorithms

DOI:
10.48550/arxiv.2305.00054
Publication date:
2023-04
Journal:
ArXiv
Impact factor:
0
Authors:
H. Just; Feiyang Kang; Jiachen T. Wang; Yi Zeng; Myeongseob Ko; Ming Jin; R. Jia
Corresponding authors:
H. Just; Feiyang Kang; Jiachen T. Wang; Yi Zeng; Myeongseob Ko; Ming Jin; R. Jia

Data Banzhaf: A Robust Data Valuation Framework for Machine Learning