III: Small: Labeling Massive Data from Noisy, Incomplete and Crowdsourced Annotations
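Basic Information
- Award number: 2007836
- Principal investigator:
- Amount: $398,900
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2020
- Funding country: United States
- Period: 2020-10-01 to 2024-09-30
- Project status: Completed
- Source:
- Keywords:
Project Abstract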
Alongside the prosperity of deep learning, the demand for reliably labeled data is unprecedentedly high. Label acquisition is a highly nontrivial task: data labeling is tedious, labor-intensive, and prone to mistakes. Crowdsourcing techniques that integrate annotations from multiple annotators to improve accuracy have been essential for labeling large-scale data. However, existing crowdsourcing techniques face pressing challenges such as heavy annotator workloads, high computational cost, and a lack of strong theoretical guarantees. This project will develop a series of analytical and computational tools for accurately labeling massive datasets from noisy, incomplete, and crowdsourced annotations, with provable guarantees. Leveraging advanced nonnegative matrix factorization theory, this project will offer solutions that are efficient and effective under critical conditions. The outcomes are expected to have broad and substantial positive impacts on the currently label-hungry artificial intelligence industry and the data annotation workforce. For example, the algorithms designed for handling structured data (e.g., speech) will largely benefit timely applications, e.g., intelligent assistants such as Alexa and Siri. The ability to work reliably under largely incomplete data will help design new data dispatch schemes leading to significantly reduced annotator workload. The project will also offer many training opportunities for undergraduate students, with an emphasis on engaging those from underrepresented groups.
In terms of theory and methods, many aspects of crowdsourced data labeling (e.g., sample complexity, noise robustness, and identifiability of the underlying statistical model) are still poorly understood. This project will provide a suite of theoretical and computational tools that advance these aspects. To be specific, the first thrust will build up a coupled nonnegative matrix factorization (CNMF) framework that bridges the classic Dawid-Skene model for crowdsourcing and advanced nonnegative factor analysis theories. This will establish firm theoretical foundations for crowdsourcing under critical conditions, and lead to theory-backed algorithms to attain substantially improved sample complexity and noise/incomplete data robustness. The second thrust exploits domain-dependent knowledge, e.g., data structure and annotator dependence, to come up with situation-aware crowdsourcing techniques for enhanced performance. The third thrust designs stochastic optimization strategies to provide scalable implementations for the CNMF framework, and evaluates the proposed methods over a variety of real-world applications. The analytical and computational tools developed in this project will provide strong provable guarantees and refreshing algorithmic solutions for long-standing challenges in crowdsourced data labeling. In addition, the CNMF theory and algorithms are exciting new directions for computational linear algebra, whose impacts can go well beyond this project.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
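The link between the Dawid-Skene model and nonnegative matrix factorization mentioned in the abstract can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not the project's CNMF algorithm: it assumes the standard Dawid-Skene setup (annotators respond independently given the true label), under which the joint distribution of two annotators' responses equals A_m diag(d) A_l^T, where A_m is annotator m's confusion matrix and d is the class prior. All names (`simulate_dawid_skene`, `cooccurrence`, `random_confusion`) and all numeric settings are hypothetical choices made for this example.

```python
import numpy as np


def random_confusion(num_classes, rng, strength=4.0):
    """Hypothetical helper: a diagonally dominant confusion matrix.

    Entry [i, k] is P(annotator answers i | true label k); columns sum to 1.
    """
    A = rng.random((num_classes, num_classes)) + strength * np.eye(num_classes)
    return A / A.sum(axis=0, keepdims=True)


def simulate_dawid_skene(num_items, confusions, prior, rng):
    """Draw true labels, then each annotator's response given the true label."""
    num_classes = len(prior)
    truth = rng.choice(num_classes, size=num_items, p=prior)
    responses = np.empty((len(confusions), num_items), dtype=int)
    for m, A in enumerate(confusions):
        for k in range(num_classes):
            idx = np.flatnonzero(truth == k)
            # Responses to items of class k follow column k of A.
            responses[m, idx] = rng.choice(num_classes, size=idx.size, p=A[:, k])
    return truth, responses


def cooccurrence(resp_m, resp_l, num_classes):
    """Empirical joint distribution of two annotators' responses."""
    R = np.zeros((num_classes, num_classes))
    np.add.at(R, (resp_m, resp_l), 1.0)
    return R / resp_m.size


rng = np.random.default_rng(0)
prior = np.array([0.5, 0.3, 0.2])          # assumed class prior d
confusions = [random_confusion(3, rng) for _ in range(3)]
truth, responses = simulate_dawid_skene(200_000, confusions, prior, rng)

# Empirical co-occurrence of annotators 0 and 1 versus the model prediction.
R_emp = cooccurrence(responses[0], responses[1], 3)
R_model = confusions[0] @ np.diag(prior) @ confusions[1].T
max_gap = np.abs(R_emp - R_model).max()
print("largest entrywise gap:", max_gap)
```

Because every factor in A_m diag(d) A_l^T is nonnegative, estimating the confusion matrices from such co-occurrence statistics can be posed as a nonnegative factorization problem coupled across annotator pairs, which is the kind of structure the abstract refers to.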
Project Outcomes
Journal papers (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Deep Clustering with Incomplete Noisy Pairwise Annotations: A Geometric Regularization Approach
- DOI: 10.48550/arxiv.2305.19391
- Publication date: 2023-05
- Journal:
- Impact factor: 0
- Authors: Tri Nguyen; Shahana Ibrahim; Xiao Fu
- Corresponding authors: Tri Nguyen; Shahana Ibrahim; Xiao Fu
Mixed Membership Graph Clustering via Systematic Edge Query
- DOI: 10.1109/tsp.2021.3109380
- Publication date: 2020-11
- Journal:
- Impact factor: 5.4
- Authors: Shahana Ibrahim; Xiao Fu
- Corresponding authors: Shahana Ibrahim; Xiao Fu
Crowdsourcing via Annotator Co-occurrence Imputation and Provable Symmetric Nonnegative Matrix Factorization
- DOI:
- Publication date: 2021
- Journal:
- Impact factor: 0
- Authors: Ibrahim, Shahana; Fu, Xiao
- Corresponding author: Fu, Xiao
Learning Mixed Membership from Adjacency Graph Via Systematic Edge Query: Identifiability and Algorithm
- DOI: 10.1109/icassp39728.2021.9413541
- Publication date: 2021-06
- Journal:
- Impact factor: 0
- Authors: Shahana Ibrahim; Xiao Fu
- Corresponding authors: Shahana Ibrahim; Xiao Fu
Deep Learning From Crowdsourced Labels: Coupled Cross-entropy Minimization, Identifiability, and Regularization
- DOI: 10.48550/arxiv.2306.03288
- Publication date: 2023-06
- Journal:
- Impact factor: 0
- Authors: Shahana Ibrahim; Tri Nguyen; Xiao Fu
- Corresponding authors: Shahana Ibrahim; Tri Nguyen; Xiao Fu
Other Publications by Xiao Fu
Tensor-Based Parameter Estimation of Double Directional Massive MIMO Channel with Dual-Polarized Antennas
- DOI:
- Publication date: 2018
- Journal:
- Impact factor: 0
- Authors: Cheng Qian; Xiao Fu; N. Sidiropoulos; Ye Yang
- Corresponding author: Ye Yang
Non-uniform directional dictionary-based limited feedback for massive MIMO systems
- DOI:
- Publication date: 2017
- Journal:
- Impact factor: 0
- Authors: Panos N. Alevizos; Xiao Fu; N. Sidiropoulos; Ye Yang; A. Bletsas
- Corresponding author: A. Bletsas
Understanding gay tourists’ involvement and loyalty towards Thailand: The perspective of motivation-opportunity-ability
- DOI: 10.1177/13567667221147318
- Publication date: 2023
- Journal:
- Impact factor: 5.1
- Authors: Xinyi Liu; Xiao Fu; Yue Yuan; Zhiyong Li; Chattharika Suknuch
- Corresponding author: Chattharika Suknuch
Evaluating the Cranfield Paradigm for Conversational Search Systems
- DOI: 10.1145/3539813.3545126
- Publication date: 2022
- Journal:
- Impact factor: 0
- Authors: Xiao Fu; Emine Yilmaz; Aldo Lipani
- Corresponding author: Aldo Lipani
Using Petroleum and Biomass-Derived Fuels in Dual-fuel Diesel Engines
- DOI: 10.1007/978-81-322-2211-8_11
- Publication date: 2014
- Journal:
- Impact factor: 0
- Authors: S. Aggarwal; Xiao Fu
- Corresponding author: Xiao Fu
Other Grants of Xiao Fu
CIF: Small: Latent Neural Factor Models for Radio Cartography From Bits
- Award number: 2210004
- Fiscal year: 2022
- Funding amount: $398,900
- Project type: Standard Grant
CAREER: Nonlinear Factor Analysis for Sensing and Learning
- Award number: 2144889
- Fiscal year: 2022
- Funding amount: $398,900
- Project type: Continuing Grant
CCSS: Block-term Tensor Tools for Multi-aspect Sensing and Analysis
- Award number: 2024058
- Fiscal year: 2020
- Funding amount: $398,900
- Project type: Standard Grant
Collaborative Research: MLWiNS: ANN for Interference Limited Wireless Networks
- Award number: 2003082
- Fiscal year: 2020
- Funding amount: $398,900
- Project type: Standard Grant
Collaborative Research: Multimodal Sensing and Analytics at Scale: Algorithms and Applications
- Award number: 1808159
- Fiscal year: 2018
- Funding amount: $398,900
- Project type: Standard Grant
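Similar NSFC Grants
Screening of small-molecule inhibitors targeting Treg-FOXP3 and study of their role and mechanism in lung cancer immunotherapy
- Award number: 32370966
- Approval year: 2023
- Funding amount: CNY 500,000
- Project type: General Program
Epigenetic mechanisms by which small-molecule activation of YAP induces chromatin plasticity to promote cardiac progenitor cell reprogramming
- Award number: 82304478
- Approval year: 2023
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund
Construction and mechanism of microglia-targeting biomimetic glycyrrhizic acid nanoparticles: a new therapeutic strategy for sepsis-associated encephalopathy
- Award number: 82302422
- Approval year: 2023
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund
Role and mechanism of microglial pyroptosis mediated by the HMGB1/TLR4/Cathepsin B pathway in hypoxic-ischemic encephalopathy in neonatal rats
- Award number: 82371712
- Approval year: 2023
- Funding amount: CNY 490,000
- Project type: General Program
Role and mechanism of small cysteine-free proteins in regulating the insecticidal activity of biocontrol fungi
- Award number: 32372613
- Approval year: 2023
- Funding amount: CNY 500,000
- Project type: General Program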
Similar Overseas Grants
Dicer DNA nickase activity and its role in anti-viral immunity in human cells
- Award number: 10724622
- Fiscal year: 2023
- Funding amount: $398,900
- Project type:
Extra-hepatic postprandial metabolism of dietary fructose
- Award number: 10614587
- Fiscal year: 2022
- Funding amount: $398,900
- Project type:
Structure of Malaria Parasite RNA polymerase
- Award number: 10433276
- Fiscal year: 2022
- Funding amount: $398,900
- Project type:
Extra-hepatic postprandial metabolism of dietary fructose
- Award number: 10418420
- Fiscal year: 2022
- Funding amount: $398,900
- Project type:
Structure of Malaria Parasite RNA polymerase
- Award number: 10552645
- Fiscal year: 2022
- Funding amount: $398,900
- Project type: