III: Small: Labeling Massive Data from Noisy, Incomplete and Crowdsourced Annotations

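Basic Information

Award number:
2007836
Principal investigator:
Xiao Fu
Award amount:
$398,900
Host institution:
Oregon State University
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2020
Funding country:
United States
Project period:
2020-10-01 to 2024-09-30
Project status:
Completed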

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2007836&HistoricalAwards=false
Keywords:
III Small Labeling Massive Data

Project Abstract

Alongside the prosperity of deep learning, the demand for reliably labeled data is unprecedentedly high. Label acquisition is a highly nontrivial task: data labeling is tedious, labor-intensive, and prone to mistakes. Crowdsourcing techniques that integrate annotations from multiple annotators to improve accuracy have been essential for labeling large-scale data. However, existing crowdsourcing techniques face pressing challenges such as heavy annotator workload, high computational cost, and a lack of strong theoretical guarantees. This project will develop a series of analytical and computational tools, with provable guarantees, for accurately labeling massive datasets from noisy, incomplete, and crowdsourced annotations. Leveraging advanced nonnegative matrix factorization theory, this project will offer solutions that are efficient and effective under critical conditions. The outcomes are expected to have broad and substantial positive impacts on the currently label-hungry artificial intelligence industry and the data annotation workforce. For example, the algorithms designed for handling structured data (e.g., speech) will largely benefit timely applications, e.g., intelligent assistants such as Alexa and Siri. The ability to work reliably with largely incomplete data will help design new data dispatch schemes that significantly reduce annotator workload. The project will also offer many training opportunities for undergraduate students, with an emphasis on engaging those from underrepresented groups.

In terms of theory and methods, many aspects of crowdsourced data labeling (e.g., sample complexity, noise robustness, and identifiability of the underlying statistical model) are still poorly understood. This project will provide a suite of theoretical and computational tools that advance these aspects. To be specific, the first thrust will build up a coupled nonnegative matrix factorization (CNMF) framework that bridges the classic Dawid-Skene model for crowdsourcing and advanced nonnegative factor analysis theories. This will establish firm theoretical foundations for crowdsourcing under critical conditions, and lead to theory-backed algorithms that attain substantially improved sample complexity and robustness to noisy and incomplete data. The second thrust exploits domain-dependent knowledge, e.g., data structure and annotator dependence, to come up with situation-aware crowdsourcing techniques for enhanced performance. The third thrust designs stochastic optimization strategies to provide scalable implementations for the CNMF framework, and evaluates the proposed methods over a variety of real-world applications. The analytical and computational tools developed in this project will provide strong provable guarantees and refreshing algorithmic solutions for long-standing challenges in crowdsourced data labeling. In addition, the CNMF theory and algorithms are exciting new directions for computational linear algebra, whose impacts can go well beyond this project.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
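
The link the abstract draws between the Dawid-Skene model and nonnegative matrix factorization can be illustrated with a small simulation. The sketch below is not the project's algorithm; it only assumes the standard Dawid-Skene setup, in which annotators respond independently given the true label, so that the joint distribution of two annotators' answers equals A_m diag(d) A_j^T for confusion matrices A_m, A_j and class prior d. All function names, the two-class setting, and the numeric values are hypothetical choices made for illustration.

```python
import random


def simulate_dawid_skene(prior, confusions, n_items, seed=0):
    """Draw true labels and annotator answers under the Dawid-Skene model.

    prior[k]            = P(true label = k)
    confusions[m][r][k] = P(annotator m answers r | true label = k)
    """
    rng = random.Random(seed)
    classes = list(range(len(prior)))
    truths, answers = [], []
    for _ in range(n_items):
        k = rng.choices(classes, weights=prior)[0]
        truths.append(k)
        # Annotators answer independently once the true label k is fixed.
        answers.append([
            rng.choices(classes, weights=[A[r][k] for r in classes])[0]
            for A in confusions
        ])
    return truths, answers


def model_cooccurrence(prior, A_m, A_j):
    """Model-implied joint answer distribution: A_m diag(prior) A_j^T."""
    K = len(prior)
    return [[sum(prior[k] * A_m[r][k] * A_j[s][k] for k in range(K))
             for s in range(K)] for r in range(K)]


def empirical_cooccurrence(answers, m, j, K):
    """Fraction of items on which annotator m said r and annotator j said s."""
    counts = [[0] * K for _ in range(K)]
    for row in answers:
        counts[row[m]][row[j]] += 1
    n = len(answers)
    return [[c / n for c in row] for row in counts]


# Hypothetical two-class example; each confusion-matrix column sums to 1.
prior = [0.6, 0.4]
confusions = [
    [[0.9, 0.2], [0.1, 0.8]],  # annotator 0
    [[0.7, 0.3], [0.3, 0.7]],  # annotator 1
]

_, answers = simulate_dawid_skene(prior, confusions, n_items=20000)
R_model = model_cooccurrence(prior, confusions[0], confusions[1])
R_emp = empirical_cooccurrence(answers, 0, 1, K=2)
max_gap = max(abs(R_model[r][s] - R_emp[r][s])
              for r in range(2) for s in range(2))
print("largest gap between empirical and model co-occurrence:", max_gap)
```

The empirical co-occurrence matrix is computed from annotations alone, without access to the true labels, which is why a factorization of such matrices is a plausible route to recovering the confusion matrices and the prior.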

Project Outcomes

Journal articles (5)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Deep Clustering with Incomplete Noisy Pairwise Annotations: A Geometric Regularization Approach

DOI:
10.48550/arxiv.2305.19391
Publication date:
2023-05
Journal:
ArXiv
Impact factor:
0
Authors:
Tri Nguyen;Shahana Ibrahim;Xiao Fu
Corresponding authors:
Tri Nguyen;Shahana Ibrahim;Xiao Fu

Mixed Membership Graph Clustering via Systematic Edge Query

DOI:
10.1109/tsp.2021.3109380
Publication date:
2020-11
Journal:
IEEE Transactions on Signal Processing
Impact factor:
5.4
Authors:
Shahana Ibrahim;Xiao Fu
Corresponding authors:
Shahana Ibrahim;Xiao Fu

Crowdsourcing via Annotator Co-occurrence Imputation and Provable Symmetric Nonnegative Matrix Factorization

DOI:
Publication date:
2021
Journal:
Proceedings of the 38th International Conference on Machine Learning
Impact factor:
0
Authors:
Ibrahim, Shahana;Fu, Xiao
Corresponding author:
Fu, Xiao

Learning Mixed Membership from Adjacency Graph Via Systematic Edge Query: Identifiability and Algorithm

DOI:
10.1109/icassp39728.2021.9413541
Publication date:
2021-06
Journal:
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Impact factor:
0
Authors:
Shahana Ibrahim;Xiao Fu
Corresponding authors:
Shahana Ibrahim;Xiao Fu

Deep Learning From Crowdsourced Labels: Coupled Cross-entropy Minimization, Identifiability, and Regularization

DOI:
10.48550/arxiv.2306.03288
Publication date:
2023-06
Journal:
ArXiv
Impact factor:
0
Authors:
Shahana Ibrahim;Tri Nguyen;Xiao Fu
Corresponding authors:
Shahana Ibrahim;Tri Nguyen;Xiao Fu

Other publications by Xiao Fu

Tensor-Based Parameter Estimation of Double Directional Massive Mimo Channel with Dual-Polarized Antennas

DOI:
Publication date:
2018
Journal:
IEEE International Conference on Acoustics, Speech, and Signal Processing
Impact factor:
0
Authors:
Cheng Qian;Xiao Fu;N. Sidiropoulos;Ye Yang
Corresponding author:
Ye Yang

Non-uniform directional dictionary-based limited feedback for massive MIMO systems

DOI:
Publication date:
2017
Journal:
International Symposium on Modeling and Optimization in Mobile, Ad-Hoc and Wireless Networks
Impact factor:
0
Authors:
Panos N. Alevizos;Xiao Fu;N. Sidiropoulos;Ye Yang;A. Bletsas
Corresponding author:
A. Bletsas

Understanding gay tourists’ involvement and loyalty towards Thailand: The perspective of motivation-opportunity-ability

DOI:
10.1177/13567667221147318
Publication date:
2023
Journal:
Journal of Vacation Marketing
Impact factor:
5.1
Authors:
Xinyi Liu;Xiao Fu;Yue Yuan;Zhiyong Li;Chattharika Suknuch
Corresponding author:
Chattharika Suknuch

Evaluating the Cranfield Paradigm for Conversational Search Systems

DOI:
10.1145/3539813.3545126
Publication date:
2022
Journal:
Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval
Impact factor:
0
Authors:
Xiao Fu;Emine Yilmaz;Aldo Lipani
Corresponding author:
Aldo Lipani