III: Small: A New Machine Learning Approach for Improved Entity Identification

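Basic information

Award number:
1815538
Principal investigator:
Shivaram Venkataraman
Amount:
$320.4K
Institution:
University of Wisconsin-Madison
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2018
Funding country:
United States
Start and end dates:
2018-09-01 to 2022-08-31
Project status:
Completed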

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1815538&HistoricalAwards=false
Keywords:
III Small New Machine Learning

Project abstract

Modern analytics rely on data integration to combine heterogeneous data into a unified repository they can tap into for insights, services, and scientific knowledge. The typical goal of data integration is to combine heterogeneous data about the same real-world entity into a canonical representation of that entity. Traditionally, entity canonicalization methods focus on structured data and leverage the semantics of the schema accompanying the data to come up with canonical entity representations. This dependency on data semantics makes existing entity canonicalization methods inapplicable to dark data, i.e., operational data that corresponds to unstructured, noisy, and incomplete data. This project will develop entity canonicalization methods that focus on unstructured and semi-structured data and are suitable for large-scale integration applications. This work will help ease the currently challenging procedure of heuristically consolidating matching information about the same entity into unified representations and thus enable dark data to be more effectively used in downstream analytics applications.

The emphasis of this work is on entity canonicalization techniques that leverage representation learning (a.k.a. feature learning) and deep learning. The combination of distributed representations with deep architectures has emerged as the de facto standard for analyzing and processing unstructured data. This project will develop new deep learning architectures for: (1) record linkage, i.e., clustering unstructured data records that provide information about the same entity; and (2) data fusion, i.e., combining matching unstructured records into a canonical representation of the underlying entity. For record linkage, this work will introduce new deep learning techniques that capture multi-context domain-specific knowledge to learn the semantic similarity between records. For data fusion, this project will design new multi-sequence to one-sequence encoder-decoder recurrent neural networks for data fusion with a particular focus on incomplete data. The outcomes of this project have the potential to advance the state-of-the-art in large scale data integration methods as well as machine learning methods for high-dimensional, sparse, and noisy data.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
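The two steps named in the abstract, record linkage and data fusion, can be illustrated with a toy sketch. This is not the project's method: the abstract proposes learned similarity and encoder-decoder networks, whereas the sketch below swaps in token-overlap (Jaccard) similarity, union-find clustering, and a "keep the longest record" fusion rule. The function names, the threshold of 0.5, and the sample records are all assumptions made for illustration.

```python
from itertools import combinations


def tokens(record: str) -> set[str]:
    """Lowercased whitespace tokens; a stand-in for a learned representation."""
    return set(record.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1] (assumed here in place of a learned one)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def link_records(records: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Record linkage: cluster indices of records whose similarity meets the threshold."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if jaccard(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters: dict[int, list[int]] = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def fuse(records: list[str], cluster: list[int]) -> str:
    """Data fusion (toy rule): keep the longest record as the canonical form."""
    return max((records[i] for i in cluster), key=len)


# Hypothetical sample records, invented for this sketch.
records = [
    "Univ. of Wisconsin Madison",
    "University of Wisconsin Madison",
    "MIT Cambridge MA",
    "MIT Cambridge",
]
clusters = link_records(records)
canonical = [fuse(records, c) for c in clusters]
```

Here `clusters` groups the two Wisconsin records and the two MIT records, and `canonical` holds one representative string per group. A learned similarity function would replace `jaccard`, and a sequence model would replace `fuse`, without changing the overall linkage-then-fusion shape.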

Project outcomes

Journal papers (8)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins

DOI:
10.14778/3494124.3494149
Published:
2021-06
Journal:
ArXiv
Impact factor:
0
Authors:
S. Suri; Ihab F. Ilyas; Christopher Ré; Theodoros Rekatsinas
Corresponding authors:
S. Suri; Ihab F. Ilyas; Christopher Ré; Theodoros Rekatsinas

Demo of Marius: A System for Large-scale Graph Embeddings

DOI:
Published:
2021
Journal:
Proceedings of the VLDB Endowment
Impact factor:
2.5
Authors:
Carlsson, Anders; Xie, Anze; Mohoney, Jason; Waleffe, Roger; Peters, Shanan; Rekatsinas, Theodoros; Venkataraman, Shivaram
Corresponding author:
Venkataraman, Shivaram

MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks

DOI:
10.1145/3552326.3567501
Published:
2022-02
Journal:
Proceedings of the Eighteenth European Conference on Computer Systems
Impact factor:
0
Authors:
R. Waleffe; J. Mohoney; Theodoros Rekatsinas; S. Venkataraman
Corresponding authors:
R. Waleffe; J. Mohoney; Theodoros Rekatsinas; S. Venkataraman

Picket: guarding against corrupted data in tabular data during learning and inference

DOI:
10.1007/s00778-021-00699-w
Published:
2020-06
Journal:
The VLDB Journal
Impact factor:
0
Authors:
Zifan Liu; Zhechun Zhou; Theodoros Rekatsinas
Corresponding authors:
Zifan Liu; Zhechun Zhou; Theodoros Rekatsinas

Unsupervised Relation Extraction from Language Models using Constrained Cloze Completion

DOI:
10.18653/v1/2020.findings-emnlp.113
Published:
2020-10
Journal:
ArXiv
Impact factor:
0
Authors:
Ankur Goswami; Akshata Bhat; Hadar Ohana; Theodoros Rekatsinas
Corresponding authors:
Ankur Goswami; Akshata Bhat; Hadar Ohana; Theodoros Rekatsinas

Other publications by Shivaram Venkataraman

CHAI: Clustered Head Attention for Efficient LLM Inference

DOI:
Published:
2024
Journal:
arXiv.org
Impact factor:
0
Authors:
Saurabh Agarwal; Bilge Acun; Basil Homer; Mostafa Elhoushi; Yejin Lee; Shivaram Venkataraman; Dimitris Papailiopoulos; Carole
Corresponding author:
Carole