Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data


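Basic information

Grant number: RGPIN-2019-04068
Principal investigator: Ilyas, Ihab
Amount: $29,900
Host institution: University of Waterloo
Host institution country: Canada
Project category: Discovery Grants Program - Individual
Fiscal year: 2021
Funding country: Canada
Duration: 2021-01-01 to 2022-12-31
Project status: Completed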

Source: https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=737383
Keywords: Scalable, Cleaning, Integration, Analysis, Structured

Project abstract

Enterprises in all verticals (e.g., healthcare, financial services, manufacturing, and insurance) have been aggressively collecting data from a variety of sources, including customers, transactions, sensors, and social data, to build the ultimate data asset. The hope is that, with appropriate analysis techniques, this data can provide insights, directions, and findings that increase customer satisfaction, achieve higher profit margins, or even inspire new lines of business or enable new discoveries. Unfortunately, what prevents this vision from becoming a pervasive reality is the data itself: dirty and siloed data is the norm rather than the exception. Consequently, data curation, cleaning, and integration become key enablers of the big promise of effective data science. An article in the New York Times (August 2014) indicated that, for data scientists, "cleaning" is a key hurdle to insights.

Large-scale data cleaning to enable data science is the main goal of this proposal. Data cleaning is often described as a set of activities that includes finding and fixing anomalies and outliers, imputing missing values, and deduplicating records that represent the same entity. The aim is to prepare data to be mined and analyzed by a variety of tools to produce high-quality aggregates and insights. The task of curating and integrating large amounts of data presents real theoretical and engineering challenges. Most current proposals suffer from fundamental problems that hinder their deployment in practical industry and business settings. I propose to conduct fundamental research in data quality leading to solutions (new technologies, methods, and algorithms) that can be deployed in real environments. The main objective is to enable quality-aware analytics on, and retrieval from, large-scale inconsistent and dirty data sources, unleashing the potential of data science.

Some of the fundamental challenges in achieving this objective, which we intend to investigate, include: (1) developing efficient profiling and repair solutions that scale to large data sets; (2) addressing the privacy concerns around sensitive data by developing privacy-aware exploration, error detection, and repair frameworks; (3) modelling data cleaning as a large-scale statistical inference problem that takes into account all available signals, including business rules, master data, and various statistical properties; (4) studying practical variants of the outlier detection problem; and (5) investigating the quality issues in integrating unstructured data (such as text) with structured relational data, including revisiting information extraction systems to include quality constraints. The proposed techniques will be implemented and tested in multiple open-source system prototypes, including HoloClean, our recent system for machine learning-based data cleaning.
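As a rough illustration of the three cleaning activities the abstract names (deduplicating records, finding and fixing outliers, imputing missing values), the following minimal Python sketch runs them over a small in-memory table. It is not the method proposed in the grant and not how HoloClean works; the function names, the `name` and `age` fields, the name-normalization match key, and the median/MAD outlier rule with a 3.5 threshold are all assumptions chosen only to keep the example small.

```python
from statistics import median


def normalize_name(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def deduplicate(records):
    """Keep the first record for each normalized name (a hypothetical match key)."""
    seen, unique = set(), []
    for record in records:
        key = normalize_name(record["name"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(record))  # copy so the input is not mutated
    return unique


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median and MAD based) exceeds threshold."""
    present = [v for v in values if v is not None]
    if not present:
        return [False] * len(values)
    center = median(present)
    mad = median([abs(v - center) for v in present])
    if mad == 0:
        return [False] * len(values)
    return [
        v is not None and abs(0.6745 * (v - center) / mad) > threshold
        for v in values
    ]


def impute_missing(values):
    """Replace missing values (None) with the median of the observed values."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    fill = median(present)
    return [fill if v is None else v for v in values]


def clean(records, field="age"):
    """Deduplicate, treat flagged outliers as missing, then impute the gaps."""
    unique = deduplicate(records)
    values = [r[field] for r in unique]
    flags = flag_outliers(values)
    values = [None if is_outlier else v for v, is_outlier in zip(values, flags)]
    for record, value in zip(unique, impute_missing(values)):
        record[field] = value
    return unique


if __name__ == "__main__":
    sample = [
        {"name": "Alice Smith", "age": 34},
        {"name": "alice  smith ", "age": 34},  # duplicate of the first record
        {"name": "Bob Jones", "age": None},    # missing value
        {"name": "Carol White", "age": 36},
        {"name": "Dan Brown", "age": 35},
        {"name": "Eve Black", "age": 350},     # likely data-entry error
    ]
    for row in clean(sample):
        print(row)
```

Each step here is a fixed rule applied independently. The abstract's challenge (3) instead frames cleaning as a single statistical inference problem that combines business rules, master data, and statistical signals, which this sketch does not attempt.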


Project outcomes

Journal papers: 0
Monographs: 0
Research awards: 0
Conference papers: 0
Patents: 0

No data available

Data last updated: 2024-06-01

Other grants held by Ilyas, Ihab

Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
Grant number: RGPIN-2019-04068
Fiscal year: 2022
Amount: $29,900
Project category: Discovery Grants Program - Individual

NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
Grant number: 534011-2017
Fiscal year: 2021
Amount: $29,900
Project category: Industrial Research Chairs

End-to-end Extraction and Curation of Large RDF Repositories
Grant number: 543961-2019
Fiscal year: 2020
Amount: $29,900
Project category: Collaborative Research and Development Grants

NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
Grant number: 534011-2017
Fiscal year: 2020
Amount: $29,900
Project category: Industrial Research Chairs

Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
Grant number: RGPIN-2019-04068
Fiscal year: 2020
Amount: $29,900
Project category: Discovery Grants Program - Individual

Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
Grant number: RGPIN-2019-04068
Fiscal year: 2019
Amount: $29,900
Project category: Discovery Grants Program - Individual

End-to-end Extraction and Curation of Large RDF Repositories
Grant number: 543961-2019
Fiscal year: 2019
Amount: $29,900
Project category: Collaborative Research and Development Grants

NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
Grant number: 534011-2017
Fiscal year: 2019
Amount: $29,900
Project category: Industrial Research Chairs

Cleaning and Analysis of Large Uncertain and Inconsistent Data Sources
Grant number: RGPIN-2014-06143
Fiscal year: 2018
Amount: $29,900
Project category: Discovery Grants Program - Individual

NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
Grant number: 534011-2017
Fiscal year: 2018
Amount: $29,900
Project category: Industrial Research Chairs

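Similar grants from the National Natural Science Foundation of China (NSFC)

Mechanism of mazG as a virulence factor of Mycobacterium tuberculosis
Grant number: 31300126
Year approved: 2013
Amount: 230,000 CNY
Project category: Young Scientists Fund

Research on cache-based remote timing attacks
Grant number: 60772082
Year approved: 2007
Amount: 280,000 CNY
Project category: General Program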

Similar overseas grants

Meta-Analysis of Metabolic Determinants of Exercise Response in Common Funds Data
Grant number: 10772237
Fiscal year: 2023
Amount: $29,900
Project category:

Data Management and Analysis Core
Grant number: 10627597
Fiscal year: 2023
Amount: $29,900
Project category:

Core 1 - Biostatistics
Grant number: 10628256
Fiscal year: 2023
Amount: $29,900
Project category:

Growing up in a digital world: A synergistic approach to understanding media use in children ages 1-8 years
Grant number: 10701805
Fiscal year: 2022
Amount: $29,900
Project category:

Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
Grant number: RGPIN-2019-04068
Fiscal year: 2022
Amount: $29,900
Project category: Discovery Grants Program - Individual