Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data


Basic Information

Grant number:
RGPIN-2019-04068
Principal investigator:
Ilyas, Ihab
Amount:
$29,900
Host institution:
University of Waterloo
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2019
Funding country:
Canada
Start and end dates:
2019-01-01 to 2020-12-31
Project status:
Completed

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=688355
Keywords:
Scalable Cleaning Integration Analysis Structured

Project Abstract

Enterprises in all verticals (e.g., healthcare, financial services, manufacturing, and insurance) have been aggressively collecting data from a variety of sources, including customers, transactions, sensors, and social data, to build the ultimate data asset. The hope is that, with appropriate analysis techniques, this data can provide insights, directions, and findings that increase customer satisfaction, achieve higher profit margins, or even inspire new lines of business and enable new discoveries. Unfortunately, what prevents this vision from becoming a pervasive reality is the data itself: dirty and siloed data is the norm rather than the exception. Consequently, data curation, cleaning, and integration become key enablers of the big promise of effective data science. An article in the New York Times (August 2014) indicated that, for data scientists, "cleaning" is a key hurdle to insights. Large-scale data cleaning to enable data science is the main goal of this proposal.

Data cleaning is often described as a set of activities that includes finding and fixing anomalies and outliers, imputing missing values, and deduplicating records that represent the same entity. The main objective is to prepare data to be mined and analyzed by a variety of tools to produce high-quality aggregates and insights. The task of curating and integrating large amounts of data presents real theoretical and engineering challenges. Most current proposals suffer from fundamental problems that hinder their deployment in practical industry and business settings.

I propose to conduct fundamental research in data quality leading to solutions (new technologies, methods, and algorithms) that can be deployed in real environments. The main objective is to enable quality-aware analytics on, and retrieval from, large-scale inconsistent and dirty data sources, unleashing the potential of data science. Some of the fundamental challenges in achieving this objective, which we intend to investigate, include: (1) developing efficient profiling and repair solutions that scale to large data sets; (2) addressing the privacy concerns around sensitive data by developing a privacy-aware exploration, error detection, and repair framework; (3) modelling data cleaning as a large-scale statistical inference problem that takes into account all available signals, including business rules, master data, and various statistical properties; (4) studying practical variants of the outlier detection problem; and (5) investigating the quality issues in integrating unstructured data (such as text) with structured relational data, including revisiting information extraction systems to include quality constraints. The proposed techniques will be implemented and tested in multiple open-source system prototypes, including HoloClean, our recent system for machine learning-based data cleaning.
