Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
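Basic information
- Grant number: RGPIN-2019-04068
- Principal investigator:
- Amount: $29,900
- Host institution:
- Host institution country: Canada
- Program: Discovery Grants Program - Individual
- Fiscal year: 2021
- Funding country: Canada
- Period: 2021-01-01 to 2022-12-31
- Status: Completed
- Source:
- Keywords:
Project abstract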
Enterprises in all verticals (e.g., healthcare, financial services, manufacturing, and insurance) have been aggressively collecting data from a variety of sources, including customers, transactions, sensors, and social data, to build the ultimate data asset. The hope is that, with appropriate analysis techniques, this data can provide insights, directions, and findings that increase customer satisfaction, achieve higher profit margins, or even inspire new lines of business and enable new discoveries. Unfortunately, what prevents this vision from becoming a pervasive reality is the data itself: dirty and siloed data is the norm rather than the exception. Consequently, data curation, cleaning, and integration become key enablers of the big promise of effective data science. An article in the New York Times (August 2014) indicated that, for data scientists, "cleaning" is a key hurdle to insights. Large-scale data cleaning to enable data science is the main goal of this proposal. Data cleaning is often described as a set of activities that includes finding and fixing anomalies and outliers, imputing missing values, and deduplicating records that represent the same entity. The aim is to prepare data to be mined and analyzed by a variety of tools to produce high-quality aggregates and insights. Curating and integrating large amounts of data presents real theoretical and engineering challenges. Most current proposals suffer from fundamental problems that prevent them from being deployed in practical industry and business settings. I propose to conduct fundamental research in data quality leading to solutions (new technologies, methods, and algorithms) that can be deployed in real environments. The main objective is to enable quality-aware analytics on, and retrieval from, large-scale inconsistent and dirty data sources, unleashing the potential of data science.
Some of the fundamental challenges in achieving this objective, which we intend to investigate, include: (1) developing efficient profiling and repair solutions that scale to large data sets; (2) addressing the privacy concerns around sensitive data by developing privacy-aware exploration, error detection, and repair frameworks; (3) modelling data cleaning as a large-scale statistical inference problem that takes into account all available signals, including business rules, master data, and various statistical properties; (4) studying practical variants of the outlier detection problem; and (5) investigating the quality issues in integrating unstructured data (such as text) with structured relational data, including revisiting information extraction systems to include quality constraints. The proposed techniques will be implemented and tested in multiple open-source system prototypes, including HoloClean, our recent system for machine learning-based data cleaning.
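The three cleaning activities named in the abstract (flagging outliers, imputing missing values, and deduplicating records) can be illustrated with a minimal Python sketch. This is only a toy illustration under stated assumptions, not the method proposed in this project and not how HoloClean works: the function names, the median/MAD outlier rule with a 3.5 threshold, median imputation, and exact matching on normalized key fields are all choices made here for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices whose modified z-score (median/MAD based) exceeds threshold.

    Missing values (None) are ignored. The 3.5 cutoff is an assumed default.
    """
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    if mad == 0:
        return []  # no spread around the median, nothing can be flagged
    return [
        i
        for i, v in enumerate(values)
        if v is not None and 0.6745 * abs(v - med) / mad > threshold
    ]


def impute_median(values, skip=()):
    """Fill None entries with the median of observed values.

    Indices listed in `skip` (e.g. flagged outliers) are excluded
    when the fill value is computed.
    """
    observed = [v for i, v in enumerate(values) if v is not None and i not in skip]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def deduplicate(records, key_fields):
    """Keep the first record for each key after trimming and lowercasing."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(str(rec[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


if __name__ == "__main__":
    # Hypothetical sample data, invented for this illustration.
    readings = [10.0, 11.0, None, 9.0, 10.5, 500.0]
    outliers = flag_outliers(readings)
    print(outliers)
    print(impute_median(readings, skip=set(outliers)))

    people = [
        {"name": "Ada Lovelace ", "city": "London"},
        {"name": "ada lovelace", "city": "london"},
        {"name": "Alan Turing", "city": "London"},
    ]
    print(deduplicate(people, ["name", "city"]))
```

Rule-based steps like these treat each signal in isolation; the abstract's challenge (3) instead frames cleaning as a single statistical inference problem that combines business rules, master data, and statistical properties.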
Project outputs
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Ilyas, Ihab
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2022
- Amount: $29,900
- Program: Discovery Grants Program - Individual
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2021
- Amount: $29,900
- Program: Industrial Research Chairs
End-to-end Extraction and Curation of Large RDF Repositories
- Grant number: 543961-2019
- Fiscal year: 2020
- Amount: $29,900
- Program: Collaborative Research and Development Grants
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2020
- Amount: $29,900
- Program: Industrial Research Chairs
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2020
- Amount: $29,900
- Program: Discovery Grants Program - Individual
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2019
- Amount: $29,900
- Program: Discovery Grants Program - Individual
End-to-end Extraction and Curation of Large RDF Repositories
- Grant number: 543961-2019
- Fiscal year: 2019
- Amount: $29,900
- Program: Collaborative Research and Development Grants
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2019
- Amount: $29,900
- Program: Industrial Research Chairs
Cleaning and Analysis of Large Uncertain and Inconsistent Data Sources
- Grant number: RGPIN-2014-06143
- Fiscal year: 2018
- Amount: $29,900
- Program: Discovery Grants Program - Individual
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2018
- Amount: $29,900
- Program: Industrial Research Chairs
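Similar NSFC grants
Mechanism of mazG as a virulence factor of Mycobacterium tuberculosis
- Grant number: 31300126
- Approval year: 2013
- Amount: CNY 230,000
- Program: Young Scientists Fund
Research on cache-based remote timing attacks
- Grant number: 60772082
- Approval year: 2007
- Amount: CNY 280,000
- Program: General Program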
Similar overseas grants
Meta-Analysis of Metabolic Determinants of Exercise Response in Common Funds Data
- Grant number: 10772237
- Fiscal year: 2023
- Amount: $29,900
- Program:
Growing up in a digital world: A synergistic approach to understanding media use in children ages 1-8 years
- Grant number: 10701805
- Fiscal year: 2022
- Amount: $29,900
- Program:
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2022
- Amount: $29,900
- Program: Discovery Grants Program - Individual