Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
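Basic information
- Grant number: RGPIN-2019-04068
- Principal investigator:
- Amount: $29,900
- Host institution:
- Host institution country: Canada
- Program: Discovery Grants Program - Individual
- Fiscal year: 2019
- Funding country: Canada
- Period: 2019-01-01 to 2020-12-31
- Status: Completed
- Source:
- Keywords: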
Project abstract
Enterprises in all verticals (e.g., healthcare, financial services, manufacturing, and insurance) have been aggressively collecting data from a variety of sources, including customers, transactions, sensors and social data, to build the ultimate data asset. The hope is that, with appropriate analysis techniques, this data can provide insights, directions, and findings that increase customer satisfaction, achieve higher profit margins, or even inspire new lines of business and enable new discoveries. Unfortunately, what prevents this vision from becoming a pervasive reality is the data itself: dirty and siloed data is the norm rather than the exception. Consequently, data curation, cleaning and integration become key enablers of the big promise of effective data science. A New York Times article (August 2014) reported that, for data scientists, "cleaning" is a key hurdle to insights. Large-scale data cleaning to enable data science is the main goal of this proposal.

Data cleaning is often described as a set of activities that includes finding and fixing anomalies and outliers, imputing missing values, and deduplicating records that represent the same entity. The main objective is to prepare data to be mined and analyzed by a variety of tools to produce high-quality aggregates and insights. The task of curating and integrating large amounts of data presents real theoretical and engineering challenges. Most current proposals suffer from fundamental problems that prevent them from being deployed in practical industry and business settings.

I propose to conduct fundamental research in data quality leading to solutions (new technologies, methods and algorithms) that can be deployed in real environments. The main objective is to enable quality-aware analytics on, and retrieval from, large-scale inconsistent and dirty data sources, unleashing the potential of data science. Some of the fundamental challenges in achieving this objective, which we intend to investigate, include: (1) developing efficient profiling and repair solutions that scale to large data sets; (2) addressing the privacy concerns around sensitive data by developing privacy-aware exploration, error detection, and repair frameworks; (3) modelling data cleaning as a large-scale statistical inference problem that takes into account all available signals, including business rules, master data and various statistical properties; (4) studying practical variants of the outlier detection problem; and (5) investigating the quality issues in integrating unstructured data (such as text) with structured relational data, including revisiting information extraction systems to include quality constraints. The proposed techniques will be implemented and tested in multiple open-source system prototypes, including HoloClean, our recent system for machine learning-based data cleaning.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants held by Ilyas, Ihab
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2022
- Amount: $29,900
- Program: Discovery Grants Program - Individual
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2021
- Amount: $29,900
- Program: Discovery Grants Program - Individual
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2021
- Amount: $29,900
- Program: Industrial Research Chairs
End-to-end Extraction and Curation of Large RDF Repositories
- Grant number: 543961-2019
- Fiscal year: 2020
- Amount: $29,900
- Program: Collaborative Research and Development Grants
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2020
- Amount: $29,900
- Program: Industrial Research Chairs
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2020
- Amount: $29,900
- Program: Discovery Grants Program - Individual
End-to-end Extraction and Curation of Large RDF Repositories
- Grant number: 543961-2019
- Fiscal year: 2019
- Amount: $29,900
- Program: Collaborative Research and Development Grants
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2019
- Amount: $29,900
- Program: Industrial Research Chairs
Cleaning and Analysis of Large Uncertain and Inconsistent Data Sources
- Grant number: RGPIN-2014-06143
- Fiscal year: 2018
- Amount: $29,900
- Program: Discovery Grants Program - Individual
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2018
- Amount: $29,900
- Program: Industrial Research Chairs
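Similar National Natural Science Foundation of China (NSFC) grants
Mechanism of mazG as a virulence factor of Mycobacterium tuberculosis
- Grant number: 31300126
- Approval year: 2013
- Amount: CNY 230,000
- Program: Young Scientists Fund
Research on Cache-based remote timing attacks
- Grant number: 60772082
- Approval year: 2007
- Amount: CNY 280,000
- Program: General Program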
Similar overseas grants
Meta-Analysis of Metabolic Determinants of Exercise Response in Common Funds Data
- Grant number: 10772237
- Fiscal year: 2023
- Amount: $29,900
- Program:
Growing up in a digital world: A synergistic approach to understanding media use in children ages 1-8 years
- Grant number: 10701805
- Fiscal year: 2022
- Amount: $29,900
- Program:
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2022
- Amount: $29,900
- Program: Discovery Grants Program - Individual