Cleaning and Analysis of Large Uncertain and Inconsistent Data Sources
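Basic information
- Grant number: RGPIN-2014-06143
- Principal investigator:
- Amount: $55,400
- Host institution:
- Host institution country: Canada
- Project category: Discovery Grants Program - Individual
- Fiscal year: 2018
- Funding country: Canada
- Duration: 2018-01-01 to 2019-12-31
- Project status: Completed
- Source:
- Keywords: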
Project abstract
Data generated by modern applications such as object tracking, sensor networks, health record management, and Web data integration involves uncertainty and various anomalies, such as missing values and duplication. While many research efforts have focused on data cleaning and on dealing with inconsistent databases, very little of this research has been adopted in real settings, owing to various technical and practical challenges around increasing and extracting value from large dirty data sets. These challenges include: (1) the protection and sensitivity of data, where data custodians and guardians prevent automatic repairing algorithms from changing the underlying data; (2) the heterogeneity of integrity constraints, which makes proposed techniques that tackle a single type of error inapplicable or ineffective in practice; (3) the lack of ground truth to validate repairing strategies, given that quality metrics such as minimal repairs have not shown great results in practice; and (4) the lack of interactive tools for data quality that allow users and experts to reason about the problematic parts of the data and to explain the reasons behind these errors.
In this proposal, we focus on enabling data quality analytics and retrieval on large-scale inconsistent and dirty databases. The proposal pursues a set of research directions including (1) non-destructive data cleaning that represents and queries possible data repairs without changing the underlying data; (2) holistic data cleaning, which addresses the violations of multiple heterogeneous integrity constraints; (3) high-fidelity data repairing, which depends more on trusted data sources and experts, and less on heuristic quality metrics such as minimal repairs; and (4) descriptive and prescriptive data quality analytics in practical dashboards that go beyond describing errors in the data to recommending ways to prevent future errors.
The proposed techniques will be implemented and tested in our previously developed system prototypes: UClean, a probabilistic and quality-aware database engine prototype based on an open-source Database Management System; and NADEEF, an open-source extensible data cleaning system. The goal is to build a generic framework that encapsulates efficient query processing algorithms to allow users to effectively query, analyze and explore large volumes of inconsistent and uncertain data.
The developed algorithms and dashboard will enable both the research community and industry to reason about the quality of available data sets, and will provide guidance on how to clean or enhance the quality of this data with respect to target applications or use cases.
Project outcomes
Journal papers (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Ilyas, Ihab
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2022
- Funding amount: $55,400
- Project category: Discovery Grants Program - Individual
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2021
- Funding amount: $55,400
- Project category: Discovery Grants Program - Individual
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2021
- Funding amount: $55,400
- Project category: Industrial Research Chairs
End-to-end Extraction and Curation of Large RDF Repositories
- Grant number: 543961-2019
- Fiscal year: 2020
- Funding amount: $55,400
- Project category: Collaborative Research and Development Grants
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2020
- Funding amount: $55,400
- Project category: Industrial Research Chairs
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2020
- Funding amount: $55,400
- Project category: Discovery Grants Program - Individual
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2019
- Funding amount: $55,400
- Project category: Discovery Grants Program - Individual
End-to-end Extraction and Curation of Large RDF Repositories
- Grant number: 543961-2019
- Fiscal year: 2019
- Funding amount: $55,400
- Project category: Collaborative Research and Development Grants
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2019
- Funding amount: $55,400
- Project category: Industrial Research Chairs
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2018
- Funding amount: $55,400
- Project category: Industrial Research Chairs
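Similar grants from the National Natural Science Foundation of China (NSFC)
Nonlinear flutter analysis methods and wind tunnel testing for long-span bridges considering lateral motion under non-uniform wind fields
- Grant number: 52308480
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Analysis methods for the deformation and stability of deep, large excavations in soft clay, jointly driven by mathematical and mechanistic models
- Grant number: 52308392
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Structural analysis theory for large-scale ETFE cushions with layered multiple air chambers under complex dynamic loads and actions
- Grant number:
- Approval year: 2022
- Funding amount: CNY 540,000
- Project category: General Program
A method combining deep generative networks and deep reinforcement learning for buffeting reliability analysis of long-span high-speed railway bridges
- Grant number:
- Approval year: 2022
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Key problems in the security analysis of symmetric cryptographic algorithms oriented to large word units
- Grant number:
- Approval year: 2022
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund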
Similar overseas grants
CalR: A toolkit and repository for experiments of energy homeostasis using indirect calorimetry
- Grant number: 10544759
- Fiscal year: 2022
- Funding amount: $55,400
- Project category:
CalR: A toolkit and repository for experiments of energy homeostasis using indirect calorimetry
- Grant number: 10338235
- Fiscal year: 2022
- Funding amount: $55,400
- Project category:
Enhancing Assisted Reproductive Technologies with Deep Learning and Data Visualization
- Grant number: 10376335
- Fiscal year: 2021
- Funding amount: $55,400
- Project category: