RealPDBs: Realistic Data Models and Query Compilation for Large-Scale Probabilistic Databases
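Basic information
- Grant number: EP/R013667/1
- Principal investigator:
- Amount: $995,500
- Host institution:
- Host institution country: United Kingdom
- Project category: Research Grant
- Fiscal year: 2017
- Funding country: United Kingdom
- Duration: 2017 to (no data)
- Project status: Completed
- Source:
- Keywords:
Project abstract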
In recent years, there has been strong interest in academia and industry in building large-scale probabilistic knowledge bases from data in an automated way, which has resulted in a number of systems, such as DeepDive, NELL, Yago, Freebase, Microsoft's Probase, and Google's Knowledge Vault. These systems continuously crawl the Web and extract structured information, and thus populate their databases with millions of entities and billions of tuples. To what extent can these search and extraction systems help with real-world use cases? This turns out to be an open-ended question. For example, DeepDive is used to build knowledge bases for domains such as paleontology, geology, medical genetics, and human movement. From a broader perspective, the quest for building large-scale knowledge bases serves as a new dawn for artificial intelligence research. Fields such as information extraction, natural language processing (e.g., question answering), relational and deep learning, knowledge representation and reasoning, and databases are taking the initiative towards a common goal. Querying large-scale probabilistic knowledge bases is commonly regarded as being at the heart of these efforts.
Beyond all these success stories, however, probabilistic knowledge bases still lack the fundamental machinery to convey some of the valuable knowledge hidden in them to the end user, which seriously limits their potential applications in practice. These problems are rooted in the semantics of (tuple-independent) probabilistic databases, which are used for encoding most probabilistic knowledge bases. For reasons of computational efficiency, probabilistic databases are typically based on strong, unrealistic completeness assumptions, such as the closed-world assumption, the tuple-independence assumption, and the lack of commonsense knowledge. These strong, unrealistic assumptions not only lead to unwanted consequences but also put probabilistic databases on weak footing in terms of knowledge base learning, completion, and querying.
More specifically, each of the above systems encodes only a portion of the real world, and this description is necessarily incomplete; these systems continuously crawl the Web, encounter new sources, and consequently new facts, leading them to add such facts to their database. However, when it comes to querying, most of these systems employ the closed-world assumption, i.e., any fact that is not present in the database is assigned the probability 0, and thus assumed to be impossible. As a closely related problem, it is common practice to view every extracted fact as an independent Bernoulli variable, i.e., any two facts are probabilistically independent. For example, the fact that a person starred in a movie is independent of the fact that this person is an actor, which is in conflict with the fundamental nature of the knowledge domain. Furthermore, current probabilistic databases lack commonsense knowledge (in particular, ontological knowledge), which can often be exploited in reasoning to deduce implicit consequences from data, and which is often essential for querying large-scale probabilistic databases in an uncontrolled environment such as the Web.
The main goal of this proposal is to enhance large-scale probabilistic databases (and so to unlock their full data modelling potential) with more realistic data models, while preserving their computational properties. We plan to develop different semantics for the resulting probabilistic databases and analyse their computational properties and sources of intractability. We also plan to design practical, scalable query answering algorithms for them, especially algorithms based on knowledge compilation techniques, extending existing knowledge compilation approaches and elaborating new ones based on tensor factorisation and neural-symbolic knowledge compilation. We will also produce a prototype implementation and experimentally evaluate the proposed algorithms.
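The closed-world and tuple-independence assumptions described in the abstract can be illustrated with a small sketch. This is not the project's implementation: the toy database, the names `alice`, `bob`, and `movie1`, the probabilities, and the helper functions `fact_probability` and `query_probability` are all hypothetical, and the query is evaluated by naive enumeration of possible worlds rather than by knowledge compilation.

```python
from itertools import product

# Hypothetical toy tuple-independent probabilistic database:
# each stored fact is mapped to its marginal probability.
facts = {
    ("StarredIn", "alice", "movie1"): 0.9,
    ("Actor", "alice"): 0.6,
}

def fact_probability(db, fact):
    """Closed-world assumption: a fact absent from the database gets probability 0."""
    return db.get(fact, 0.0)

def query_probability(db, query):
    """Sum the probabilities of all possible worlds in which `query` holds.

    Under tuple independence, the probability of a world is the product of
    p for every fact it contains and (1 - p) for every fact it omits.
    The enumeration is exponential in the number of facts.
    """
    items = list(db.items())
    total = 0.0
    for choices in product([True, False], repeat=len(items)):
        world = {fact for (fact, _), keep in zip(items, choices) if keep}
        weight = 1.0
        for (_, p), keep in zip(items, choices):
            weight *= p if keep else 1.0 - p
        if query(world):
            total += weight
    return total

starred = ("StarredIn", "alice", "movie1")
actor = ("Actor", "alice")

# Worlds in which alice starred in a movie but is not an actor still
# receive positive probability, because the two facts are independent.
p_inconsistent = query_probability(
    facts, lambda world: starred in world and actor not in world
)

# A fact that was never extracted is treated as impossible.
p_unknown = fact_probability(facts, ("Actor", "bob"))
```

In this sketch the world in which `alice` starred in `movie1` without being an actor has probability 0.9 × (1 − 0.6), which is the kind of unwanted consequence the abstract attributes to tuple independence, and the unseen fact about `bob` has probability 0 under the closed-world assumption.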
Project outputs
Journal articles (10)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs
- DOI: 10.4230/lipics.icdt.2020.5
- Published: 2019-10
- Journal:
- Impact factor: 0
- Authors: Antoine Amarilli; I. Ceylan
- Corresponding authors: Antoine Amarilli; I. Ceylan
An ontology-based deep learning approach for triple classification with out-of-knowledge-base entities
- DOI: 10.1016/j.ins.2021.02.018
- Published: 2021-02
- Journal:
- Impact factor: 0
- Authors: Elvira Amador-Domínguez; E. Serrano; Daniel Manrique; Patrick Hohenecker; Thomas Lukasiewicz
- Corresponding authors: Elvira Amador-Domínguez; E. Serrano; Daniel Manrique; Patrick Hohenecker; Thomas Lukasiewicz
Combining RDF and SPARQL with CP-theories to reason about preferences in a Linked Data setting
- DOI: 10.3233/sw-180339
- Published: 2020-04
- Journal:
- Impact factor: 3
- Authors: V. W. Anelli; R. Leone; T. D. Noia; Thomas Lukasiewicz; Jessica Rosati
- Corresponding authors: V. W. Anelli; R. Leone; T. D. Noia; Thomas Lukasiewicz; Jessica Rosati
Approximate weighted model integration on DNF structures
- DOI: 10.1016/j.artint.2022.103753
- Published: 2022
- Journal:
- Impact factor: 14.4
- Authors: Abboud R
- Corresponding author: Abboud R
The Dichotomy of Evaluating Homomorphism-Closed Queries on Probabilistic Graphs
- DOI: 10.46298/lmcs-18(1:2)2022
- Published: 2022
- Journal:
- Impact factor: 0.6
- Authors: Amarilli A
- Corresponding author: Amarilli A
Other publications by Thomas Lukasiewicz
Fuzzy Description Logic Programs under the Answer Set Semantics for the Semantic Web
- DOI: 10.4018/jswis.2008070104
- Published: 2006-11
- Journal:
- Impact factor: 0
- Authors: Thomas Lukasiewicz
- Corresponding author: Thomas Lukasiewicz
Uncertainty Representation and Reasoning in the Semantic Web
- DOI: 10.4018/978-1-60566-112-4.ch013
- Published: 2008
- Journal:
- Impact factor: 0
- Authors: Thomas Lukasiewicz
- Corresponding author: Thomas Lukasiewicz
Hybrid Deep-Semantic Matrix Factorization for Tag-Aware Personalized Recommendation
- DOI: 10.1109/icassp40776.2020.9053044
- Published: 2017
- Journal:
- Impact factor: 0
- Authors: Zhenghua Xu; Cheng Chen; Thomas Lukasiewicz; Yishu Miao
- Corresponding author: Yishu Miao
Complexity results for preference aggregation over (m)CP-nets: Max and rank voting
- DOI:
- Published: 2021
- Journal:
- Impact factor: 14.4
- Authors: Thomas Lukasiewicz; Enrico Malizia
- Corresponding author: Enrico Malizia
Complexity Results for Preference Aggregation over (m)CP-nets: Pareto and Majority Voting
- DOI:
- Published: 2018
- Journal:
- Impact factor: 14.4
- Authors: Thomas Lukasiewicz; Enrico Malizia
- Corresponding author: Enrico Malizia
Other grants held by Thomas Lukasiewicz
PrOQAW: Probabilistic Ontological Query Answering on the Web
- Grant number: EP/J008346/1
- Fiscal year: 2012
- Funding amount: $995,500
- Project category: Research Grant
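Similar grants from the National Natural Science Foundation of China
The impact of the substitution effect of technological innovation on the real exchange rate under incomplete markets
- Grant number: 72303152
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Research on measurement-device-independent-type quantum key distribution protocols for practical applications
- Grant number: 62371244
- Approval year: 2023
- Funding amount: CNY 530,000
- Project category: General Program
Multiscale study of static friction on rough surfaces based on the actual state of the interface
- Grant number: 12302141
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Study of the contribution of precursors to secondary organic aerosol formation under real atmospheric conditions
- Grant number:
- Approval year: 2023
- Funding amount: CNY 510,000
- Project category:
Research on mesh generation and algorithms for the simulation of realistic 3D semiconductor devices
- Grant number: 12371413
- Approval year: 2023
- Funding amount: CNY 440,000
- Project category: General Program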
Similar overseas grants
Realistic quantification of potential privacy loss from genomic summary results
- Grant number: 10540473
- Fiscal year: 2022
- Funding amount: $995,500
- Project category:
Realistic quantification of potential privacy loss from genomic summary results
- Grant number: 10616768
- Fiscal year: 2022
- Funding amount: $995,500
- Project category:
Advancing Bio-Realistic Modeling via the Brain Modeling ToolKit and SONATA Data Format
- Grant number: 10306896
- Fiscal year: 2021
- Funding amount: $995,500
- Project category:
A machine learning ultrasound beamformer based on realistic wave physics for high body mass index imaging
- Grant number: 10595030
- Fiscal year: 2021
- Funding amount: $995,500
- Project category:
Advancing Bio-Realistic Modeling via the Brain Modeling ToolKit and SONATA Data Format
- Grant number: 10477439
- Fiscal year: 2021
- Funding amount: $995,500
- Project category: