RealPDBs: Realistic Data Models and Query Compilation for Large-Scale Probabilistic Databases

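Basic information

Grant number:
EP/R013667/1
Principal investigator:
Thomas Lukasiewicz
Amount:
$995,500
Host institution:
University of Oxford
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2017
Funding country:
United Kingdom
Start and end dates:
2017 to (no data)
Project status:
Completed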

Source:
https://gtr.ukri.org/projects?ref=EP%2FR013667%2F1
Keywords:
RealPDBs Realistic Data Models Query

Project abstract

In recent years, there has been strong interest in academia and industry in building large-scale probabilistic knowledge bases from data in an automated way, which has resulted in a number of systems, such as DeepDive, NELL, Yago, Freebase, Microsoft's Probase, and Google's Knowledge Vault. These systems continuously crawl the Web and extract structured information, and thus populate their databases with millions of entities and billions of tuples. To what extent can these search and extraction systems help with real-world use cases? This turns out to be an open-ended question. For example, DeepDive is used to build knowledge bases for domains such as paleontology, geology, medical genetics, and human movement. From a broader perspective, the quest for building large-scale knowledge bases serves as a new dawn for artificial intelligence research. Fields such as information extraction, natural language processing (e.g., question answering), relational and deep learning, knowledge representation and reasoning, and databases are taking the initiative towards a common goal. Querying large-scale probabilistic knowledge bases is commonly regarded as being at the heart of these efforts.
Beyond all these success stories, however, probabilistic knowledge bases still lack the fundamental machinery to convey some of the valuable knowledge hidden in them to the end user, which seriously limits their potential applications in practice. These problems are rooted in the semantics of (tuple-independent) probabilistic databases, which are used for encoding most probabilistic knowledge bases. For reasons of computational efficiency, probabilistic databases are typically based on strong, unrealistic completeness assumptions, such as the closed-world assumption, the tuple-independence assumption, and the lack of commonsense knowledge.
These strong, unrealistic assumptions not only lead to unwanted consequences, but also put probabilistic databases on a weak footing in terms of knowledge base learning, completion, and querying. More specifically, each of the above systems encodes only a portion of the real world, and this description is necessarily incomplete; these systems continuously crawl the Web, encounter new sources and consequently new facts, and add such facts to their databases. However, when it comes to querying, most of these systems employ the closed-world assumption, i.e., any fact that is not present in the database is assigned the probability 0, and is thus assumed to be impossible. As a closely related problem, it is common practice to view every extracted fact as an independent Bernoulli variable, i.e., any two facts are probabilistically independent. For example, the fact that a person starred in a movie is treated as independent of the fact that this person is an actor, which is in conflict with the fundamental nature of the knowledge domain. Furthermore, current probabilistic databases lack commonsense knowledge (in particular, ontological knowledge), which can often be exploited in reasoning to deduce implicit consequences from data, and which is often essential for querying large-scale probabilistic databases in an uncontrolled environment such as the Web.
The main goal of this proposal is to enhance large-scale probabilistic databases (and so to unlock their full data modelling potential) with more realistic data models, while preserving their computational properties. We plan to develop different semantics for the resulting probabilistic databases and to analyse their computational properties and sources of intractability. We also plan to design practical, scalable query answering algorithms for them, especially algorithms based on knowledge compilation techniques, extending existing knowledge compilation approaches and elaborating new ones based on tensor factorisation and neural-symbolic knowledge compilation. We will also produce a prototype implementation and experimentally evaluate the proposed algorithms.
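
The two assumptions criticised in the abstract can be made concrete with a minimal sketch. It is not taken from the project; the fact names, probabilities, and function names below are hypothetical and chosen only for illustration. It shows a tiny tuple-independent probabilistic database in which an absent fact gets probability 0 (closed-world assumption) and stored facts are combined as independent Bernoulli variables.

```python
from itertools import product

# Hypothetical tuple-independent probabilistic database: every stored fact has
# a marginal probability and is treated as an independent Bernoulli variable.
facts = {
    ("StarredIn", "alice", "movie1"): 0.9,
    ("Actor", "alice"): 0.6,
}

def fact_probability(db, fact):
    """Closed-world lookup: a fact absent from the database gets probability 0."""
    return db.get(fact, 0.0)

def conjunction_probability(db, query_facts):
    """Probability that all given facts hold, assuming tuple independence."""
    p = 1.0
    for fact in set(query_facts):  # de-duplicate so a repeated fact counts once
        p *= fact_probability(db, fact)
    return p

def query_probability_by_worlds(db, query):
    """Brute-force possible-worlds semantics: add up the weights of all worlds
    (subsets of the stored facts) in which the Boolean query holds."""
    items = list(db.items())
    total = 0.0
    for included in product([False, True], repeat=len(items)):
        world = {fact for (fact, _), keep in zip(items, included) if keep}
        weight = 1.0
        for (_, p), keep in zip(items, included):
            weight *= p if keep else 1.0 - p
        if query(world):
            total += weight
    return total

starred = ("StarredIn", "alice", "movie1")
actor = ("Actor", "alice")
both = conjunction_probability(facts, [starred, actor])   # 0.9 * 0.6, about 0.54
missing = fact_probability(facts, ("Actor", "bob"))        # absent fact, so 0.0
either = query_probability_by_worlds(
    facts, lambda world: starred in world or actor in world
)
```

In this toy model, knowing that alice starred in a movie does not raise the probability that she is an actor, and any fact about an unseen entity such as bob is ruled out entirely, which is the behaviour the abstract describes as unrealistic. The brute-force enumeration is exponential in the number of facts and is shown only to spell out the semantics.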

Project outputs

Journal articles (10)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs

DOI:
10.4230/lipics.icdt.2020.5
Published:
2019-10
Journal:
ArXiv
Impact factor:
0
Authors:
Antoine Amarilli; I. Ceylan
Corresponding authors:
Antoine Amarilli; I. Ceylan

An ontology-based deep learning approach for triple classification with out-of-knowledge-base entities

DOI:
10.1016/j.ins.2021.02.018
Published:
2021-02
Journal:
Inf. Sci.
Impact factor:
0
Authors:
Elvira Amador-Domínguez; E. Serrano; Daniel Manrique; Patrick Hohenecker; Thomas Lukasiewicz
Corresponding authors:
Elvira Amador-Domínguez; E. Serrano; Daniel Manrique; Patrick Hohenecker; Thomas Lukasiewicz

Combining RDF and SPARQL with CP-theories to reason about preferences in a Linked Data setting

DOI:
10.3233/sw-180339
Published:
2020-04
Journal:
Semantic Web
Impact factor:
3
Authors:
V. W. Anelli; R. Leone; T. D. Noia; Thomas Lukasiewicz; Jessica Rosati
Corresponding authors:
V. W. Anelli; R. Leone; T. D. Noia; Thomas Lukasiewicz; Jessica Rosati

Approximate weighted model integration on DNF structures

DOI:
10.1016/j.artint.2022.103753
Published:
2022
Journal:
Artificial Intelligence
Impact factor:
14.4
Authors:
Abboud R
Corresponding authors:
Abboud R

The Dichotomy of Evaluating Homomorphism-Closed Queries on Probabilistic Graphs

DOI:
10.46298/lmcs-18(1:2)2022
Published:
2022
Journal:
Logical Methods in Computer Science
Impact factor:
0.6
Authors:
Amarilli A
Corresponding authors:
Amarilli A

Other publications by Thomas Lukasiewicz

Fuzzy Description Logic Programs under the Answer Set Semantics for the Semantic Web

DOI:
10.4018/jswis.2008070104
Published:
2006-11
Journal:
2006 Second International Conference on Rules and Rule Markup Languages for the Semantic Web (RuleML'06)
Impact factor:
0
Authors:
Thomas Lukasiewicz
Corresponding authors:
Thomas Lukasiewicz

Uncertainty Representation and Reasoning in the Semantic Web

DOI:
10.4018/978-1-60566-112-4.ch013
Published:
2008
Journal:
Impact factor:
0
Authors:
Thomas Lukasiewicz
Corresponding authors:
Thomas Lukasiewicz

Hybrid Deep-Semantic Matrix Factorization for Tag-Aware Personalized Recommendation

DOI:
10.1109/icassp40776.2020.9053044
Published:
2017
Journal:
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Impact factor:
0
Authors:
Zhenghua Xu; Cheng Chen; Thomas Lukasiewicz; Yishu Miao
Corresponding authors:
Yishu Miao