AF: Small: Algorithms and Information Theory for Causal Inference

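Basic Information

Award number:
1618795
Principal investigator:
Leonard Schulman
Amount:
$450,000
Host institution:
California Institute of Technology
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2016
Funding country:
United States
Project period:
2016-08-01 to 2020-07-31
Project status:
Completed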

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1618795&HistoricalAwards=false
Keywords:
AF Small Algorithms Information Theory

Project Abstract

This project is concerned, first, with the algorithmic and information-theoretic aspects of causal inference. With the exception of some scientific data gathered purely for knowledge, most data is gathered for the purpose of potential intervention: this holds for medicine, public health, environmental regulation, market research, legal remedies for discrimination, and many other domains. A decision-maker cannot take advantage of correlations and other structural characterizations discovered in data without knowing the causal relationships between variables. Historically, causality has been teased apart from correlation through controlled experiments. However, there are several good reasons why one must often make do with passive observation: ethical reasons, governance constraints, and the uniqueness of the system together with the inability to re-run history. Absent experiments, we lack the principal tool of the scientific method.

Yet there is a special class of systems in which it is possible to perform causal inference purely from passive observation of the statistics. For a system to fall in this class, one must be able to establish on physical grounds that certain observable variables are statistically independent of certain others, conditional on a third set being held fixed; the formalism for this is "semi-Markovian graphical models". It is known which semi-Markovian models fall in this class, subject to the assumption of perfect statistics. From this starting point there remain significant theoretical challenges before these ideas can have the greatest possible impact on practice. The challenges to be addressed include:

(1) The PI will aim to quantify how the stability (condition number) of causal identification depends on the various sources of uncertainty (statistical error, numerical error, model error) and on the structure of the graphical model. The purpose is both to understand what inference is justifiable from existing data, and to influence study design so that the data with the greatest leverage is collected. For the former objective, the PI seeks an efficient algorithm to compute the condition number of a given semi-Markovian model at the specific observed statistics. For the latter objective, the PI seeks an efficient algorithm to compute the worst-case condition number of a given semi-Markovian model.

(2) Existing causal identification algorithms, applied to data inconsistent with the model (which is unavoidable due to statistical error, and normally also due to model error), will yield an inference inconsistent with the model. The project will help to understand whether projection onto the model may improve stability.

(3) One of the obstacles to the use of existing methods is that they require a sample size exponential in the size of the graphical model. The project aims to determine when it is possible to infer causality using only the marginal distributions over small subsets of the observable variables; this would reduce the sample size and likely improve the condition number.

(4) In the majority of semi-Markovian models, causality is not identifiable. This leaves open, however, the possibility of determining (or giving a nontrivial outer bound for) the feasible interval of causal effects. No effective algorithm is currently known for this problem, and the project aims to provide one. Such an algorithm could be used to show that an intervention is favorable even though the effect is not fully identifiable.

(5) The project aims to lift the causal-inference algorithm to time series, and to study the connections with the distinct techniques (Granger causality and Massey's directed information) normally used in this setting.

Secondary emphases of the project include broader research in theoretical computer science, in particular the study of connections between the "boosting" or "multiplicative weights" methods used in algorithms and machine learning and their variants that arise out of selection or self-interest in the system dynamics of ecosystems ("weak selection") and economic marketplaces ("tatonnement").

Inseparably from the research effort, the PI will train students and postdocs in these and related areas of the theory of computation.
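As an illustration of what identifying a causal effect purely from passive observation can look like, the following is a minimal sketch using the well-known front-door adjustment formula on a toy semi-Markovian model. The abstract does not name this particular model or formula; the graph (X -> M -> Y with an unobserved U affecting both X and Y), all numeric parameters, and all function names are assumptions chosen only for illustration. The sketch works with exact probabilities (the "perfect statistics" assumption) and compares the front-door estimate, computed from the observable joint alone, with the true interventional probability and with the naive conditional probability.

```python
import itertools

# Hypothetical toy model (illustrative values only): U is unobserved,
# with edges U -> X, X -> M, M -> Y and U -> Y.
P_U1 = 0.5
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}
P_M1_GIVEN_X = {0: 0.3, 1: 0.9}
P_Y1_GIVEN_MU = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}


def bern(p1, v):
    """Probability that a Bernoulli(p1) variable takes the value v."""
    return p1 if v == 1 else 1.0 - p1


def observed_joint():
    """Exact joint P(x, m, y) over the observables, with U summed out."""
    joint = {}
    for u, x, m, y in itertools.product((0, 1), repeat=4):
        p = (bern(P_U1, u) * bern(P_X1_GIVEN_U[u], x)
             * bern(P_M1_GIVEN_X[x], m) * bern(P_Y1_GIVEN_MU[(m, u)], y))
        joint[(x, m, y)] = joint.get((x, m, y), 0.0) + p
    return joint


def marginal(joint, x=None, m=None, y=None):
    """Sum the joint over all entries matching the given fixed values."""
    return sum(p for (a, b, c), p in joint.items()
               if (x is None or a == x)
               and (m is None or b == m)
               and (y is None or c == y))


def front_door(joint, x):
    """P(Y=1 | do(X=x)) via front-door adjustment, observables only."""
    total = 0.0
    for m in (0, 1):
        p_m_given_x = marginal(joint, x=x, m=m) / marginal(joint, x=x)
        inner = sum(
            marginal(joint, x=xp)
            * marginal(joint, x=xp, m=m, y=1) / marginal(joint, x=xp, m=m)
            for xp in (0, 1))
        total += p_m_given_x * inner
    return total


def naive_conditional(joint, x):
    """P(Y=1 | X=x), which is confounded by U in this model."""
    return marginal(joint, x=x, y=1) / marginal(joint, x=x)


def true_effect(x):
    """P(Y=1 | do(X=x)) computed from the full model, including U."""
    return sum(
        bern(P_M1_GIVEN_X[x], m)
        * sum(bern(P_U1, u) * P_Y1_GIVEN_MU[(m, u)] for u in (0, 1))
        for m in (0, 1))


if __name__ == "__main__":
    joint = observed_joint()
    for x in (0, 1):
        print(x, front_door(joint, x), true_effect(x),
              naive_conditional(joint, x))
```

With exact statistics the front-door value matches the true interventional probability, while the naive conditional does not. The stability questions raised in item (1) concern how such formulas behave once the joint distribution is only estimated from finite samples.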
