NSF-BSF: AF: Small: Identifying Functional Structure in Data

NSF-BSF：AF：小：识别数据中的功能结构

基本信息

批准号：
1909972
负责人：
Leonard Schulman
金额：
$ 40万
依托单位：
California Institute of Technology
依托单位国家：
美国
项目类别：
Standard Grant
财政年份：
2019
资助国家：
美国
起止时间：
2019-10-01 至 2023-09-30
项目状态：
已结题

来源：
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1909972&HistoricalAwards=false
关键词：
NSF BSF AF Small Identifying

项目摘要

Algorithms for unsupervised learning tasks are in constant use in science, engineering, medical research, etc. Progress in optimization procedures for these problems has therefore the potential for very wide adoption. One of the primary research topics of this award is unsupervised learning methods for determining causal effect in scenarios where one cannot, for practical or ethical reasons, perform controlled experiments. At present, the theory for deducing causal effects is somewhat brittle, since it relies upon unrealistic accuracy in the mathematical modeling, and, for large systems, upon unrealistic sample sizes. Scientists therefore do not have very reliable ways of advising the public about "what-if" scenarios. This research aims to provide, and characterize the performance of, causal-identification algorithms with the following goals: (a) Algorithms that work under weaker assumptions than current theory makes (in particular regarding the fidelity of the model to reality), yet deliver still-useful guarantees; (b) Algorithms that work under stronger structural assumptions than current theory makes and can use smaller sample sizes as a result. A related research topic concerns approximation algorithms, and inapproximability bounds, for optimization problems such as data clustering and learning of mixture models. The investigator will train students and postdocs in these and related areas of the theory of computation.Causal-inference problems will be addressed through the framework of directed probabilistic graphical models. Existing causal-identification algorithms operate with a naive parameter size because, for N-variate graphical models, the complexity parameter (and sample complexity) is exponential in N. By contrast all algorithmic methods for clustering, for learning mixtures of product distributions, or for topic models regard the dimension (i.e., the number of variables N) as a parameter that should not appear, or at worst appear poly-logarithmically, in the exponent. This project aims to provide algorithms with such complexity; however, this is not only an algorithmic question. The existing characterization of when causal identification is even possible is stated under the assumption that one has access to full knowledge of the joint distribution on the N variables. Therefore a related, statistical but not algorithmic goal is to obtain a characterization of graphical models in which causal inference is possible only from the joint distributions of sets of variables of size small relative to N. A candidate class of such models comes from mixture models, beginning with mixtures of products and extending to mixtures of Markov models. Another question concerns graphical models in which causal identification is not possible, yet the range of causal effects consistent with the data is small. In such cases, an algorithm which bounds the causal effect is an adequate replacement for full identification; providing such guarantees is another goal of this work. Finally, research will focus on clustering with squared Euclidean costs: both on the inapproximability threshold and in improving approximation factors through linear programming relaxations or other methods.This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

无监督学习任务的算法一直在科学，工程，医学研究等中不断使用。因此，针对这些问题的优化程序的进展具有非常广泛的采用。该奖项的主要研究主题之一是无监督的学习方法，用于确定因实际或道德原因无法执行受控实验的情况下的因果效应。目前，推论因果效应的理论有些脆弱，因为它依赖于数学建模的不切实际准确性，而对于大型系统，依赖于不切实际的样本量。因此，科学家没有非常可靠的方式向公众提供有关“假设”情况的建议。这项研究旨在提供并表征其因果识别算法的性能，其目标以下目标：（a）在假设较弱的假设下起作用的算法（尤其是关于模型对现实的忠诚度），但提供仍然有利用的保证；（b）在结构假设下起作用的算法比当前理论更强的结构假设，因此可以使用较小的样本量。一个相关的研究主题涉及近似算法和不合适的界限，以进行优化问题，例如数据聚类和混合模型的学习。研究者将在计算理论的这些和相关领域培训学生和博士后。提示问题将通过定向概率图形模型的框架来解决。现有的因果识别算法以幼稚的参数大小运行，因为对于n变量图形模型，n中的复杂性参数（和样品复杂性）在n中是指数的。在指数中，多元素。该项目旨在提供如此复杂的算法。但是，这不仅是一个算法问题。假设人们可以访问N变量上的关节分布的全部知识，即使因果鉴定的现有表征甚至可能是可能的。因此，一个相关的统计但不是算法目标是获得图形模型的表征，在该图形模型中，只有从相对于N的大小相对于N的尺寸小变量集的联合分布仅来自于N.候选模型的混合模型，从而从产品混合物开始，并扩展到Markov模型的混合物。另一个问题涉及无法因果鉴定的图形模型，但与数据一致的因果效应范围很小。在这种情况下，界定因果效应的算法是充分识别的足够替代品。提供这种保证是这项工作的另一个目标。最后，研究将集中于以平方的欧几里得成本进行聚类：无论是在无Ximibility阈值上，还是通过线性编程松弛或其他方法来改善近似因素。该奖项反映了NSF的法定任务，并被认为是值得通过基金会的智力和更广泛影响的评估来通过评估来支持的，这是值得通过的支持。