Algorithms for Data Analysis

数据分析算法

基本信息

批准号：
0515342
负责人：
Leonard Schulman
金额：
--
依托单位：
California Institute of Technology
依托单位国家：
美国
项目类别：
Standard Grant
财政年份：
2005
资助国家：
美国
起止时间：
2005-06-01 至 2009-05-31
项目状态：
已结题

来源：
https://www.nsf.gov/awardsearch/showAward?AWD_ID=0515342&HistoricalAwards=false
关键词：
Algorithms Data Analysis

Algorithms Data Analysis

项目摘要

Intellectual merit of the proposed activity. The primary focus of this investigation is the following ubiquitous algorithmic challenge: Given a large data set, provide a "simple" yet "accurate" description of the data.The question is treated at several levels. In the first it is considered as a data-clustering problem. A clustering algorithm (designed with regard to a particular "distortion" that measures dissimilarity of points) partitions the data into clusters so that the distortion between points within a cluster is minimized (i.e., similarity is maximized), while the distortion across clusters is maximized. In the huge databases prevalent in scientific, engineering, business, government and medical applications, "polynomial time algorithms" are not good enough since a typical algorithm running in quadratic or cubic time is too slow to be executed on gigabytes or terabytes of data. This proposal therefore places an emphasis on near-linear time algorithms. A particular novel approach, recently developed by the proposer, is described.In the second treatment, more advanced data analysis problems are addressed. In these problems richer and more flexible models for the data are allowed, such as mixture models of several low-degree surfaces. Data analysis in these models is beyond the reach of current approximation algorithms. The data analysis requires partitioning of the data, and so the NP-completeness of clustering is inherited; but in addition, new difficulties are inherited from applied statistics, such as the issue of missing or incomplete records in real world data. The combination requires solution of fundamental new algorithmic and geometric questions.The third part of the proposal is concerned not with a particular clustering problem but with setting up the right kind of theory for such nonparametric inference problems. The fundamental principle is this: We should seek to cluster only data sets that admit a high-quality clustering. Often, what makes a particular input hard for a clustering algorithm is that its best clustering is little better than average: the input does not separate cleanly into well-separated pieces. A good data analysis should detect this fact, rather than stall on a search for a best yet near-average clustering. Developing a sound theoretical framework for clustering therefore requires departing from the "worst-case analysis" framework that has been dominant so far.The fourth part of the proposal is devoted to theoretical analysis of the EM clustering algorithm and its variants. EM is an iterative heuristic for which there are few performance guarantees. However, it is fast, and widely used in practice. Because of this, an important goal is to determine under what conditions EM performs well, and what conditions require other approaches.The fifth part of the proposal is devoted to a different topic: the limitations imposed on feedbackControl mechanisms because of limited processing power or limited communication capacity. Because of the real-time nature of control tasks, conventional "input-output" circuit complexity is not an appropriate theoretical framework. Two general types of questions are pursued. The first is characterization of the class of stabilization tasks that can be performed by highly parallelized, ultra-fast digital control circuits. The second is the ensuring of reliable and real-time communications among components of a control system in spite of noise on the communication channels. Broader impact of the proposed activity. Data analysis is required in many scientific, engineering, business, government and medical applications. The importance of sifting out useful information from huge databases is such, that advances in data analysis algorithms as well as in the mathematical framework underpinning the design of these algorithms, has significant potential benefit to society.Increasing automation in our technological infrastructure has created a convergence of control, computation and communication technologies. Some of the difficulties associated with carrying out this convergence will be resolved through progress on the questions described in the last section of the proposal.All topics in the proposal are highly suited to graduate, and in some cases undergraduate, research. Interdisciplinary student training in algorithms, complexity, control and information theory is ongoing (by the PI and several colleagues). Increase in this activity is intended during the period of the proposed investigation.

拟议活动的智力优点。这项研究的主要重点是以下无处不在的算法挑战：鉴于大数据集，提供了对数据的“简单”但“准确”的描述。该问题在多个层面上进行了处理。在第一个中，它被视为数据群集问题。聚类算法（针对测量点差异的特定“失真”设计）将数据分配到簇中，以便最大化群集内点之间的变形（即最大化相似性），而跨簇跨簇的失真是最大化的。在科学，工程，商业，政府和医疗应用中普遍存在的巨大数据库中，“多项式时间算法”还不够好，因为在二次或立方时间运行的典型算法太慢，无法在千兆字节或数据数据的数据上执行。因此，该提案重点是近线性时间算法。描述了提议者最近开发的一种特殊的新方法。在第二种处理中，解决了更高级的数据分析问题。在这些问题中，允许更丰富，更灵活的数据模型，例如几个低度表面的混合模型。这些模型中的数据分析超出了当前近似算法的范围。数据分析需要对数据进行分区，因此群集的NP完整性是继承的。但是，此外，新的困难也是从应用统计数据中继承的，例如现实世界数据中缺少或不完整的记录问题。该组合需要解决基本的新算法和几何问题的解决方案。该提案的第三部分不涉及特定的聚类问题，而是针对此类非参数推断问题设置正确的理论。基本原则是：我们应该寻求仅集群的数据集，这些数据集接受高质量的聚类。通常，对于聚类算法而言，特定的输入很难的是，其最佳聚类比平均水平好：输入不会干净地分离为分离良好的部分。良好的数据分析应该检测到这一事实，而不是搜索最佳但接近平均水平的聚类。因此，开发一个合理的理论框架进行聚类需要远离迄今为止占主导地位的“最坏情况分析”框架。该提案的第四部分致力于对EM聚类算法及其变体的理论分析。 em是一种迭代启发式，几乎没有性能保证。但是，它很快，在实践中广泛使用。因此，一个重要的目标是确定在什么条件下表现良好，以及哪些条件需要其他方法。该提案的第五部分专门用于另一个主题：由于处理能力有限或沟通能力有限，因此对反馈控制机制施加的限制。由于控制任务的实时性质，常规的“输入输出”电路复杂性不是一个合适的理论框架。提出了两种一般类型的问题。首先是对高度平行的超快速数字控制电路可以执行的一系列稳定任务的表征。第二个是确保控制系统组件之间可靠和实时的通信，尽管通信渠道上有噪声。拟议活动的更广泛影响。许多科学，工程，商业，政府和医疗应用都需要数据分析。从巨大的数据库中筛选有用信息的重要性是，数据分析算法的进步以及基于这些算法设计的数学框架中的进步，在我们的技术基础结构中为社会提供了重要的潜在利益。在提案的最后一部分中所述的问题上，将解决与执行这种融合有关的一些困难。该提案中的所有主题都非常适合毕业，在某些情况下，本科生的研究。跨学科的学生培训在算法，复杂性，控制和信息理论方面正在进行中（PI和几个同事）。该活动的增加是在拟议的调查期间的。