Algorithms for Data Analysis

数据分析算法

基本信息

  • 批准号:
    0515342
  • 负责人:
  • 金额:
    --
  • 依托单位:
  • 依托单位国家:
    美国
  • 项目类别:
    Standard Grant
  • 财政年份:
    2005
  • 资助国家:
    美国
  • 起止时间:
    2005-06-01 至 2009-05-31
  • 项目状态:
    已结题

项目摘要

Intellectual merit of the proposed activity. The primary focus of this investigation is the following ubiquitous algorithmic challenge: Given a large data set, provide a "simple" yet "accurate" description of the data.The question is treated at several levels. In the first it is considered as a data-clustering problem. A clustering algorithm (designed with regard to a particular "distortion" that measures dissimilarity of points) partitions the data into clusters so that the distortion between points within a cluster is minimized (i.e., similarity is maximized), while the distortion across clusters is maximized. In the huge databases prevalent in scientific, engineering, business, government and medical applications, "polynomial time algorithms" are not good enough since a typical algorithm running in quadratic or cubic time is too slow to be executed on gigabytes or terabytes of data. This proposal therefore places an emphasis on near-linear time algorithms. A particular novel approach, recently developed by the proposer, is described.In the second treatment, more advanced data analysis problems are addressed. In these problems richer and more flexible models for the data are allowed, such as mixture models of several low-degree surfaces. Data analysis in these models is beyond the reach of current approximation algorithms. The data analysis requires partitioning of the data, and so the NP-completeness of clustering is inherited; but in addition, new difficulties are inherited from applied statistics, such as the issue of missing or incomplete records in real world data. The combination requires solution of fundamental new algorithmic and geometric questions.The third part of the proposal is concerned not with a particular clustering problem but with setting up the right kind of theory for such nonparametric inference problems. The fundamental principle is this: We should seek to cluster only data sets that admit a high-quality clustering. Often, what makes a particular input hard for a clustering algorithm is that its best clustering is little better than average: the input does not separate cleanly into well-separated pieces. A good data analysis should detect this fact, rather than stall on a search for a best yet near-average clustering. Developing a sound theoretical framework for clustering therefore requires departing from the "worst-case analysis" framework that has been dominant so far.The fourth part of the proposal is devoted to theoretical analysis of the EM clustering algorithm and its variants. EM is an iterative heuristic for which there are few performance guarantees. However, it is fast, and widely used in practice. Because of this, an important goal is to determine under what conditions EM performs well, and what conditions require other approaches.The fifth part of the proposal is devoted to a different topic: the limitations imposed on feedbackControl mechanisms because of limited processing power or limited communication capacity. Because of the real-time nature of control tasks, conventional "input-output" circuit complexity is not an appropriate theoretical framework. Two general types of questions are pursued. The first is characterization of the class of stabilization tasks that can be performed by highly parallelized, ultra-fast digital control circuits. The second is the ensuring of reliable and real-time communications among components of a control system in spite of noise on the communication channels. Broader impact of the proposed activity. Data analysis is required in many scientific, engineering, business, government and medical applications. The importance of sifting out useful information from huge databases is such, that advances in data analysis algorithms as well as in the mathematical framework underpinning the design of these algorithms, has significant potential benefit to society.Increasing automation in our technological infrastructure has created a convergence of control, computation and communication technologies. Some of the difficulties associated with carrying out this convergence will be resolved through progress on the questions described in the last section of the proposal.All topics in the proposal are highly suited to graduate, and in some cases undergraduate, research. Interdisciplinary student training in algorithms, complexity, control and information theory is ongoing (by the PI and several colleagues). Increase in this activity is intended during the period of the proposed investigation.
拟议活动的智力优点。这项研究的主要重点是以下无处不在的算法挑战:鉴于大数据集,提供了对数据的“简单”但“准确”的描述。该问题在多个层面上进行了处理。在第一个中,它被视为数据群集问题。聚类算法(针对测量点差异的特定“失真”设计)将数据分配到簇中,以便最大化群集内点之间的变形(即最大化相似性),而跨簇跨簇的失真是最大化的。在科学,工程,商业,政府和医疗应用中普遍存在的巨大数据库中,“多项式时间算法”还不够好,因为在二次或立方时间运行的典型算法太慢,无法在千兆字节或数据数据的数据上执行。因此,该提案重点是近线性时间算法。 描述了提议者最近开发的一种特殊的新方法。在第二种处理中,解决了更高级的数据分析问题。在这些问题中,允许更丰富,更灵活的数据模型,例如几个低度表面的混合模型。这些模型中的数据分析超出了当前近似算法的范围。数据分析需要对数据进行分区,因此群集的NP完整性是继承的。但是,此外,新的困难也是从应用统计数据中继承的,例如现实世界数据中缺少或不完整的记录问题。该组合需要解决基本的新算法和几何问题的解决方案。该提案的第三部分不涉及特定的聚类问题,而是针对此类非参数推断问题设置正确的理论。基本原则是:我们应该寻求仅集群的数据集,这些数据集接受高质量的聚类。通常,对于聚类算法而言,特定的输入很难的是,其最佳聚类比平均水平好:输入不会干净地分离为分离良好的部分。良好的数据分析应该检测到这一事实,而不是搜索最佳但接近平均水平的聚类。因此,开发一个合理的理论框架进行聚类需要远离迄今为止占主导地位的“最坏情况分析”框架。该提案的第四部分致力于对EM聚类算法及其变体的理论分析。 em是一种迭代启发式,几乎没有性能保证。但是,它很快,在实践中广泛使用。因此,一个重要的目标是确定在什么条件下表现良好,以及哪些条件需要其他方法。该提案的第五部分专门用于另一个主题:由于处理能力有限或沟通能力有限,因此对反馈控制机制施加的限制。由于控制任务的实时性质,常规的“输入输出”电路复杂性不是一个合适的理论框架。提出了两种一般类型的问题。首先是对高度平行的超快速数字控制电路可以执行的一系列稳定任务的表征。第二个是确保控制系统组件之间可靠和实时的通信,尽管通信渠道上有噪声。拟议活动的更广泛影响。许多科学,工程,商业,政府和医疗应用都需要数据分析。从巨大的数据库中筛选有用信息的重要性是,数据分析算法的进步以及基于这些算法设计的数学框架中的进步,在我们的技术基础结构中为社会提供了重要的潜在利益。在提案的最后一部分中所述的问题上,将解决与执行这种融合有关的一些困难。该提案中的所有主题都非常适合毕业,在某些情况下,本科生的研究。跨学科的学生培训在算法,复杂性,控制和信息理论方面正在进行中(PI和几个同事)。该活动的增加是在拟议的调查期间的。

项目成果

期刊论文数量(0)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)

暂无数据

数据更新时间:2024-06-01

Leonard Schulman的其他基金

NSF-BSF: AF: Small: Algorithmic and Information-Theoretic Challenges in Causal Inference
NSF-BSF:AF:小:因果推理中的算法和信息论挑战
  • 批准号:
    2321079
    2321079
  • 财政年份:
    2023
  • 资助金额:
    --
    --
  • 项目类别:
    Standard Grant
    Standard Grant
NSF-BSF: AF: Small: Identifying Functional Structure in Data
NSF-BSF:AF:小:识别数据中的功能结构
  • 批准号:
    1909972
    1909972
  • 财政年份:
    2019
  • 资助金额:
    --
    --
  • 项目类别:
    Standard Grant
    Standard Grant
AF: Small: Algorithms and Information Theory for Causal Inference
AF:小:因果推理的算法和信息论
  • 批准号:
    1618795
    1618795
  • 财政年份:
    2016
  • 资助金额:
    --
    --
  • 项目类别:
    Standard Grant
    Standard Grant
AF: Small: Algorithms for Inference
AF:小:推理算法
  • 批准号:
    1319745
    1319745
  • 财政年份:
    2013
  • 资助金额:
    --
    --
  • 项目类别:
    Standard Grant
    Standard Grant
AF: EAGER: Algorithms in Linear Algebra and Optimization
AF:EAGER:线性代数和优化算法
  • 批准号:
    1038578
    1038578
  • 财政年份:
    2011
  • 资助金额:
    --
    --
  • 项目类别:
    Continuing Grant
    Continuing Grant
Collaborative Research: EMT/QIS: Quantum Algorithms and Post-Quantum Cryptography
合作研究:EMT/QIS:量子算法和后量子密码学
  • 批准号:
    0829909
    0829909
  • 财政年份:
    2008
  • 资助金额:
    --
    --
  • 项目类别:
    Continuing Grant
    Continuing Grant
SGER: Planning for a Cross-Cutting Initiative in Computational Discovery
SGER:规划计算发现的跨领域计划
  • 批准号:
    0652536
    0652536
  • 财政年份:
    2007
  • 资助金额:
    --
    --
  • 项目类别:
    Standard Grant
    Standard Grant
QnTM: Collaborative Research: Quantum Algorithms
QnTM:协作研究:量子算法
  • 批准号:
    0524828
    0524828
  • 财政年份:
    2005
  • 资助金额:
    --
    --
  • 项目类别:
    Continuing Grant
    Continuing Grant
CAREER: Computation Methods
职业:计算方法
  • 批准号:
    0049092
    0049092
  • 财政年份:
    2000
  • 资助金额:
    --
    --
  • 项目类别:
    Continuing Grant
    Continuing Grant
CAREER: Computation Methods
职业:计算方法
  • 批准号:
    9876172
    9876172
  • 财政年份:
    1999
  • 资助金额:
    --
    --
  • 项目类别:
    Continuing Grant
    Continuing Grant

相似国自然基金

基于海量数据的地震智能检测方法及其在密集地震序列中的应用研究
  • 批准号:
    42374081
  • 批准年份:
    2023
  • 资助金额:
    52.00 万元
  • 项目类别:
    面上项目
基于数据驱动的故障信息挖掘及其在工控系统中的应用研究
  • 批准号:
    62372063
  • 批准年份:
    2023
  • 资助金额:
    50.00 万元
  • 项目类别:
    面上项目
多模态边缘智能的轻量化算法与加速器架构协同优化研究
  • 批准号:
    62302529
  • 批准年份:
    2023
  • 资助金额:
    30.00 万元
  • 项目类别:
    青年科学基金项目
数据驱动下的有限时间李雅普诺夫指数的高效鲁棒算法研究
  • 批准号:
    12371433
  • 批准年份:
    2023
  • 资助金额:
    44.00 万元
  • 项目类别:
    面上项目
大规模字符序列分析的关键技术与高效算法研究
  • 批准号:
    62362016
  • 批准年份:
    2023
  • 资助金额:
    30.00 万元
  • 项目类别:
    地区科学基金项目

相似海外基金

Fluency from Flesh to Filament: Collation, Representation, and Analysis of Multi-Scale Neuroimaging data to Characterize and Diagnose Alzheimer's Disease
从肉体到细丝的流畅性:多尺度神经影像数据的整理、表示和分析,以表征和诊断阿尔茨海默病
  • 批准号:
    10462257
    10462257
  • 财政年份:
    2023
  • 资助金额:
    --
    --
  • 项目类别:
Collaborative Research: III: Medium: Algorithms for scalable inference and phylodynamic analysis of tumor haplotypes using low-coverage single cell sequencing data
合作研究:III:中:使用低覆盖率单细胞测序数据对肿瘤单倍型进行可扩展推理和系统动力学分析的算法
  • 批准号:
    2415562
    2415562
  • 财政年份:
    2023
  • 资助金额:
    --
    --
  • 项目类别:
    Standard Grant
    Standard Grant
Sleep and Cardiometabolic Subgroup Discovery and Risk Prediction in United States Adolescents and Young Adults: A Multi-Study Multi-Domain Analysis of NHANES and NSRR
美国青少年和年轻人的睡眠和心脏代谢亚组发现和风险预测:NHANES 和 NSRR 的多研究多领域分析
  • 批准号:
    10639360
    10639360
  • 财政年份:
    2023
  • 资助金额:
    --
    --
  • 项目类别:
Risk stratifying indeterminate pulmonary nodules with jointly learned features from longitudinal radiologic and clinical big data
利用纵向放射学和临床大数据共同学习的特征对不确定的肺结节进行风险分层
  • 批准号:
    10678264
    10678264
  • 财政年份:
    2023
  • 资助金额:
    --
    --
  • 项目类别:
Investigating the Protective Efficacy of SIV/HIV T and B cell Immunity Induced by RNA Replicons
研究 RNA 复制子诱导的 SIV/HIV T 和 B 细胞免疫的保护功效
  • 批准号:
    10673223
    10673223
  • 财政年份:
    2023
  • 资助金额:
    --
    --
  • 项目类别: