Algorithms for Data Analysis
数据分析算法
基本信息
- 批准号:0515342
- 负责人:
- 金额:--
- 依托单位:
- 依托单位国家:美国
- 项目类别:Standard Grant
- 财政年份:2005
- 资助国家:美国
- 起止时间:2005-06-01 至 2009-05-31
- 项目状态:已结题
- 来源:
- 关键词:
项目摘要
Intellectual merit of the proposed activity. The primary focus of this investigation is the following ubiquitous algorithmic challenge: Given a large data set, provide a "simple" yet "accurate" description of the data.The question is treated at several levels. In the first it is considered as a data-clustering problem. A clustering algorithm (designed with regard to a particular "distortion" that measures dissimilarity of points) partitions the data into clusters so that the distortion between points within a cluster is minimized (i.e., similarity is maximized), while the distortion across clusters is maximized. In the huge databases prevalent in scientific, engineering, business, government and medical applications, "polynomial time algorithms" are not good enough since a typical algorithm running in quadratic or cubic time is too slow to be executed on gigabytes or terabytes of data. This proposal therefore places an emphasis on near-linear time algorithms. A particular novel approach, recently developed by the proposer, is described.In the second treatment, more advanced data analysis problems are addressed. In these problems richer and more flexible models for the data are allowed, such as mixture models of several low-degree surfaces. Data analysis in these models is beyond the reach of current approximation algorithms. The data analysis requires partitioning of the data, and so the NP-completeness of clustering is inherited; but in addition, new difficulties are inherited from applied statistics, such as the issue of missing or incomplete records in real world data. The combination requires solution of fundamental new algorithmic and geometric questions.The third part of the proposal is concerned not with a particular clustering problem but with setting up the right kind of theory for such nonparametric inference problems. The fundamental principle is this: We should seek to cluster only data sets that admit a high-quality clustering. Often, what makes a particular input hard for a clustering algorithm is that its best clustering is little better than average: the input does not separate cleanly into well-separated pieces. A good data analysis should detect this fact, rather than stall on a search for a best yet near-average clustering. Developing a sound theoretical framework for clustering therefore requires departing from the "worst-case analysis" framework that has been dominant so far.The fourth part of the proposal is devoted to theoretical analysis of the EM clustering algorithm and its variants. EM is an iterative heuristic for which there are few performance guarantees. However, it is fast, and widely used in practice. Because of this, an important goal is to determine under what conditions EM performs well, and what conditions require other approaches.The fifth part of the proposal is devoted to a different topic: the limitations imposed on feedbackControl mechanisms because of limited processing power or limited communication capacity. Because of the real-time nature of control tasks, conventional "input-output" circuit complexity is not an appropriate theoretical framework. Two general types of questions are pursued. The first is characterization of the class of stabilization tasks that can be performed by highly parallelized, ultra-fast digital control circuits. The second is the ensuring of reliable and real-time communications among components of a control system in spite of noise on the communication channels. Broader impact of the proposed activity. Data analysis is required in many scientific, engineering, business, government and medical applications. The importance of sifting out useful information from huge databases is such, that advances in data analysis algorithms as well as in the mathematical framework underpinning the design of these algorithms, has significant potential benefit to society.Increasing automation in our technological infrastructure has created a convergence of control, computation and communication technologies. Some of the difficulties associated with carrying out this convergence will be resolved through progress on the questions described in the last section of the proposal.All topics in the proposal are highly suited to graduate, and in some cases undergraduate, research. Interdisciplinary student training in algorithms, complexity, control and information theory is ongoing (by the PI and several colleagues). Increase in this activity is intended during the period of the proposed investigation.
拟议活动的智力价值。这项调查的主要焦点是以下普遍存在的算法挑战:给定一个大数据集,提供数据的“简单”但“准确”的描述。该问题在多个层面上进行处理。首先,它被视为数据聚类问题。聚类算法(针对测量点不相似性的特定“失真”而设计)将数据划分为簇,以便使簇内点之间的失真最小化(即相似性最大化),同时使簇间的失真最大化。在科学、工程、商业、政府和医疗应用中普遍存在的大型数据库中,“多项式时间算法”还不够好,因为在二次或三次时间中运行的典型算法太慢,无法在千兆字节或太字节的数据上执行。因此,该提案强调近线性时间算法。 描述了提议者最近开发的一种特殊的新颖方法。在第二种处理中,解决了更高级的数据分析问题。在这些问题中,允许使用更丰富、更灵活的数据模型,例如多个低阶曲面的混合模型。这些模型中的数据分析超出了当前近似算法的能力范围。数据分析需要对数据进行划分,因此继承了聚类的NP完备性;但除此之外,应用统计也带来了新的困难,例如现实世界数据中记录缺失或不完整的问题。这种结合需要解决基本的新算法和几何问题。该提案的第三部分不涉及特定的聚类问题,而是为此类非参数推理问题建立正确的理论。基本原则是:我们应该只对那些能够进行高质量聚类的数据集进行聚类。通常,特定输入对于聚类算法来说很困难,因为它的最佳聚类效果比平均水平好不了多少:输入不能干净地分成分离良好的部分。良好的数据分析应该检测到这一事实,而不是停止寻找最佳但接近平均的聚类。因此,开发一个完善的聚类理论框架需要摆脱迄今为止占主导地位的“最坏情况分析”框架。该提案的第四部分致力于 EM 聚类算法及其变体的理论分析。 EM 是一种迭代启发式算法,几乎没有性能保证。然而,它速度快,并且在实践中被广泛使用。因此,一个重要的目标是确定 EM 在什么条件下表现良好,以及什么条件需要其他方法。提案的第五部分致力于不同的主题:由于处理能力有限或有限而对反馈控制机制施加的限制通讯能力。由于控制任务的实时性,传统的“输入-输出”电路复杂性并不是合适的理论框架。提出两种一般类型的问题。第一个是可以由高度并行、超快速数字控制电路执行的稳定任务类别的特征。第二个是确保控制系统组件之间的可靠和实时通信,尽管通信通道上存在噪声。拟议活动的更广泛影响。许多科学、工程、商业、政府和医疗应用都需要数据分析。从庞大的数据库中筛选出有用信息的重要性在于,数据分析算法以及支撑这些算法设计的数学框架的进步,对社会具有巨大的潜在利益。我们技术基础设施的自动化程度不断提高,创造了融合控制、计算和通信技术。与实现这种融合相关的一些困难将通过提案最后一部分中描述的问题的进展得到解决。提案中的所有主题都非常适合研究生(有时甚至是本科生)的研究。算法、复杂性、控制和信息论方面的跨学科学生培训正在进行中(由 PI 和几位同事进行)。计划在拟议调查期间增加此类活动。
项目成果
期刊论文数量(0)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)
数据更新时间:{{ journalArticles.updateTime }}
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
数据更新时间:{{ journalArticles.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ monograph.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ sciAawards.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ conferencePapers.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ patent.updateTime }}
Leonard Schulman其他文献
Leonard Schulman的其他文献
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
{{ truncateString('Leonard Schulman', 18)}}的其他基金
NSF-BSF: AF: Small: Algorithmic and Information-Theoretic Challenges in Causal Inference
NSF-BSF:AF:小:因果推理中的算法和信息论挑战
- 批准号:
2321079 - 财政年份:2023
- 资助金额:
-- - 项目类别:
Standard Grant
NSF-BSF: AF: Small: Identifying Functional Structure in Data
NSF-BSF:AF:小:识别数据中的功能结构
- 批准号:
1909972 - 财政年份:2019
- 资助金额:
-- - 项目类别:
Standard Grant
AF: Small: Algorithms and Information Theory for Causal Inference
AF:小:因果推理的算法和信息论
- 批准号:
1618795 - 财政年份:2016
- 资助金额:
-- - 项目类别:
Standard Grant
AF: EAGER: Algorithms in Linear Algebra and Optimization
AF:EAGER:线性代数和优化算法
- 批准号:
1038578 - 财政年份:2011
- 资助金额:
-- - 项目类别:
Continuing Grant
Collaborative Research: EMT/QIS: Quantum Algorithms and Post-Quantum Cryptography
合作研究:EMT/QIS:量子算法和后量子密码学
- 批准号:
0829909 - 财政年份:2008
- 资助金额:
-- - 项目类别:
Continuing Grant
SGER: Planning for a Cross-Cutting Initiative in Computational Discovery
SGER:规划计算发现的跨领域计划
- 批准号:
0652536 - 财政年份:2007
- 资助金额:
-- - 项目类别:
Standard Grant
QnTM: Collaborative Research: Quantum Algorithms
QnTM:协作研究:量子算法
- 批准号:
0524828 - 财政年份:2005
- 资助金额:
-- - 项目类别:
Continuing Grant
相似国自然基金
面向超大图数据分析的多样本分布式计算方法与算法研究
- 批准号:62372308
- 批准年份:2023
- 资助金额:50 万元
- 项目类别:面上项目
癌症单细胞DNA测序数据分析算法研究
- 批准号:
- 批准年份:2022
- 资助金额:54 万元
- 项目类别:面上项目
基于RNA测序数据的转录组拼接算法设计与分析
- 批准号:
- 批准年份:2022
- 资助金额:54 万元
- 项目类别:面上项目
基于深度神经网络的复杂图数据分析算法研究
- 批准号:12271111
- 批准年份:2022
- 资助金额:46 万元
- 项目类别:面上项目
面向大规模网络性能数据分析的并行张量填充算法研究
- 批准号:
- 批准年份:2022
- 资助金额:30 万元
- 项目类别:青年科学基金项目
相似海外基金
Fluency from Flesh to Filament: Collation, Representation, and Analysis of Multi-Scale Neuroimaging data to Characterize and Diagnose Alzheimer's Disease
从肉体到细丝的流畅性:多尺度神经影像数据的整理、表示和分析,以表征和诊断阿尔茨海默病
- 批准号:
10462257 - 财政年份:2023
- 资助金额:
-- - 项目类别:
Collaborative Research: III: Medium: Algorithms for scalable inference and phylodynamic analysis of tumor haplotypes using low-coverage single cell sequencing data
合作研究:III:中:使用低覆盖率单细胞测序数据对肿瘤单倍型进行可扩展推理和系统动力学分析的算法
- 批准号:
2415562 - 财政年份:2023
- 资助金额:
-- - 项目类别:
Standard Grant
Sleep and Cardiometabolic Subgroup Discovery and Risk Prediction in United States Adolescents and Young Adults: A Multi-Study Multi-Domain Analysis of NHANES and NSRR
美国青少年和年轻人的睡眠和心脏代谢亚组发现和风险预测:NHANES 和 NSRR 的多研究多领域分析
- 批准号:
10639360 - 财政年份:2023
- 资助金额:
-- - 项目类别:
Risk stratifying indeterminate pulmonary nodules with jointly learned features from longitudinal radiologic and clinical big data
利用纵向放射学和临床大数据共同学习的特征对不确定的肺结节进行风险分层
- 批准号:
10678264 - 财政年份:2023
- 资助金额:
-- - 项目类别:
Investigating the Protective Efficacy of SIV/HIV T and B cell Immunity Induced by RNA Replicons
研究 RNA 复制子诱导的 SIV/HIV T 和 B 细胞免疫的保护功效
- 批准号:
10673223 - 财政年份:2023
- 资助金额:
-- - 项目类别: