CAREER: Distilling information structure from big and dirty data: Efficient learning of clusters and graphs in modern datasets


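Basic information

  • Award number:
    1252412
  • Principal investigator:
  • Amount:
    $500,000
  • Host institution:
  • Host institution country:
    United States
  • Project category:
    Continuing Grant
  • Fiscal year:
    2013
  • Funding country:
    United States
  • Period:
    2013-03-01 to 2018-02-28
  • Project status:
    Completed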

Project abstract

This CAREER project aims to advance the state of the art in theory and methods for extracting clusters and graphs from big and dirty datasets arising in modern application domains. Clusters and graphs provide a meaningful representation of the structure of information contained in data. For example, in the neuroscience and healthcare domains, clustering patients with similar phenotypes and genotypes helps identify target groups for drug design, clustering fiber tracks generated by high-resolution Digital Surface Imaging (DSI) scans of brains helps identify significant neural pathways, and graph structures can reflect connectivity between brain regions. The results of this work will significantly enhance the ability to exploit such modern datasets through new methods for learning clusters and graphs from data that is large-scale, high-dimensional, under-sampled, corrupted, and often only available in a compressed or streaming representation. Specifically, this project will develop computationally efficient and principled methods for learning clusters and graphs that can (i) perform unsupervised feature selection to discard irrelevant features in high dimensions, (ii) leverage feedback based on intelligent adaptive queries that focus resources on the most informative variables and features, (iii) use compressive measurement design that adapts to the information structure for measurement and computation efficiency, and (iv) handle noisy streaming data. The algorithms will be accompanied by performance guarantees in the form of a precise characterization of the mis-clustering rate and graph recovery error. Additionally, the project will investigate the tradeoffs between the number of measurements, computational complexity, and robustness in these problems. The methods and theory developed will be evaluated through simulations as well as through their applicability to real datasets in the neuroscience and healthcare domains, in collaboration with practitioners from these fields.
The results of this research could potentially transform many application domains that involve grouping similar variables and learning complex interactions between them, based on big and dirty datasets. In particular, the neuroscience and healthcare applications are likely to have very direct and significant implications for society. Accurately mapping neural pathways will help diagnose and treat brain pathologies at an early stage, and help understand brain functioning. Clustering patients and discovering disease spreading pathways based on few measurements of relevant genetic features or indicators could help prevent and cure diseases, and also minimize healthcare costs. The research activities will be tightly integrated with education efforts that aim to develop a diverse workforce that is better equipped with cross-disciplinary tools to address the challenges of modern datasets. The education plan includes development of two inter-disciplinary courses, and enhancement of the joint Statistics & Machine Learning PhD program at Carnegie Mellon University (CMU). Outreach activities include promoting undergraduate research and broadening participation of women and underrepresented groups in STEM fields through OurCS (Opportunities for Undergraduate Research in Computer Science), Andrew's Leap (a summer enrichment program for area high school and middle school students), and the CS4HS program aimed at high school and K-8 teachers at Carnegie Mellon University. The results of this project (including publications, data sets, and software) will be disseminated online at http://www.cs.cmu.edu/~aarti/research_projects/.

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Aarti Singh

Computationally Feasible Near-Optimal Subset Selection for Linear Regression under Measurement Constraints
  • DOI:
  • Publication year:
    2016
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yining Wang;Adams Wei Yu;Aarti Singh
  • Corresponding author:
    Aarti Singh
Scope of Automation in Semantics-Driven Multimedia Information Retrieval From Web
  • DOI:
    10.4018/978-1-5225-2483-0.ch001
  • Publication year:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    Aarti Singh;N. Dey;A. Ashour
  • Corresponding author:
    A. Ashour
Supercritical carbon dioxide extraction of essential oils from leaves of Eucalyptus globulus L., their analysis and application
  • DOI:
    10.1039/c5ay02009c
  • Publication year:
    2016
  • Journal:
  • Impact factor:
    3.1
  • Authors:
    Aarti Singh;Anees Ahmad;R. Bushra
  • Corresponding author:
    R. Bushra
Minimax rates for homology inference
Worldwide Macroeconomic Stability and Monetary Policy Rules
  • DOI:
    10.2139/ssrn.906863
  • Publication year:
    2007
  • Journal:
  • Impact factor:
    0
  • Authors:
    J. Bullard;Aarti Singh
  • Corresponding author:
    Aarti Singh

Other grants held by Aarti Singh

AI Institute for Societal Decision Making (AI-SDM)
  • Award number:
    2229881
  • Fiscal year:
    2023
  • Funding amount:
    $500,000
  • Project category:
    Cooperative Agreement
Collaborative Research: New Perspectives on Deep Learning: Bridging Approximation, Statistical, and Algorithmic Theories
  • Award number:
    2134133
  • Fiscal year:
    2021
  • Funding amount:
    $500,000
  • Project category:
    Standard Grant
QuBBD: Collaborative Research: Personalized Predictive Neuromarkers for Stress-Related Health Risks
  • Award number:
    1557572
  • Fiscal year:
    2015
  • Funding amount:
    $500,000
  • Project category:
    Standard Grant
15th IMS New Researchers Conference
  • Award number:
    1301845
  • Fiscal year:
    2013
  • Funding amount:
    $500,000
  • Project category:
    Standard Grant
BIGDATA: Mid-Scale: DA: Distribution-based machine learning for high dimensional datasets
  • Award number:
    1247658
  • Fiscal year:
    2013
  • Funding amount:
    $500,000
  • Project category:
    Continuing Grant
III: Small: Spectral Methods for Active Clustering and Bi-Clustering
  • Award number:
    1116458
  • Fiscal year:
    2011
  • Funding amount:
    $500,000
  • Project category:
    Standard Grant

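Similar grants from the National Natural Science Foundation of China

Knowledge-distillation-based multimodal precise subtyping of gastric cancer from imaging and pathology when molecular information is missing
  • Award number:
    82303950
  • Approval year:
    2023
  • Funding amount:
    300,000 CNY
  • Project category:
    Young Scientists Fund
Anti-wetting mechanisms of high-flux hydrophobic membranes with vertical through-pores for membrane distillation
  • Award number:
    22308293
  • Approval year:
    2023
  • Funding amount:
    300,000 CNY
  • Project category:
    Young Scientists Fund
Key technologies for computer-aided epilepsy diagnosis combining "fuzzy-deep" data augmentation and knowledge distillation
  • Award number:
    62376094
  • Approval year:
    2023
  • Funding amount:
    500,000 CNY
  • Project category:
    General Program
Construction of thermoresponsive superamphiphobic PVDF membranes for membrane distillation and their anti-fouling mechanisms
  • Award number:
    52303125
  • Approval year:
    2023
  • Funding amount:
    300,000 CNY
  • Project category:
    Young Scientists Fund
Forecasting small- and medium-scale severe weather via knowledge distillation from pretrained meteorological models
  • Award number:
    62306028
  • Approval year:
    2023
  • Funding amount:
    300,000 CNY
  • Project category:
    Young Scientists Fund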

Similar overseas grants

DASS: Distilling Software Design Principles from Cybersecurity Caselaw
  • Award number:
    2217597
  • Fiscal year:
    2023
  • Funding amount:
    $500,000
  • Project category:
    Interagency Agreement
Distilling melodies with algorithms
  • Award number:
    572149-2022
  • Fiscal year:
    2022
  • Funding amount:
    $500,000
  • Project category:
    University Undergraduate Student Research Awards
Distilling the relationship of parental psychiatric illness to offspring productivity and social outcomes: evidence base for preventive strategies
  • Award number:
    10708878
  • Fiscal year:
    2022
  • Funding amount:
    $500,000
  • Project category:
Distilling the relationship of parental psychiatric illness to offspring productivity and social outcomes: evidence base for preventive strategies
  • Award number:
    10506724
  • Fiscal year:
    2022
  • Funding amount:
    $500,000
  • Project category:
Production, formulation and consumer testing of an organic extract of waste bananas which boosts the efficiency of distilling and brewing fermentations
  • Award number:
    90583
  • Fiscal year:
    2021
  • Funding amount:
    $500,000
  • Project category:
    Collaborative R&D