Optimization Techniques for Geometrizing Real-World Data

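Basic information

  • Award number:
    2044349
  • Principal investigator:
  • Amount:
    $28,200
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2020
  • Funding country:
    United States
  • Project period:
    2020-06-01 to 2021-07-31
  • Project status:
    Completed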

Project abstract

Data is a common denominator across scientific fields, governments, and private enterprises. Being able to exploit data to find patterns has produced scientific breakthroughs and shifted business paradigms in the last several decades. This project focuses on mathematical and algorithmic techniques for specific data science problems, tailored to currently relevant domain problems, technologies, and volumes of data. The theoretical problems we consider are (i) clustering (which essentially consists of grouping data according to similarity in an unsupervised way), (ii) dimensionality reduction (reducing the volume of the data while preserving its relevant features), and (iii) quadratic assignment (finding correspondences between different datasets). The main underlying application we consider in this project is computational biology, in particular the processing of single-cell sequencing data. The technology for single-cell sequencing was developed very recently and is improving quickly, producing new datasets, problems, and challenges that are interesting from a mathematical point of view and have potentially enormous impact. The project will have mathematicians working closely with computational biologists to identify data science problems arising in the scientific domain and to develop appropriate algorithms and mathematical tools.

Given single-cell genetic expression data indicating how many times each gene is expressed in each cell, one objective is to select a few genes that can be used to identify different classes of cells. This problem is known in the computational biology literature as genetic marker selection. In a first approach we assume the class of each cell is known, so the problem can be posed as supervised dimensionality reduction. We model it as a projection factor recovery problem, and we approach it using optimization tools such as semidefinite and linear programming. The objective is twofold: we aim to study mathematical properties of the model we devise, and to develop an efficient tool to be used by practitioners. The second stage of the project makes the problem unsupervised, so clustering becomes a fundamental step. We will study stability properties of clustering methods and we will provide an efficient algorithm to evaluate the quality of clusters, based on statistical and optimization techniques. This tool is potentially useful across data science, not only for gene expression datasets. Finally, a third objective is to align datasets coming from different experiments. This problem is ubiquitous in data science, with graph matching and shape matching as particular cases. In the context of computational biology the alignment problem is known as batch correction, and it can be modeled with optimal transport or as a quadratic assignment problem. We will develop alignment algorithms and study their convergence and recovery properties under different data models.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
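The alignment objective mentioned in the abstract can be illustrated with a toy sketch. The code below is not the project's algorithm: it is a brute-force solver for a small quadratic assignment instance, under the assumption that two datasets contain the same number of points and differ only by a relabeling and a rigid shift. The function names (`pairwise_distances`, `qap_cost`, `align_by_brute_force`) and the example data are hypothetical and chosen only for illustration.

```python
from itertools import permutations
from math import dist


def pairwise_distances(points):
    """Return the full matrix of Euclidean distances between points."""
    return [[dist(p, q) for q in points] for p in points]


def qap_cost(dist_a, dist_b, perm):
    """Quadratic assignment cost of matching point i in A to point perm[i] in B.

    The cost is the sum over all pairs (i, j) of the squared difference
    between the distance in A and the distance between the matched points in B.
    """
    n = len(perm)
    return sum(
        (dist_a[i][j] - dist_b[perm[i]][perm[j]]) ** 2
        for i in range(n)
        for j in range(n)
    )


def align_by_brute_force(points_a, points_b):
    """Try every permutation and return the best matching with its cost.

    Only feasible for a handful of points, since there are n! permutations.
    """
    if len(points_a) != len(points_b):
        raise ValueError("both datasets must have the same number of points")
    dist_a = pairwise_distances(points_a)
    dist_b = pairwise_distances(points_b)
    best = min(
        permutations(range(len(points_a))),
        key=lambda perm: qap_cost(dist_a, dist_b, perm),
    )
    return list(best), qap_cost(dist_a, dist_b, best)


# Hypothetical example: batch_b is a shuffled, shifted copy of batch_a.
batch_a = [(0, 0), (1, 0), (0, 3), (5, 5)]
shuffle = [2, 0, 3, 1]
batch_b = [(batch_a[k][0] + 10, batch_a[k][1] - 2) for k in shuffle]

matching, cost = align_by_brute_force(batch_a, batch_b)
print(matching, cost)
```

Because the search visits all n! permutations, this sketch only scales to a few points. That cost is one reason the quadratic assignment problem is usually attacked through relaxations or alternative formulations such as the optimal transport model the abstract mentions.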

Project outcomes

Journal papers (9)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Cibercoloquio latinoamericano de matemáticas
Fitting Very Flexible Models: Linear Regression With Large Numbers of Parameters
A Short Tutorial on The Weisfeiler-Lehman Test And Its Variants
SqueezeFit: Label-Aware Dimensionality Reduction by Semidefinite Programming
  • DOI:
    10.1109/tit.2019.2962681
  • Publication date:
    2020-06-01
  • Journal:
  • Impact factor:
    2.5
  • Authors:
    McWhirter, Culver; Mixon, Dustin G.; Villar, Soledad
  • Corresponding author:
    Villar, Soledad
Experimental performance of graph neural networks on random instances of max-cut
  • DOI:
    10.1117/12.2529608
  • Publication date:
    2019-08
  • Journal:
  • Impact factor:
    0
  • Authors:
    Weichi Yao; A. Bandeira; Soledad Villar
  • Corresponding authors:
    Weichi Yao; A. Bandeira; Soledad Villar

Other publications by Soledad Villar

Manifold optimization for k-means clustering
A polynomial-time relaxation of the Gromov-Hausdorff distance
  • DOI:
  • Publication date:
    2016
  • Journal:
  • Impact factor:
    0
  • Authors:
    Soledad Villar; A. Bandeira; A. Blumberg; Rachel A. Ward
  • Corresponding author:
    Rachel A. Ward
MarkerMap: nonlinear marker selection for single-cell studies
  • DOI:
    10.1038/s41540-024-00339-3
  • Publication date:
    2022
  • Journal:
  • Impact factor:
    4
  • Authors:
    Nabeel Sarwar; Wilson Gregory; George A. Kevrekidis; Soledad Villar; Bianca Dumitrascu
  • Corresponding author:
    Bianca Dumitrascu
Shuffled linear regression through graduated convex relaxation
  • DOI:
    10.48550/arxiv.2209.15608
  • Publication date:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    Efe Onaran; Soledad Villar
  • Corresponding author:
    Soledad Villar
Learning Structured Representations with Equivariant Contrastive Learning
  • DOI:
  • Publication date:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Sharut Gupta; Joshua Robinson; Derek Lim; Soledad Villar; S. Jegelka
  • Corresponding author:
    S. Jegelka

Other grants awarded to Soledad Villar

CAREER: Symmetries and Classical Physics in Machine Learning for Science and Engineering
  • Award number:
    2339682
  • Fiscal year:
    2024
  • Award amount:
    $28,200
  • Award type:
    Continuing Grant
Collaborative Research: CIF: Medium: Understanding Robustness via Parsimonious Structures.
  • Award number:
    2212457
  • Fiscal year:
    2022
  • Award amount:
    $28,200
  • Award type:
    Standard Grant
Optimization Techniques for Geometrizing Real-World Data
  • Award number:
    1913134
  • Fiscal year:
    2019
  • Award amount:
    $28,200
  • Award type:
    Standard Grant

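Similar grants from the National Natural Science Foundation of China (NSFC)

Structural topology optimization techniques based on unstructured isogeometric analysis
  • Award number:
  • Approval year:
    2022
  • Award amount:
    300,000 CNY
  • Award type:
    Young Scientists Fund
Structural topology optimization techniques based on unstructured isogeometric analysis
  • Award number:
    52205267
  • Approval year:
    2022
  • Award amount:
    300,000 CNY
  • Award type:
    Young Scientists Fund
Research on key technologies for autonomous online intelligent scanning of large unknown outdoor scenes
  • Award number:
    61902254
  • Approval year:
    2019
  • Award amount:
    250,000 CNY
  • Award type:
    Young Scientists Fund
Research on theory and methods of curve and surface representation for CAD/CNC
  • Award number:
    61872332
  • Approval year:
    2018
  • Award amount:
    630,000 CNY
  • Award type:
    General Program
Research on product specification modeling theory based on discretized geometric form error models
  • Award number:
    51305006
  • Approval year:
    2013
  • Award amount:
    250,000 CNY
  • Award type:
    Young Scientists Fund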

Similar overseas grants

Postdoctoral Fellowship: OPP-PRF: Leveraging Community Structure Data and Machine Learning Techniques to Improve Microbial Functional Diversity in an Arctic Ocean Ecosystem Model
  • Award number:
    2317681
  • Fiscal year:
    2024
  • Award amount:
    $28,200
  • Award type:
    Standard Grant
RII Track-4:NSF: Design of zeolite-encapsulated metal phthalocyanines catalysts enabled by insights from synchrotron-based X-ray techniques
  • Award number:
    2327267
  • Fiscal year:
    2024
  • Award amount:
    $28,200
  • Award type:
    Standard Grant
CAREER: Data-Driven Hardware and Software Techniques to Enable Sustainable Data Center Services
  • Award number:
    2340042
  • Fiscal year:
    2024
  • Award amount:
    $28,200
  • Award type:
    Continuing Grant
Creating a reflective, assessment workbook for University teachers to enhance teaching techniques and improve student engagement, by incorporating International Baccalaureate (IB) teaching practices
  • Award number:
    24K06129
  • Fiscal year:
    2024
  • Award amount:
    $28,200
  • Award type:
    Grant-in-Aid for Scientific Research (C)
Developing Advanced Cryptanalysis Techniques for Symmetric-key Primitives with Real-world Public-key Applications
  • Award number:
    24K20733
  • Fiscal year:
    2024
  • Award amount:
    $28,200
  • Award type:
    Grant-in-Aid for Early-Career Scientists