CAREER: Leveraging Randomization and Structure in Computational Linear Algebra for Data Science

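Basic information

Award number:
2338655
Principal investigator:
Michal Derezinski
Amount:
$649,400
Host institution:
Regents of the University of Michigan - Ann Arbor
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2024
Funding country:
United States
Project period:
2024-05-01 to 2029-04-30
Project status:
Ongoing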

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2338655&HistoricalAwards=false
Keywords:
CAREER Leveraging Randomization Structure Computational

Project abstract

Data science plays a central role in addressing societal challenges, such as healthcare, climate change, and urban planning. At the core of nearly all developments in algorithms for data science is computational linear algebra, an area that concerns the study of algorithms for solving ubiquitous problems involving matrices and other linear-algebraic objects that are used to represent data. With ever-increasing data sizes, randomization has become a key technique for developing efficient algorithms in computational linear algebra. Yet, there is a significant gap between the theory and practice of these algorithms, which has slowed their practical adoption in data science applications. This project identifies key challenges and puts forward new directions towards providing the algorithmic foundations necessary to ensure that a broad scope of randomized linear algebra algorithms are successfully deployed across computational data science over the next decade. This project leverages fundamental interdisciplinary ideas at the intersection of theoretical computer science, machine learning, statistics, and nonlinear optimization. In addition to developing the theoretical foundations, one of the key aims driving the project is to facilitate ongoing implementation efforts aimed at incorporating randomization into LAPACK, the default computational linear algebra software package in machine learning, engineering, statistics, and scientific computing for the past thirty years. At the core of the project is an integrated education plan focused on helping students to gain an interdisciplinary skillset at the intersection of algorithmic foundations and data science. The project also involves outreach to students from three underresourced high schools in Michigan through a collaboration with the university's Engineering Pathways program.

The project’s objectives are to close the theory-practice gap in using randomization to design improved algorithms for ubiquitous matrix problems such as matrix multiplication, solving linear systems, and low-rank approximation. The project identifies three major thrusts, namely (1) reformulating optimal matrix sketching via black-box sampling methods; (2) randomized iterative refinement algorithms via stochastic optimization; (3) a study of robustness of randomized numerical linear algebra algorithms to preserve certain structural elements of data. The matrix sketch, i.e., a small randomized approximation of the input data, is a key foundational component of these algorithms. The project aims to develop new algorithmic and theoretical approaches towards ensuring the control and reliability of the output produced by matrix sketching and sub-sampling, which is especially challenging when dealing with randomization and will be critical for successful software integration. Building on these tools, the project pursues new approaches for designing high-precision algorithms solving linear systems and quadratic problems, by exploring techniques that lie in the unexplored regime between deterministic iterative solvers and stochastic optimization. Finally, the project aims to contribute to a unified understanding of randomized matrix approximation algorithms that preserve the structure of the data, which is essential for feature selection, experimental design, interpretability, and more.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
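
As a rough illustration of what sub-sampling for matrix multiplication can look like, the following minimal sketch estimates the product A^T B from a small random sample of rows. It is not taken from the project: it uses a textbook norm-based row-sampling estimator, and the names `sampled_matmul`, `exact_matmul`, and `relative_error` are hypothetical helpers introduced only for this example. Inputs are assumed to be lists of rows with at least one nonzero row product.

```python
import random


def sampled_matmul(A, B, s, rng):
    """Estimate A^T B from s rows sampled with replacement.

    A is n x d and B is n x p, both given as lists of rows. Row i is drawn
    with probability proportional to ||a_i|| * ||b_i||, and each sampled
    outer product is rescaled by 1 / (s * p_i) so the estimate is unbiased.
    """
    n, d, p = len(A), len(A[0]), len(B[0])

    def norm(row):
        return sum(x * x for x in row) ** 0.5

    weights = [norm(A[i]) * norm(B[i]) for i in range(n)]
    total = sum(weights)  # assumed nonzero in this sketch
    probs = [w / total for w in weights]

    estimate = [[0.0] * p for _ in range(d)]
    for i in rng.choices(range(n), weights=probs, k=s):
        scale = 1.0 / (s * probs[i])
        for j in range(d):
            for k in range(p):
                estimate[j][k] += scale * A[i][j] * B[i][k]
    return estimate


def exact_matmul(A, B):
    """Compute A^T B exactly, for comparison."""
    n, d, p = len(A), len(A[0]), len(B[0])
    return [[sum(A[i][j] * B[i][k] for i in range(n)) for k in range(p)]
            for j in range(d)]


def relative_error(X, Y):
    """Frobenius-norm error of X relative to the reference Y."""
    num = sum((x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry)) ** 0.5
    den = sum(y * y for ry in Y for y in ry) ** 0.5
    return num / den


# Illustrative data: a tall random matrix, approximating its Gram matrix A^T A.
rng = random.Random(0)
A = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2000)]
reference = exact_matmul(A, A)
for s in (20, 200, 1000):
    approx = sampled_matmul(A, A, s, rng)
    print(s, relative_error(approx, reference))
```

The sampled rows, rescaled, play the role of the "small randomized approximation of the input data" described above. The output varies with the random draw, which is one reason controlling the reliability of such estimates matters for software integration.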

Project outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Michal Derezinski

Surrogate-based Autotuning for Randomized Sketching Algorithms in Regression Problems

DOI:
10.48550/arxiv.2308.15720
Publication year:
2023
Journal:
ArXiv
Impact factor:
0
Authors:
Younghyun Cho;J. Demmel;Michal Derezinski;Haoyun Li;Hengrui Luo;Michael W. Mahoney;Riley Murray
Corresponding author:
Riley Murray