XTRIPODS: Algorithms and Machine Learning in Data Intensive Models

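Basic information

Award number:
2342527
Principal investigator:
Hoa Vu
Amount:
$200,000
Awardee institution:
San Diego State University Foundation
Awardee institution country:
United States
Award type:
Standard Grant
Fiscal year:
2024
Funding country:
United States
Start and end dates:
2024-02-15 to 2026-01-31
Project status:
Ongoing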

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2342527&HistoricalAwards=false
Keywords:
XTRIPODS Algorithms Machine Learning Data

Project abstract

Large datasets have emerged within numerous scientific disciplines, unveiling valuable insights and helping to develop various useful applications. However, they also pose several challenges due to their ever-growing size and dynamic nature. Often, these datasets are processed as data streams or distributed across multiple machines. Sketching and streaming algorithms have been successful in tackling many problems in these settings, ranging from data analysis and network algorithms to optimization. One research objective of this project is to further improve these algorithms, in terms of time and memory efficiency, with the aid of machine learning predictions. This project will also apply sketching techniques to develop federated machine learning algorithms where data is distributed across machines or devices, offering privacy advantages due to their decentralized nature. The project also aims to improve the foundation of data science and computer science education at San Diego State University and in the community at large through collaboration with the TRIPODS EnCORE Institute at UC San Diego.

Traditional worst-case analysis ignores the underlying structure of the data. By using machine learning to uncover that structure, it becomes possible in many cases to design better algorithms. The investigator plans to improve the efficiency of existing sketching and streaming algorithms using machine learning. These improvements are in terms of space and time complexity as well as approximation quality. A wide range of problems in this paradigm, including data summarization, graph theory, and combinatorial optimization, will be considered. Additionally, the investigator plans to utilize sketching to aid the design of machine learning algorithms in distributed and federated settings. Data sketches offer several advantages for this task. They have a small memory footprint and can be merged to form a sketch of the combined data. Additionally, they reveal minimal information about local data, benefiting privacy. The investigator aims to employ sketching algorithms on various problems such as building boosted decision trees for classification and regression, and learning a Bayesian network to explain the data. The investigator will also develop new computer science courses at San Diego State University to improve data science education and collaborate with the TRIPODS EnCORE Institute at UC San Diego to expand a summer boot camp for high school students focusing on data science.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
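
The abstract notes that data sketches have a small memory footprint and can be merged to form a sketch of the combined data. As a rough illustration only, the sketch below uses a Count-Min sketch, a standard frequency sketch chosen here as an assumption; the abstract does not name the specific sketches the project will use. The class name `CountMinSketch`, the helper `build_sketch`, the default `width` and `depth`, and the sample device data are all hypothetical.

```python
import hashlib


class CountMinSketch:
    """Count-Min sketch: approximate item counts in fixed memory."""

    def __init__(self, width=272, depth=5):
        # width and depth are illustrative defaults, not tuned values.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum over rows
        # is the tightest available (over)estimate of the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

    def merge(self, other):
        # Two sketches with the same shape and hash functions combine
        # by adding their tables cell by cell.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have identical dimensions")
        merged = CountMinSketch(self.width, self.depth)
        for row in range(self.depth):
            merged.table[row] = [a + b for a, b in
                                 zip(self.table[row], other.table[row])]
        return merged


def build_sketch(items):
    """Build a sketch from an iterable of items (hypothetical helper)."""
    sketch = CountMinSketch()
    for item in items:
        sketch.add(item)
    return sketch


# Hypothetical local data held on two separate devices.
device_a = ["x", "y", "x"]
device_b = ["x", "z"]

# Each device shares only its sketch, not its raw items.
merged = build_sketch(device_a).merge(build_sketch(device_b))
combined = build_sketch(device_a + device_b)
```

Under these assumptions, `merged.table` equals `combined.table`, because the sketch is a linear function of the item counts. This is the property that lets a coordinator combine per-device summaries without seeing the underlying data. Sharing only a sketch is not, by itself, a formal privacy guarantee.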
