CNS Core: Small: Toward Globally-Optimal Resource Distribution and Computation Acceleration in Multi-Tenant and Heterogeneous Machine Learning Systems

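Basic Information

Award number: 2008248
Principal investigator: Eric Xing
Amount: approximately $499,900
Awardee institution: Carnegie-Mellon University
Institution country: United States
Award type: Standard Grant
Fiscal year: 2020
Funding country: United States
Period: 2020-10-01 to 2023-09-30
Status: Completed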

Source: https://www.nsf.gov/awardsearch/showAward?AWD_ID=2008248&HistoricalAwards=false
Keywords: CNS Core Small Toward Globally

Project Abstract

In the era of large-scale deep learning (DL) and massive data, existing hardware systems have struggled to effectively accommodate heavy and complex computing workloads due to difficulties in scheduling highly dynamic, heterogeneous, and competing tasks from many users over many machines in a cluster or data-center environment. This project aims to develop a "1-click" demand-aware and responsive software system capable of simultaneously training a wide spectrum of DL tasks, using a new resource management architecture that automatically and adaptively chooses the most effective distributed training/serving techniques and their hyperparameters to achieve the best overall efficiency of multiple tasks in such an environment.

This interdisciplinary project innovates in distributed systems design, DL algorithm design, and related industrial applications and theoretical analyses, with the following thrusts: 1: Develop a framework for "ML-aware" resource management and scheduling of multiple simultaneously running training tasks. 2: Develop principled strategies for resource management and scheduling for serving, streaming, and heterogeneous-task settings. 3: Optimize memory resources for training large-parameter models by developing holistic approaches to maximize computation throughput subject to device memory bounds. A limited-scope but rigorous and practical theoretical analysis of some of the proposed architectures will also be performed. This project addresses the needs of the academic and industrial communities and will have a broad impact on both. It will provide easy-to-use tools that reduce the time to set up and facilitate large-scale experimentation, while reducing the required costs, whether measured in cluster access quotas or dollars spent on cloud services.

The impact on commercial practitioners will be even greater, improving their productivity by an order of magnitude or more, as they must contend with heterogeneous computing and network resources that are shared among many users as well as the need to run many jobs on a regular basis. The team will release and/or open-source the code at http://sailing-lab.wixsite.com/sailing-pmls to benefit researchers and practitioners, to share lessons learned, to advocate for more research on machine learning (ML) systems problems, and to democratize high-performance ML systems, making them accessible to software developers without ML training and to society at large, in fields such as industry and manufacturing, healthcare, biology, social science, and finance, where results may have a catalytic impact. The team will publish results at a variety of top-tier conferences, including machine learning (NIPS, ICML), systems (OSDI, SOSP, USENIX), and data mining (KDD, WWW).

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.


Project Outcomes

Journal papers (5)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

MPCFormer: fast, performant and private Transformer inference with MPC

DOI: 10.48550/arxiv.2211.01452
Published: 2022-11
Journal: ArXiv
Impact factor: 0
Authors: Dacheng Li; Rulin Shao; Hongyi Wang; Han Guo; Eric P. Xing; Haotong Zhang
Corresponding authors: Dacheng Li; Rulin Shao; Hongyi Wang; Han Guo; Eric P. Xing; Haotong Zhang


Other publications by Eric Xing

What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions

DOI: not listed
Published: 2024
Journal: arXiv.org
Impact factor: 0
Authors: Sang Keun Choe; Hwijeen Ahn; Juhan Bae; Kewen Zhao; Minsoo Kang; Youngseog Chung; Adithya Pratapa; W. Neiswanger; Emma Strubell; Teruko Mitamura; Jeff Schneider; Eduard Hovy; Roger Grosse; Eric Xing
Corresponding author: Eric Xing