CNS Core: Small: Toward Globally-Optimal Resource Distribution and Computation Acceleration in Multi-Tenant and Heterogeneous Machine Learning Systems
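Basic information
- Award number: 2008248
- Principal investigator:
- Amount: $499,900
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2020
- Funding country: United States
- Period: 2020-10-01 to 2023-09-30
- Status: Completed
- Source:
- Keywords: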
Project abstract
In the era of large-scale deep learning (DL) and massive data, existing hardware systems have struggled to accommodate heavy and complex computing workloads effectively, because it is difficult to schedule highly dynamic, heterogeneous, and competing tasks from many users over many machines in a cluster or data-center environment. This project aims to develop a "1-click", demand-aware, and responsive software system capable of simultaneously training a wide spectrum of DL tasks. The system uses a new resource management architecture that automatically and adaptively chooses the most effective distributed training/serving techniques and their hyperparameters to achieve the best overall efficiency across multiple tasks in such an environment.
This interdisciplinary project innovates in distributed systems design, DL algorithm design, and related industrial applications and theoretical analyses, with the following thrusts: (1) develop a framework for "ML-aware" resource management and scheduling of multiple simultaneously running training tasks; (2) develop principled strategies for resource management and scheduling in serving, streaming, and heterogeneous-task settings; (3) optimize memory resources for training large-parameter models by developing holistic approaches that maximize computation throughput subject to device memory bounds. A limited-scope but rigorous and practical theoretical analysis of some of the proposed architectures will also be performed.
This project addresses needs in both the academic and industrial communities and will have a broad impact on both. It will provide easy-to-use tools that reduce setup time and facilitate large-scale experimentation, while reducing the required costs, whether measured in cluster access quotas or dollars spent on cloud services. The impact on commercial practitioners will be even greater, improving their productivity by an order of magnitude or more, as they must contend with heterogeneous computing and network resources that are shared among many users, as well as the need to run many jobs on a regular basis.
The team will release and/or open-source the code at http://sailing-lab.wixsite.com/sailing-pmls to benefit researchers and practitioners, to share lessons learned and encourage more research on machine learning (ML) systems problems, and to democratize high-performance ML systems, making them accessible to software developers without ML training and to society at large, in sectors such as industry and manufacturing, healthcare, biology, social science, and finance, where results may have a catalytic impact. The team will publish results at a variety of top-tier conferences, including machine learning (NIPS, ICML), systems (OSDI, SOSP, USENIX), and data mining (KDD, WWW).
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
MPCFormer: fast, performant and private Transformer inference with MPC
- DOI: 10.48550/arxiv.2211.01452
- Published: 2022-11
- Journal:
- Impact factor: 0
- Authors: Dacheng Li; Rulin Shao; Hongyi Wang; Han Guo; Eric P. Xing; Haotong Zhang
- Corresponding authors: Dacheng Li; Rulin Shao; Hongyi Wang; Han Guo; Eric P. Xing; Haotong Zhang
Other publications by Eric Xing
What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Sang Keun Choe; Hwijeen Ahn; Juhan Bae; Kewen Zhao; Minsoo Kang; Youngseog Chung; Adithya Pratapa; W. Neiswanger; Emma Strubell; Teruko Mitamura; Jeff Schneider; Eduard Hovy; Roger Grosse; Eric Xing
- Corresponding author: Eric Xing
An exploratory study of self-supervised pre-training on partially supervised multi-label classification on chest X-ray images
- DOI: 10.1016/j.asoc.2024.111855
- Published: 2024
- Journal:
- Impact factor: 8.7
- Authors: Nanqing Dong; Michael Kampffmeyer; Haoyang Su; Eric Xing
- Corresponding author: Eric Xing
Other grants by Eric Xing
III: Small: Multiple Device Collaborative Learning in Real Heterogeneous and Dynamic Environments
- Award number: 2311990
- Fiscal year: 2023
- Amount: $499,900
- Award type: Standard Grant
ML Basis for Intelligence Augmentation: Toward Personalized Modeling, Reasoning under Data-Knowledge Symbiosis, and Interpretable Interaction for AI-assisted Human Decision-making
- Award number: 2040381
- Fiscal year: 2021
- Amount: $499,900
- Award type: Continuing Grant
Collaborative Research: SCH: Trustworthy and Explainable AI for Neurodegenerative Diseases
- Award number: 2123952
- Fiscal year: 2021
- Amount: $499,900
- Award type: Standard Grant
III: Small: A New Approach to Latent Space Learning with Diversity-Inducing Regularization and Applications to Healthcare Data Analytics
- Award number: 1617583
- Fiscal year: 2016
- Amount: $499,900
- Award type: Standard Grant
XPS: FULL: Broad-Purpose, Aggressively Asynchronous and Theoretically Sound Parallel Large-scale Machine Learning
- Award number: 1629559
- Fiscal year: 2016
- Amount: $499,900
- Award type: Standard Grant
BIGDATA: F: DKA: Collaborative Research: Theory and Algorithms for Parallel Probabilistic Inference with Big Data, via Big Model, in Realistic Distributed Computing Environments
- Award number: 1447676
- Fiscal year: 2014
- Amount: $499,900
- Award type: Standard Grant
III: Small: Collaborative Research: Efficient, Nonparametric and Local-Minimum-Free Latent Variable Models: With Application to Large-Scale Computer Vision and Genomics
- Award number: 1218282
- Fiscal year: 2012
- Amount: $499,900
- Award type: Continuing Grant
III: Small: Collaborative Research: Using Large-Scale Image Data for Online Social Media Analysis
- Award number: 1115313
- Fiscal year: 2011
- Amount: $499,900
- Award type: Standard Grant
Collaborative Research: Discovering and Exploiting Latent Communities in Social Media
- Award number: 1111142
- Fiscal year: 2011
- Amount: $499,900
- Award type: Standard Grant
Indexing, Mining and Modeling Spatio-Temporal Patterns of Gene Expressions
- Award number: 0640543
- Fiscal year: 2007
- Amount: $499,900
- Award type: Continuing Grant
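Similar NSFC (National Natural Science Foundation of China) grants
Mechanism by which NRF2 regulation of KPNB1 promotes PD-L1 nuclear translocation and mediates immunotherapy resistance in non-small cell lung cancer
- Award number: 82303969
- Approval year: 2023
- Amount: 300,000 CNY
- Award type: Young Scientists Fund
Mechanism by which microglia regulate the lateral septum-ventral tegmental area neural circuit to mediate social reward deficits
- Award number: 82304474
- Approval year: 2023
- Amount: 300,000 CNY
- Award type: Young Scientists Fund
Mechanism by which renal denervation promotes M2 polarization of microglia in the hypothalamic paraventricular nucleus to attenuate heart failure injury
- Award number: 82370387
- Approval year: 2023
- Amount: 490,000 CNY
- Award type: General Program
Using spatial proximity labeling to study the potential relationship between pyrenoid tubules of Chlamydomonas reinhardtii and the carbon-concentrating mechanism
- Award number: 32300220
- Approval year: 2023
- Amount: 300,000 CNY
- Award type: Young Scientists Fund
Role and mechanism of polyG protein aggregate-induced microglial activation in neuronal intranuclear inclusion disease
- Award number: 82301603
- Approval year: 2023
- Amount: 300,000 CNY
- Award type: Young Scientists Fund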
Similar overseas grants
CNS Core: Small: Core Scheduling Techniques and Programming Abstractions for Scalable Serverless Edge Computing Engine
- Award number: 2322919
- Fiscal year: 2024
- Amount: $499,900
- Award type: Standard Grant
CNS Core: Small: Network Wide Sensing by Leveraging Cellular Communication Networks
- Award number: 2343469
- Fiscal year: 2024
- Amount: $499,900
- Award type: Standard Grant
CNS Core: Small: Intelligent Fault Injection to Expose and Reproduce Production-Grade Bugs in Cloud Systems
- Award number: 2317698
- Fiscal year: 2023
- Amount: $499,900
- Award type: Standard Grant
CNS Core: Small: Repurposing Smartphones to Minimize Carbon
- Award number: 2233894
- Fiscal year: 2023
- Amount: $499,900
- Award type: Standard Grant
Collaborative Research: CNS Core: Small: A Compilation System for Mapping Deep Learning Models to Tensorized Instructions (DELITE)
- Award number: 2230945
- Fiscal year: 2023
- Amount: $499,900
- Award type: Standard Grant