CRII: CIF: Unifying Scheduling and Optimization Techniques to Speed-up Distributed Stochastic Gradient Descent
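Basic information
- Award number: 1850029
- Principal investigator:
- Amount: $175,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2019
- Funding country: United States
- Period: 2019-03-01 to 2022-02-28
- Project status: Completed
- Source:
- Keywords:
Project abstract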
Stochastic gradient descent (SGD) is at the core of state-of-the-art supervised learning, which is revolutionizing inference and decision-making in many diverse applications such as self-driving cars, robotics, personalized search and recommendations, and medical diagnosis. Thus, improving the speed of stochastic gradient descent is a timely and important research problem. Due to the massive scale of neural network models and training data sets used today, it has become advantageous to parallelize SGD across multiple computing nodes. Although parallelizing SGD boosts the amount of data processed per iteration, it exposes the algorithm to unpredictable node slowdown and communication delays stemming from variability in the computing infrastructure. The goal of this project is to design provably fast SGD algorithms that easily lend themselves to distributed implementations, and are robust to fluctuations in computation and network delays as well as unpredictable node failures. This project can assist in making machine learning universally accessible, without requiring access to expensive high-performance computing infrastructure. An open-source implementation of the resulting adaptive distributed SGD algorithms will be released. The research outcomes will also be incorporated into two new machine learning classes at Carnegie Mellon University, and into curriculum development and research sampler workshops for K-12 teachers and students.
The speed of single-node SGD is typically measured in terms of the convergence of training error with respect to the number of iterations. In distributed SGD, the runtime per iteration depends on system-level factors such as the computation delays at worker nodes and the gradient aggregation mechanism. Thus, there is a critical need to understand the error convergence with respect to the wall-clock time rather than the number of iterations. This project will improve the true convergence of distributed SGD with respect to wall-clock time by jointly optimizing the runtime-per-iteration and error-versus-iterations. It will consider two popular distributed SGD frameworks, the parameter server model and the communication-efficient SGD model. The research is expected to provide novel runtime and error analyses of distributed SGD in these frameworks and design the first adaptive distributed SGD algorithms that strike the best error-runtime trade-off.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
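The abstract's point that runtime per iteration depends on worker delays and on the aggregation mechanism can be illustrated with a toy simulation. This is a minimal sketch and not the project's algorithm: the function names, the choice of 8 workers, and the assumption that worker compute delays are i.i.d. exponential are all hypothetical and do not come from the award record. It compares a server that waits for all workers with one that proceeds after the fastest `k` have reported.

```python
import random


def iteration_time(delays, k):
    """Time until the k fastest workers have reported their gradients."""
    return sorted(delays)[k - 1]


def mean_iteration_time(n_workers, k, n_iters=10_000, mean_delay=1.0, seed=0):
    """Average per-iteration runtime when the server waits for k of n workers.

    Assumption (illustrative only): each worker's compute delay is drawn
    i.i.d. from an exponential distribution with the given mean.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_iters):
        # expovariate takes the rate, which is 1 / mean.
        delays = [rng.expovariate(1.0 / mean_delay) for _ in range(n_workers)]
        total += iteration_time(delays, k)
    return total / n_iters


if __name__ == "__main__":
    n = 8  # hypothetical cluster size
    for k in (1, 4, 8):
        t = mean_iteration_time(n, k)
        print(f"wait for {k} of {n} workers: mean time per iteration = {t:.3f}")
```

Under these assumptions, waiting for every worker makes each iteration as slow as the slowest node, while waiting for fewer workers shortens the iteration but aggregates fewer gradients per update. That tension between runtime per iteration and error per iteration is the trade-off the abstract describes.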
Project outcomes
Journal articles (6)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning
- DOI: 10.1109/icassp39728.2021.9413697
- Published: 2021-01-01
- Journal:
- Impact factor: 0
- Authors: Jhunjhunwala, Divyansh; Gadhikar, Advait; Eldar, Yonina C.
- Corresponding author: Eldar, Yonina C.
Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms
- DOI:
- Published: 2018-08
- Journal:
- Impact factor: 0
- Authors: Jianyu Wang; Gauri Joshi
- Corresponding authors: Jianyu Wang; Gauri Joshi
Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms
- DOI:
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Jianyu Wang; Gauri Joshi
- Corresponding authors: Jianyu Wang; Gauri Joshi
Slow and Stale Gradients Can Win the Race
- DOI: 10.1109/jsait.2021.3103770
- Published: 2018-03
- Journal:
- Impact factor: 0
- Authors: Sanghamitra Dutta; Gauri Joshi; Soumyadip Ghosh; Parijat Dube; P. Nagpurkar
- Corresponding authors: Sanghamitra Dutta; Gauri Joshi; Soumyadip Ghosh; Parijat Dube; P. Nagpurkar
Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD
- DOI: 10.1109/icassp40776.2020.9053834
- Published: 2020-02
- Journal:
- Impact factor: 0
- Authors: Jianyu Wang; Hao Liang; Gauri Joshi
- Corresponding authors: Jianyu Wang; Hao Liang; Gauri Joshi
Other publications by Gauri Joshi
Optimal relay placement for cellular coverage extension
- DOI: 10.1109/ncc.2011.5734705
- Published: 2011
- Journal:
- Impact factor: 0
- Authors: Gauri Joshi; A. Karandikar
- Corresponding author: A. Karandikar
Synergy via Redundancy: Adaptive Replication Strategies and Fundamental Limits
- DOI: 10.1109/tnet.2020.3047513
- Published: 2020
- Journal:
- Impact factor: 0
- Authors: Gauri Joshi; Dhruva Kaushal
- Corresponding author: Dhruva Kaushal
Budget Impact Analysis of a Computer-Delivered Brief Alcohol Intervention in Veterans Affairs (VA) Liver Clinics: A Randomized Controlled Trial
- DOI: 10.1080/07347324.2020.1760755
- Published: 2020
- Journal:
- Impact factor: 0.9
- Authors: A. Esmaeili; Wei Yu; Michael A. Cucciare; Ann S Combs; Gauri Joshi; K. Humphreys
- Corresponding author: K. Humphreys
Efficient Replication of Queued Tasks to Reduce Latency in Cloud Systems
- DOI:
- Published: 2015
- Journal:
- Impact factor: 0
- Authors: Gauri Joshi
- Corresponding author: Gauri Joshi
Can Your AI Differentiate Cats from Covid-19? Sample Efficient Uncertainty Estimation for Deep Learning Safety
- DOI:
- Published: 2020
- Journal:
- Impact factor: 0
- Authors: Ankur Mallick; Chaitanya Dwivedi; B. Kailkhura; Gauri Joshi; Yong Han
- Corresponding author: Yong Han
Other grants by Gauri Joshi
CAREER: Frontiers of Distributed Machine Learning with Communication, Computation and Data Constraints
- Award number: 2045694
- Fiscal year: 2021
- Amount: $175,000
- Project type: Continuing Grant
Collaborative Research: SHF: Medium: HERMES: On-Device Distributed Machine Learning via Model-Hardware Co-Design
- Award number: 2107024
- Fiscal year: 2021
- Amount: $175,000
- Project type: Continuing Grant
CIF: Small: Efficient Sequential Decision-Making and Inference in the Small Data Regime
- Award number: 2007834
- Fiscal year: 2020
- Amount: $175,000
- Project type: Standard Grant
CSR: Small: ARTEMIS: Algorithm-Hardware Co-Design for Efficient Machine Learning Systems
- Award number: 1815780
- Fiscal year: 2018
- Amount: $175,000
- Project type: Standard Grant
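Similar National Natural Science Foundation of China (NSFC) grants
Mechanism by which SHR and CIF cooperatively regulate Casparian strip formation in plant roots
- Award number: 31900169
- Approval year: 2019
- Amount: CNY 230,000
- Project type: Young Scientists Fund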
Similar overseas grants
CIF: Small: Fundamental Limits of Empirical Risk Minimization in High Dimensions: A Unifying Gaussian Processes Approach
- Award number: 2009030
- Fiscal year: 2020
- Amount: $175,000
- Project type: Standard Grant
CIF: Small: Mathematical Tools for Unifying and Simplifying the Analysis and Optimization of Wireless Networks
- Award number: 1910868
- Fiscal year: 2019
- Amount: $175,000
- Project type: Standard Grant
CIF: Small: A Simple and Unifying Optimization Framework for Signal and Information Processing Problems with Min-Max Structures
- Award number: 1910385
- Fiscal year: 2019
- Amount: $175,000
- Project type: Standard Grant
CIF: NeTS: Medium: Collaborative Research: Unifying Data Synchronization
- Award number: 1563710
- Fiscal year: 2016
- Amount: $175,000
- Project type: Continuing Grant
CIF: NeTS: Medium: Collaborative Research: Unifying Data Synchronization
- Award number: 1563753
- Fiscal year: 2016
- Amount: $175,000
- Project type: Continuing Grant