CRII: OAC: A Compressor-Assisted Collective Communication Framework for GPU-Based Large-Scale Deep Learning
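Basic Information
- Award number: 2348465
- Principal investigator:
- Amount: $175,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2024
- Funding country: United States
- Project period: 2024-06-01 to 2026-05-31
- Project status: Ongoing
- Source:
- Keywords:
Project Abstract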
The scale of modern deep learning is expanding rapidly, driven by larger training datasets, larger neural network models, and new algorithms and techniques. This growth presents significant challenges to current distributed high-performance computing (HPC) infrastructures, since larger-scale training incurs more expensive collective communication to pass larger gradient messages among nodes. A more powerful hardware platform does not necessarily overcome this performance bottleneck, because optimized middleware support is needed to fully unleash the platform's computing capacity. This project aims to close the gap between the training scale and the infrastructure's capability by providing gradient-specific lossy compression techniques and an optimized GPU-aware, compressor-assisted collective communication framework that systematically reduces gradient message sizes and improves communication performance. The deliverables can help end users achieve significantly faster training while preserving training accuracy. The success of this research can promote progress both in traditional AI research, such as computer vision and natural language processing, and in emerging AI for Science research in domain sciences, including cosmology, X-ray imaging, and drug discovery. This project also contributes to educational and engagement activities by leveraging the research outcomes to develop new curricula and teaching tools for mentoring college students and training K-12 students in HPC and AI.
Using current collective communication libraries for large-scale distributed deep learning can incur significant communication overhead because the gradient messages are large. Applying lossy compression to gradient messages could reduce this overhead. However, several important open research questions must be investigated to ensure a performance gain: 1) Are current lossy compressors efficient enough for gradient data? 2) How can lossy compressors be efficiently integrated into a GPU-aware collective communication framework? 3) How can GPU resources be efficiently shared among different tasks? This project addresses these questions and delivers a novel compressor-assisted, GPU-aware collective communication framework for large-scale deep learning. Specifically, the team 1) investigates the efficiency of using error-bounded lossy compressors for scientific data to compress gradient data, and develops a new gradient compressor that leverages the advantages of different existing compressors to achieve a better compression ratio and training accuracy; 2) designs the new compressor's GPU implementation, integrates it into GPU-aware MPI, and then optimizes the workflow so that the gradient compressor's cost is hidden within the communication cost; 3) profiles the GPU resource utilization of both the deep learning training and the compressor-assisted collective communications, and designs a new communication framework that schedules training, compression, and the collectives' computations (e.g., reduction) on the same GPU to achieve optimal resource sharing for end-to-end deep learning training.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
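The idea of an error-bounded lossy compressor feeding a collective reduction, as described in the abstract, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: uniform scalar quantization stands in for a real error-bounded compressor, the function names `compress`, `decompress`, and `simulated_allreduce_sum` are hypothetical, and the "collective" is an in-process loop rather than GPU-aware MPI. It is not the project's compressor or framework.

```python
def compress(grad, error_bound):
    """Map each gradient value to an integer code on a uniform grid.

    The grid spacing is 2 * error_bound, so rounding to the nearest grid
    point changes any value by at most error_bound.
    """
    step = 2.0 * error_bound
    return [round(g / step) for g in grad]


def decompress(codes, error_bound):
    """Reconstruct approximate gradient values from integer codes."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def simulated_allreduce_sum(worker_grads, error_bound):
    """Sum gradients across workers while exchanging integer codes only.

    Stands in for a real collective (assumption): every worker compresses
    locally, the codes are summed element-wise, and the result is
    decompressed once.
    """
    compressed = [compress(g, error_bound) for g in worker_grads]
    summed_codes = [sum(column) for column in zip(*compressed)]
    return decompress(summed_codes, error_bound)


if __name__ == "__main__":
    error_bound = 0.01
    worker_grads = [
        [0.013, -0.402, 0.0],
        [0.120, 0.033, -0.271],
    ]
    exact = [sum(column) for column in zip(*worker_grads)]
    approx = simulated_allreduce_sum(worker_grads, error_bound)
    worst = max(abs(e - a) for e, a in zip(exact, approx))
    # With two workers, the error of the sum is at most 2 * error_bound.
    print(f"worst-case error: {worst:.4f} (limit {2 * error_bound:.4f})")
```

In this sketch the per-value error never exceeds `error_bound`, and the error of a sum over n workers never exceeds n times that bound. The size reduction comes only from replacing floats with small integers; a practical compressor would add further encoding stages, and overlapping compression with communication on the GPU is outside what this toy shows.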
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Xiaodong Yu
A high-throughput and simultaneous determination of Combretastatin A-4 Phosphate and its metabolites in human plasma by HPLC-MS/MS: Application to a clinical pharmacokinetic study.
- DOI: 10.1002/bmc.5204
- Published: 2021-06-24
- Journal:
- Impact factor: 0
- Authors: Qizhen Wu; Qian Wang; Yixuan Wang; Jingqi Huang; Yalin Fang; Weiyi Wu; Wenying Wu; Fan Wu; Xiaodong Yu; Yan Sun
- Corresponding author: Yan Sun
Physicochemical, functional, and antioxidant properties of black soldier fly larvae protein.
- DOI: 10.1111/1750-3841.16846
- Published: 2023-11-20
- Journal:
- Impact factor: 3.9
- Authors: Wangxiang Huang; Chen Wang; Qianzi Chen; Feng Chen; Haohan Hu; Jianfei Li; Qiyi He; Xiaodong Yu
- Corresponding author: Xiaodong Yu
JMJD2A promotes the development of castration-resistant prostate cancer by activating androgen receptor enhancer and inhibiting the cGAS-STING pathway.
- DOI: 10.1002/mc.23753
- Published: 2024-05-31
- Journal:
- Impact factor: 4.6
- Authors: Xiang Cai; Xiaodong Yu; Tielong Tang; Yi Xu; Tao Wu
- Corresponding author: Tao Wu
High-Performance Ptychographic Reconstruction with Federated Facilities
- DOI: 10.1007/978-3-030-96498-6_10
- Published: 2021-11-22
- Journal:
- Impact factor: 0
- Authors: Tekin Bicer; Xiaodong Yu; D. Ching; Ryan Chard; M. Cherukara; Bogdan Nicolae; R. Kettimuthu; Ian T. Foster
- Corresponding author: Ian T. Foster
Research on evaluation model of new power system information communication digitalization ability
- DOI: 10.1117/12.3004059
- Published: 2023-10-11
- Journal:
- Impact factor: 0
- Authors: Xiaodong Yu; Qingyong Guan; Wei Shuai Liu; Yang Yang; Haipeng Sun
- Corresponding author: Haipeng Sun
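Similar NSFC grants
Molecular basis by which Z8-12:OH and Z8-14:OAc respectively maintain sex-attractant specificity in the oriental fruit moth and the plum fruit moth
- Award number:
- Approval year: 2021
- Amount: CNY 350,000
- Project type: Regional Science Fund Project
Mechanistic study of the photoisomerization of the nitrosyl ruthenium complex [Ru(OAc)(2mqn)2NO]
- Award number: 21603131
- Approval year: 2016
- Amount: CNY 190,000
- Project type: Young Scientists Fund Project
Study of Mn(OAc)3-promoted radical cascade reactions under mechanochemical conditions
- Award number: 21242013
- Approval year: 2012
- Amount: CNY 100,000
- Project type: Special Fund Project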
Similar overseas grants
OAC Core: Cost-Adaptive Monitoring and Real-Time Tuning at Function-Level
- Award number: 2402542
- Fiscal year: 2024
- Amount: $175,000
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403088
- Fiscal year: 2024
- Amount: $175,000
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403090
- Fiscal year: 2024
- Amount: $175,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403313
- Fiscal year: 2024
- Amount: $175,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
- Award number: 2414185
- Fiscal year: 2024
- Amount: $175,000
- Project type: Standard Grant