Collaborative Research: OAC Core: ScaDL: New Approaches to Scaling Deep Learning for Science Applications on Supercomputers
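Basic information
- Award number: 2401246
- Principal investigator:
- Amount: $226,400
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2023
- Funding country: United States
- Period: 2023-10-01 to 2024-10-31
- Project status: Completed
- Source:
- Keywords:
Project abstract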
Today's deep learning (DL) revolution is enabled by efficient deep neural network (DNN) training methods that capture important patterns within large quantities of data in compact, easily usable DNN models. DL methods are applied routinely to tasks like natural language translation and image labeling--and, in science and engineering, to problems as diverse as drug design, environmental monitoring, and fusion energy. Yet as data sizes increase and DL methods grow in sophistication, the time required to train new models often emerges as a major challenge. The Scalable Deep Learning (ScaDL) project will address this challenge by making it possible to use specialized high-performance computing (HPC) systems to train bigger models more rapidly. Efficient use of the thousands of powerful processors in modern HPC systems for DNN training has previously been stymied by communication costs that grow rapidly with the number of processors used. ScaDL will overcome this obstacle by developing new DNN training methods that reduce communication requirements by performing additional computation, by validating the effectiveness of these new methods in a range of scientific applications that use DL in different ways, and by integrating the new methods into scalable DL software for use by domain scientists, computer scientists, and engineers supporting DL application in HPC centers. By permitting the use of powerful HPC systems to train DNN models thousands of times faster than on a single computer, ScaDL will enable advances in many areas of science and engineering. The project will also contribute to educational outcomes by engaging PhD students in project goals, by using ScaDL tools in a new DL systems engineering class at the University of Chicago, and by enlisting participants in summer schools at the Texas Advanced Computing Center (TACC) and U. Chicago, both of which target recruitment of students from underserved communities at the graduate, undergraduate, and high-school levels, to apply the tools to scientific problems. ScaDL's focus on science applications and education aligns the project with NSF's mission of promoting the progress of science.
The ScaDL project contributes to science in two ways. First, it explores new techniques for enhancing the speed and scalability of commonly used optimization methods without losing model performance, by: 1) exploiting scalable algorithms for second-order information approximation; 2) developing methods for adapting to different computer hardware by tuning computation and communication to maximize training speed; 3) exploring compression techniques to reduce communication overheads; 4) using well-known benchmark applications to evaluate the convergence of ScaDL; and 5) applying its new algorithms and systems to science applications. Second, it will release an open-source implementation of the proposed algorithms and system. The implementation will be available on a variety of hardware platforms and capable of choosing the ratio of computation and communication required to make efficient use of the computation and communication hardware on a particular HPC system. The resulting algorithms and system will help disseminate ScaDL research results to a wide spectrum of research domains and users, and promote the adoption of the new methods in practical settings.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
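The abstract lists compression as one route to lower communication overheads but does not say which scheme ScaDL uses. As a purely illustrative sketch, and not the project's actual method, the following shows top-k gradient sparsification with error feedback, one widely known technique of this kind: each worker sends only its k largest-magnitude gradient entries and carries the unsent remainder into the next step. All function names here are hypothetical, and plain Python lists stand in for gradient tensors.

```python
from typing import Dict, List, Tuple


def topk_compress(grad: List[float], k: int) -> Dict[int, float]:
    """Keep only the k largest-magnitude entries, as {index: value}."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in order[:k]}


def decompress(sparse: Dict[int, float], n: int) -> List[float]:
    """Expand a sparse {index: value} message back to a dense list of length n."""
    dense = [0.0] * n
    for i, v in sparse.items():
        dense[i] = v
    return dense


def compress_with_error_feedback(
    grad: List[float], residual: List[float], k: int
) -> Tuple[Dict[int, float], List[float]]:
    """Add the residual left over from earlier steps, send the top-k entries,
    and return the unsent remainder as the new residual."""
    corrected = [g + r for g, r in zip(grad, residual)]
    sparse = topk_compress(corrected, k)
    sent = decompress(sparse, len(grad))
    new_residual = [c - s for c, s in zip(corrected, sent)]
    return sparse, new_residual


def average_sparse(messages: List[Dict[int, float]], n: int) -> List[float]:
    """Average the sparse messages from all workers (stand-in for an all-reduce)."""
    total = [0.0] * n
    for msg in messages:
        for i, v in msg.items():
            total[i] += v
    return [t / len(messages) for t in total]


if __name__ == "__main__":
    grad = [0.1, -2.0, 0.5, 3.0]
    sparse, residual = compress_with_error_feedback(grad, [0.0] * 4, k=2)
    print(sparse)    # only 2 of the 4 entries are communicated
    print(residual)  # the rest is kept locally for the next step
```

In this sketch the volume communicated per worker scales with k rather than with the model size, at the cost of extra local work (sorting and residual bookkeeping), which mirrors the computation-for-communication trade the abstract describes.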
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Zhao Zhang
Tunable erbium-doped fiber ring laser based on an all-fiber filter
- DOI: 10.1117/12.2000105
- Published: 2012
- Journal:
- Impact factor: 0
- Authors: X. Ji; Z. Cao; Zhao Zhang; Tao Shui; Wenliang Hao; B. Yu
- Corresponding author: B. Yu
An efficient and convenient formal synthesis of Jaspine B from D-xylose.
- DOI: 10.1016/j.carres.2012.01.013
- Published: 2012-04
- Journal:
- Impact factor: 3.1
- Authors: Zhao Zhang; Yu-Tao Zhao; Wen Qu; Hong-Min Liu
- Corresponding author: Hong-Min Liu
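A secure, efficient, and fine-grained data access control mechanism for P2P storage clouds
- DOI:
- Published: 2014
- Journal:
- Impact factor: 6.5
- Authors: Heng He; Ruixuan Li; Xinhua Dong; Zhao Zhang
- Corresponding author: Zhao Zhang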
A 12bit 39ps two-step Time-to-Digital Converter in 40nm CMOS
- DOI: 10.1109/icsict55466.2022.9963212
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Xuxi Liu; Zhao Zhang; Tao Yin; Rui Wu; P. Feng; Liyuan Liu
- Corresponding author: Liyuan Liu
A workflow for building surface-based reservoir models using NURBS curves, coons patches, unstructured tetrahedral meshes and open-source libraries
- DOI: 10.1016/j.cageo.2018.09.001
- Published: 2018
- Journal:
- Impact factor: 0
- Authors: Zhao Zhang; Z. Yin; Xia Yan
- Corresponding author: Xia Yan
Other grants awarded to Zhao Zhang
CAREER: Efficient and Scalable Large Foundational Model Training on Supercomputers for Science
- Award number: 2340011
- Fiscal year: 2024
- Amount: $226,400
- Project type: Standard Grant
Collaborative Research: Frameworks: hpcGPT: Enhancing Computing Center User Support with HPC-enriched Generative AI
- Award number: 2411294
- Fiscal year: 2024
- Amount: $226,400
- Project type: Standard Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
- Award number: 2312689
- Fiscal year: 2023
- Amount: $226,400
- Project type: Continuing Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
- Award number: 2401244
- Fiscal year: 2023
- Amount: $226,400
- Project type: Continuing Grant
Collaborative Research: Frameworks: Diamond: Democratizing Large Neural Network Model Training for Science
- Award number: 2311766
- Fiscal year: 2023
- Amount: $226,400
- Project type: Standard Grant
Collaborative Research: Frameworks: Diamond: Democratizing Large Neural Network Model Training for Science
- Award number: 2401245
- Fiscal year: 2023
- Amount: $226,400
- Project type: Standard Grant
Collaborative Research: OAC Core: ScaDL: New Approaches to Scaling Deep Learning for Science Applications on Supercomputers
- Award number: 2106661
- Fiscal year: 2021
- Amount: $226,400
- Project type: Standard Grant
Collaborative Research: OAC Core: Small: Efficient and Policy-driven Burst Buffer Sharing
- Award number: 2008388
- Fiscal year: 2020
- Amount: $226,400
- Project type: Standard Grant
SHF: Medium: Collaborative Research: Architectural and System Support for Building Versatile Memory Systems
- Award number: 1643271
- Fiscal year: 2016
- Amount: $226,400
- Project type: Continuing Grant
SHF: Medium: Collaborative Research: Architectural and System Support for Building Versatile Memory Systems
- Award number: 1514229
- Fiscal year: 2015
- Amount: $226,400
- Project type: Continuing Grant
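Similar NSFC (National Natural Science Foundation of China) grants
Research on highly integrated microwave/millimeter-wave antennas supporting two-dimensional millimeter-wave beam scanning
- Award number: 62371263
- Approval year: 2023
- Amount: CNY 520,000
- Project type: General Program
Research on tandem Heck/denitrogenative rearrangement reactions of hydrazones
- Award number: 22301211
- Approval year: 2023
- Amount: CNY 300,000
- Project type: Young Scientists Fund
Research on synergistic performance regulation and dendrite suppression mechanisms in aqueous zinc-ion batteries
- Award number: 52364038
- Approval year: 2023
- Amount: CNY 330,000
- Project type: Regional Science Fund Program
Pathogenic role and mechanism of TSPYL1 mutations in sudden infant death syndrome, studied with a human serotonergic neuron reporter system
- Award number: 82371176
- Approval year: 2023
- Amount: CNY 490,000
- Project type: General Program
Mechanistic study of FOXO3 m6A methylation-induced trophoblast senescence in the treatment of spontaneous abortion by the kidney-tonifying method
- Award number: 82305286
- Approval year: 2023
- Amount: CNY 300,000
- Project type: Young Scientists Fund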
Similar overseas grants
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
- Award number: 2414474
- Fiscal year: 2024
- Amount: $226,400
- Project type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403312
- Fiscal year: 2024
- Amount: $226,400
- Project type: Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
- Award number: 2414185
- Fiscal year: 2024
- Amount: $226,400
- Project type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402947
- Fiscal year: 2024
- Amount: $226,400
- Project type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403313
- Fiscal year: 2024
- Amount: $226,400
- Project type: Standard Grant