Collaborative Research: OAC Core: ScaDL: New Approaches to Scaling Deep Learning for Science Applications on Supercomputers

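Basic information

  • Award number:
    2106661
  • Principal investigator:
  • Amount:
    $226,400
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2021
  • Funding country:
    United States
  • Project period:
    2021-10-01 to 2023-11-30
  • Project status:
    Completed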

Project abstract

Today's deep learning (DL) revolution is enabled by efficient deep neural network (DNN) training methods that capture important patterns within large quantities of data in compact, easily usable DNN models. DL methods are applied routinely to tasks like natural language translation and image labeling--and, in science and engineering, to problems as diverse as drug design, environmental monitoring, and fusion energy. Yet as data sizes increase and DL methods grow in sophistication, the time required to train new models often emerges as a major challenge. The Scalable Deep Learning (ScaDL) project will address this challenge by making it possible to use specialized high-performance computing (HPC) systems to train bigger models more rapidly. Efficient use of the thousands of powerful processors in modern HPC systems for DNN training has previously been stymied by communication costs that grow rapidly with the number of processors used. ScaDL will overcome this obstacle by developing new DNN training methods that reduce communication requirements by performing additional computation, by validating the effectiveness of these new methods in a range of scientific applications that use DL in different ways, and by integrating the new methods into scalable DL software for use by domain scientists, computer scientists, and engineers supporting DL application in HPC centers. By permitting the use of powerful HPC systems to train DNN models thousands of times faster than on a single computer, ScaDL will enable advances in many areas of science and engineering. The project will also contribute to educational outcomes by engaging PhD students in project goals, by using ScaDL tools in a new DL systems engineering class at the University of Chicago, and by enlisting participants in summer schools at the Texas Advanced Computing Center (TACC) and U. Chicago, both of which target recruitment of students from underserved communities at the graduate, undergraduate, and high-school levels, to apply the tools to scientific problems. ScaDL's focus on science applications and education aligns the project with NSF's mission of promoting the progress of science.

The ScaDL project contributes to science in two ways. First, it explores new techniques for enhancing the speed and scalability of commonly used optimization methods without losing model performance, by: 1) exploiting scalable algorithms for second-order information approximation; 2) developing methods for adapting to different computer hardware by tuning computation and communication to maximize training speed; 3) exploring compression techniques to reduce communication overheads; 4) using well-known benchmark applications to evaluate the convergence of ScaDL; and 5) applying its new algorithms and systems to science applications. Second, it will release an open-source implementation of the proposed algorithms and system. The implementation will be available on a variety of hardware platforms and capable of choosing the ratio of computation and communication required to make efficient use of the computation and communication hardware on a particular HPC system. The resulting algorithms and system will help disseminate ScaDL research results to a wide spectrum of research domains and users, and promote the adoption of the new methods in practical settings.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
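
The abstract lists compression techniques as one way to reduce communication overheads, but it does not say which techniques the project uses. As a rough illustration only, the sketch below shows top-k gradient sparsification, a widely used generic approach and not necessarily ScaDL's method: each worker transmits only the k largest-magnitude gradient entries and keeps the rest as a local residual. The function names `topk_compress` and `decompress` are hypothetical and chosen for this example.

```python
import heapq


def topk_compress(grad, k):
    """Keep the k largest-magnitude entries of a flat gradient (illustrative only).

    Returns the indices and values a worker would transmit, plus the
    residual (the entries that were not sent) kept on the worker.
    """
    if not 0 < k <= len(grad):
        raise ValueError("k must be between 1 and len(grad)")
    # Indices of the k entries with the largest absolute value, in ascending order.
    idx = sorted(heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i])))
    values = [grad[i] for i in idx]
    # Everything that was not selected stays behind as the residual.
    residual = list(grad)
    for i in idx:
        residual[i] = 0.0
    return idx, values, residual


def decompress(idx, values, size):
    """Rebuild a dense gradient of length `size` from the transmitted entries."""
    dense = [0.0] * size
    for i, v in zip(idx, values):
        dense[i] = v
    return dense


if __name__ == "__main__":
    grad = [0.1, -2.0, 0.3, 1.5, -0.05]
    idx, values, residual = topk_compress(grad, k=2)
    print(idx, values)                         # entries that would be communicated
    print(decompress(idx, values, len(grad)))  # what the receiver reconstructs
```

With k much smaller than the gradient length, only k index/value pairs cross the network per step. The cost is extra local work (selection and residual bookkeeping), which matches the abstract's general idea of trading additional computation for reduced communication.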

Project outcomes

Journal articles (2)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Deep Neural Network Training With Distributed K-FAC
KAISA: an adaptive second-order optimizer framework for deep neural networks
  • DOI:
    10.1145/3458817.3476152
  • Publication date:
    2021-11
  • Journal:
  • Impact factor:
    0
  • Authors:
    Pauloski, J. Gregory;Huang, Qi;Huang, Lei;Venkataraman, Shivaram;Chard, Kyle;Foster, Ian;Zhang, Zhao
  • Corresponding author:
    Zhang, Zhao

Other publications by Zhao Zhang

The genomic history of the Iberian Peninsula over the past 8000 years
  • DOI:
    10.4236/jbbs.2019.96018
  • Publication date:
    2024-09-14
  • Journal:
  • Impact factor:
    0
  • Authors:
    I. Olalde;Swapan Mallick;Nick Patterson;N. Rohl;Mouco;Marina Silva;Katharina Dulias;C. Edwards;Francesca G;ini;ini;Maria;Pala;Pedro;Soares;Manuel;Ferr;o;o;Nicole;Adamski;Broom;khoshbacht;khoshbacht;O. Cheronet;B. Culleton;Daniel Fern;es;es;Marie Lawson;Matthew Mah;Jonas Oppenheimer;Kristin Stewardson;Zhao Zhang;Juan Manuel Jiménez Arenas;Isidro Jorge Toro Moyano;Domingo C. Salazar;P. Castanyer;Marta Santos;J. Tremoleda;Marina Lozano;Pablo García;Borja;J. Fernández;J. A. Mujika;Cecilio Barroso;J. Bermúdez;E. Mínguez;Josep Burch;Neus Coromina;David Vivó;A. Cebrià;Josep Maria Fullola;Oreto García‐Puchol;J. I. Morales;F. Xavier;12;Oms;Tona;Majó;Josep;Vergés;Antònia;Díaz;Imma;13;Castanyer;F. J. López;A. M. Silva;C. Alonso;Germán;Delibes;de;Castro;Javier;Jiménez;Echevarría;Adolfo;Moreno;Guillermo Pascual Berlanga;Pablo Ramos;José Ramos Muñoz;E. Vij;e;e;16;Vila;Gustau Aguilella Arzo;Ángel Esparza Arroyo;K. Lillios;Jennifer Mack;J. Velasco;A. Waterman;Luis Benítez de Lugo Enrich;María Benito;18;Sánchez;B. Agustí;F. Codina;Gabriel de Prado;A. Estalrrich;Álvaro;Fernández;Flores;Clive;Finlayson;Geraldine;Stewart;20;Francisco Giles;Antonio Rosas;V. González;Gabriel García Atiénzar;M. S. H. Pérez;Arm;o Llanos;o;Carrión Marco;Isabel Beneyto;David López;Mar Tormo;A. C. Valera;C. Blasco;Corina Liesau;Patricia Ríos;Joan Daura;Jesús de Pedro Michó;Agustín A Diez Castillo;R. F. Fernández;R. Garrido;V. S. Gonçalves;E. Guerra;Ana Mercedes;26;Herrero;Joaquim Juan;Dani López;S. McClure;Merino Pérez;Arturo Oliver Foix;Montse Borràs;A. Sousa;Manuel Vidal Encinas;D. Kennett;Martin B. Richards;K. Alt;W. Haak;R. Pinhasi;C. Lalueza;David Reich
  • Corresponding author:
    David Reich
Hawkeye: Change-targeted Testing for Android Apps based on Deep Reinforcement Learning
Identification of microenvironment-related genes with prognostic value in clear cell renal cell carcinoma
  • DOI:
    10.1002/jcb.29654
  • Publication date:
    2020-01-21
  • Journal:
  • Impact factor:
    4
  • Authors:
    Zhao Zhang;Zeyan Li;Zhao Liu;Xiang Zhang;Nengwang Yu;Zhonghua Xu
  • Corresponding author:
    Zhonghua Xu
A performance comparison of DRAM memory system optimizations for SMT processors
Association Between Sex and Immune-Related Adverse Events During Immune Checkpoint Inhibitor Therapy.
  • DOI:
    10.1093/jnci/djab035
  • Publication date:
    2021-03-10
  • Journal:
  • Impact factor:
    0
  • Authors:
    Ying Jing;Yongchang Zhang;Jing Wang;Kunyan Li;Xue Chen;Jianfu Heng;Qian Gao;Youqiong Ye;Zhao Zhang;Yaoming Liu;Y. Lou;Steven H. Lin;L. Diao;Hong Liu;Xiang Chen;G. Mills;Leng Han
  • Corresponding author:
    Leng Han

Other grants held by Zhao Zhang

Collaborative Research: Frameworks: hpcGPT: Enhancing Computing Center User Support with HPC-enriched Generative AI
  • Award number:
    2411294
  • Fiscal year:
    2024
  • Award amount:
    $226,400
  • Project type:
    Standard Grant
CAREER: Efficient and Scalable Large Foundational Model Training on Supercomputers for Science
  • Award number:
    2340011
  • Fiscal year:
    2024
  • Award amount:
    $226,400
  • Project type:
    Standard Grant
Collaborative Research: Frameworks: Diamond: Democratizing Large Neural Network Model Training for Science
  • Award number:
    2401245
  • Fiscal year:
    2023
  • Award amount:
    $226,400
  • Project type:
    Standard Grant
Collaborative Research: Frameworks: Diamond: Democratizing Large Neural Network Model Training for Science
  • Award number:
    2311766
  • Fiscal year:
    2023
  • Award amount:
    $226,400
  • Project type:
    Standard Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
  • Award number:
    2312689
  • Fiscal year:
    2023
  • Award amount:
    $226,400
  • Project type:
    Continuing Grant
Collaborative Research: OAC Core: ScaDL: New Approaches to Scaling Deep Learning for Science Applications on Supercomputers
  • Award number:
    2401246
  • Fiscal year:
    2023
  • Award amount:
    $226,400
  • Project type:
    Standard Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
  • Award number:
    2401244
  • Fiscal year:
    2023
  • Award amount:
    $226,400
  • Project type:
    Continuing Grant
Collaborative Research: OAC Core: Small: Efficient and Policy-driven Burst Buffer Sharing
  • Award number:
    2008388
  • Fiscal year:
    2020
  • Award amount:
    $226,400
  • Project type:
    Standard Grant
SHF: Medium: Collaborative Research: Architectural and System Support for Building Versatile Memory Systems
  • Award number:
    1643271
  • Fiscal year:
    2016
  • Award amount:
    $226,400
  • Project type:
    Continuing Grant
SHF: Medium: Collaborative Research: Architectural and System Support for Building Versatile Memory Systems
  • Award number:
    1514229
  • Fiscal year:
    2015
  • Award amount:
    $226,400
  • Project type:
    Continuing Grant

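Similar grants from the National Natural Science Foundation of China (NSFC)

Mechanism by which IGF-1R regulation of HIF-1α promotes Th17 cell differentiation in the pathogenesis of thyroid eye disease
  • Award number:
    82301258
  • Approval year:
    2023
  • Award amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Mechanism by which CTCFL regulates IL-10 to suppress bystander activation of CD4+ CTLs and promote resistance to neoadjuvant immunotherapy in oral squamous cell carcinoma
  • Award number:
    82373325
  • Approval year:
    2023
  • Award amount:
    CNY 490,000
  • Project type:
    General Program
Mechanism by which mutations in the RNA splicing factor PRPF31 cause human retinitis pigmentosa
  • Award number:
    82301216
  • Approval year:
    2023
  • Award amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Role and mechanism of vascular endothelial cells regulating macrophage activation via the E2F1/NF-kB/IL-6 axis in orbital venous malformation
  • Award number:
    82301257
  • Approval year:
    2023
  • Award amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Cluster regulation and strengthening mechanisms in aluminum alloy matrices based on multi-element interatomic interactions
  • Award number:
    52371115
  • Approval year:
    2023
  • Award amount:
    CNY 500,000
  • Project type:
    General Program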

Similar overseas grants

Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403313
  • Fiscal year:
    2024
  • Award amount:
    $226,400
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
  • Award number:
    2414185
  • Fiscal year:
    2024
  • Award amount:
    $226,400
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403088
  • Fiscal year:
    2024
  • Award amount:
    $226,400
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403090
  • Fiscal year:
    2024
  • Award amount:
    $226,400
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402946
  • Fiscal year:
    2024
  • Award amount:
    $226,400
  • Project type:
    Standard Grant