SHF: Large: Collaborative Research: Next Generation Communication Mechanisms exploiting Heterogeneity, Hierarchy and Concurrency for Emerging HPC Systems

Basic Information

Award number:
1565431
Principal investigator:
William Barth
Amount:
$422,500
Host institution:
University of Texas at Austin
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2016
Funding country:
United States
Start and end dates:
2016-08-15 to 2020-07-31
Project status:
Completed

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1565431&HistoricalAwards=false
Keywords:
SHF Large Collaborative Research Next

Project Abstract

This award was partially supported by the CIF21 Software Reuse Venture, whose goals are to support pathways towards sustainable software elements through their reuse, and to emphasize the critical role of reusable software elements in a sustainable software cyberinfrastructure to support computational and data-enabled science and engineering.

Parallel programming based on MPI (Message Passing Interface) is being used with increasing frequency in academia and government (defense and non-defense uses), as well as in emerging areas such as scalable machine learning and big data analytics. The emergence of Dense Many-Core (DMC) architectures like Intel's Knights Landing (KNL) and accelerator/co-processor architectures like NVIDIA GPGPUs is enabling the design of systems with high compute density. This, coupled with the availability of Remote Direct Memory Access (RDMA)-enabled commodity networking technologies like InfiniBand, RoCE, and 10/40GigE with iWARP, is fueling the growth of multi-petaflop and exaflop systems. These DMC architectures have the following unique characteristics: deeper levels of hierarchical memory; revolutionary network interconnects; and heterogeneous compute power and data movement costs (with heterogeneity at chip-level and node-level). For these emerging systems, a combination of MPI and other programming models, known as MPI+X (where X can be PGAS, Tasks, OpenMP, OpenACC, or CUDA), is being targeted. The current-generation communication protocols and mechanisms for MPI+X programming models cannot efficiently support the emerging DMC architectures. This leads to the following broad challenges: 1) How can high-performance and scalable communication mechanisms for next-generation DMC architectures be designed to support MPI+X (including Task-based) programming models? and 2) How can the current and next-generation applications be designed/co-designed with the proposed communication mechanisms?

A synergistic and comprehensive research plan, involving computer scientists from The Ohio State University (OSU) and Ohio Supercomputer Center (OSC) and computational scientists from the Texas Advanced Computing Center (TACC), San Diego Supercomputer Center (SDSC) and University of California San Diego (UCSD), is proposed to address the above broad challenges with innovative solutions. The research will be driven by a set of applications from established NSF computational science researchers running large-scale simulations on Stampede and Comet and other systems at OSC and OSU. The proposed designs will be integrated into the widely used MVAPICH2 library and made available for public use. Multiple graduate and undergraduate students will be trained under this project as future scientists and engineers in HPC. The established national-scale training and outreach programs at TACC, SDSC and OSC will be used to disseminate the results of this research to XSEDE users. Tutorials will be organized at XSEDE, SC and other conferences to share the research results and experience with the community.