SHF: Medium: Collaborative Research: Next-Generation Message Passing for Parallel Programming: Resiliency, Time-to-Solution, Performance-Portability, Scalability, and QoS
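Basic information
- Award number: 1822191
- Principal investigator:
- Amount: $523,700
- Host institution:
- Host institution country: United States
- Award type: Continuing Grant
- Fiscal year: 2017
- Funding country: United States
- Start and end dates: 2017-10-01 to 2022-05-31
- Project status: Completed
- Source:
- Keywords:
Project abstract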
Parallel programming based on MPI is being used with increasing frequency in academia and government (defense and non-defense uses), as well as in emerging uses such as scalable machine learning and big data analytics. Emerging supercomputer systems will have more faults, and MPI needs to be able to work around such faults, rather than causing an entire application to fail, to be appropriate to these emerging situations. Collaborative, transformative message passing research for High Performance Computing (HPC) that is critical to performance-portable parallel programming in new and forthcoming scalable systems (with a strategy of "best practice-first, standardization-later") is being reduced to practice. A substantial subset of the Message Passing Interface (MPI-3/4) application programmer interface is being made fault tolerant through extensions with weak collective transactions that synchronize between parallel tasks. This research studies a novel model that localizes faults, provides tunable fault-free overhead, allows for multiple kinds of faults, enables hierarchical recovery, and is data-parallel relevant. Fault modeling of underlying networks is being studied. Application developers control the granularity and fault-free overhead in this effort. Performance and scalability results of the middleware prototype are being demonstrated principally through compact applications that relate to real use cases of practical and academic interest. The impact of this work ranges from users of the largest supercomputers in government labs to practical clusters that have long-running, time-critical applications, and to space-based and other parallel processing in "hostile" environments where faults occur more frequently than in past years. The project is producing usable free software that will be widely shared in the community, as well as guidance on how better parallel programs can be written in academia, industry, and government.
The project also provides guidelines for how to update existing or legacy programs to use the new capabilities that are being reduced to practice.
Project outputs
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Design of a Portable Implementation of Partitioned Point-to-Point Communication Primitives
- DOI:10.1145/3458744.3474046
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Worley, Andrew;Prema Soundararajan, Prema;Schafer, Derek;Bangalore, Purushotham;Grant, Ryan;Dosanjh, Matthew;Skjellum, Anthony;Ghafoor, Sheikh
- Corresponding author: Ghafoor, Sheikh
Other publications by Anthony Skjellum
Understanding GPU Triggering APIs for MPI+X Communication
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Patrick G. Bridges;Anthony Skjellum;E. Suggs;Derek Schafer;P. Bangalore
- Corresponding author: P. Bangalore
Other grants of Anthony Skjellum
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
- Award number: 2412182
- Fiscal year: 2023
- Funding amount: $523,700
- Award type: Standard Grant
Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability
- Award number: 2405142
- Fiscal year: 2023
- Funding amount: $523,700
- Award type: Standard Grant
Beginnings: Creating and Sustaining a Diverse Community of Expertise in Quantum Information Science (EQUIS) Across the Southeastern United States
- Award number: 2414461
- Fiscal year: 2023
- Funding amount: $523,700
- Award type: Cooperative Agreement
Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability
- Award number: 2151020
- Fiscal year: 2022
- Funding amount: $523,700
- Award type: Standard Grant
CC* Networking Infrastructure: Advancing High-speed Networking at UTC for Research and Education
- Award number: 1925598
- Fiscal year: 2019
- Funding amount: $523,700
- Award type: Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
- Award number: 1918987
- Fiscal year: 2019
- Funding amount: $523,700
- Award type: Standard Grant
Collaborative Research: Software Engineering Workforce Development in High Performance Computing for Digital Twins
- Award number: 1935628
- Fiscal year: 2019
- Funding amount: $523,700
- Award type: Standard Grant
CC* Compute: A Cost-Effective, 2,048 Core InfiniBand Cluster at UTC for Campus Research and Education
- Award number: 1925603
- Fiscal year: 2019
- Funding amount: $523,700
- Award type: Standard Grant
Collaborative Research: CICI: Regional: SouthEast SciEntific Cybersecurity for University Research (SouthEast SECURE)
- Award number: 1812404
- Fiscal year: 2017
- Funding amount: $523,700
- Award type: Standard Grant
SHF: Small: Collaborative Research: Coupling Computation and Communication in FPGA-Enhanced Clouds and Clusters
- Award number: 1821431
- Fiscal year: 2017
- Funding amount: $523,700
- Award type: Standard Grant
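Similar grants from the National Natural Science Foundation of China (NSFC)
Study of plasmon-enhanced optical response in composite low-dimensional topological materials
- Award number: 12374288
- Approval year: 2023
- Funding amount: CNY 520,000
- Award type: General Program
The missing middle of medium-sized enterprises from the perspective of managerial markets and intervention in the division of labor: stylized facts, internal mechanisms, and optimization paths
- Award number: 72374217
- Approval year: 2023
- Funding amount: CNY 410,000
- Award type: General Program
Multiscale algorithms and numerical simulation of plasma in tokamak divertors
- Award number: 12371432
- Approval year: 2023
- Funding amount: CNY 435,000
- Award type: General Program
Dark matter distribution near intermediate-mass black holes and gravitational-wave echo detection in IMRI systems
- Award number: 12365008
- Approval year: 2023
- Funding amount: CNY 320,000
- Award type: Regional Science Fund Project
Physical mechanisms of rapid intensification of asymmetric tropical cyclones under moderate vertical wind shear
- Award number: 42305004
- Approval year: 2023
- Funding amount: CNY 300,000
- Award type: Young Scientists Fund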
Similar overseas grants
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
- Award number: 2403134
- Fiscal year: 2024
- Funding amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402804
- Fiscal year: 2024
- Funding amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
- Award number: 2403408
- Fiscal year: 2024
- Funding amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Toward Understandability and Interpretability for Neural Language Models of Source Code
- Award number: 2423813
- Fiscal year: 2024
- Funding amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402806
- Fiscal year: 2024
- Funding amount: $523,700
- Award type: Standard Grant