Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability
合作研究:EAGER:实时策略和同步时间分配机制,以增强百亿亿次性能-可移植性和可预测性
基本信息
- 批准号:2405142
- 负责人:
- 金额:$ 7.45万
- 依托单位:
- 依托单位国家:美国
- 项目类别:Standard Grant
- 财政年份:2023
- 资助国家:美国
- 起止时间:2023-10-01 至 2024-05-31
- 项目状态:已结题
- 来源:
- 关键词:
项目摘要
Advances throughout science and engineering have for several decades been driven by High Performance Computing (HPC), with the pace of discovery accelerating in concert with continued innovation in computing capabilities. But as semiconductor technology now faces fundamental physical limits, even while large-scale systems are reaching warehouse scales, new approaches are becoming essential to achieving efficient use of computing resources. In particular, given this divergence of scales, HPC systems have necessarily become more distributed and asynchronous (in the sense that system clocks are asynchronous), resulting in increasingly variable and unpredictable execution. While these effects are recognized as critical hindrances to HPC performance, the mechanisms are not yet fully understood. What is known, however, is that much HPC infrastructure is tasked with dealing with inefficiency derived from asynchrony, variability, and unpredictability, leading to a deep and complex hardware/software support stack. The project team's hypothesis is that while each stack element provides a local solution, it may also exacerbate the global problem: that complexity has resulted in more variability, not less, and made determining its causes more difficult. This project explores the possibility of reversing the trend of ever-increasing complexity by removing and simplifying support layers. This strategy’s achievable gains remain limited, however, while the underlying cause, execution asynchrony, remains unaddressed. The approach begins by leveraging recently developed technology that enables clocks to remain extremely accurate even when distributed on a planetary scale. Such accurate, distributed clocks serve to underpin a virtuous cycle where synchrony establishes baseline predictability, which, in turn, reduces variability, and at each stage of the cycle enables reduction in the complexity of the support stack. A benefit of this approach is that the individual steps are largely simple and can be applied directly to existing software systems. This one-year project aims to obtain early findings and practical demonstrations for the importance of synchrony and predictability to increase HPC compute efficiency and thereby improve large-scale program execution. Five tasks are conducted. The first is to demonstrate the feasibility of accurate clock distribution by augmenting existing HPC network infrastructure. The second is to demonstrate the application of synchrony in the establishing a virtuous cycle enabling simplifications to the software/system support stack. The third is to devise mechanisms to model, measure, and validate systems using the proposed methods. The fourth is to investigate the relative benefits of applying the synchrony-based virtuous cycle with respect to various application classes. The fifth is to demonstrate the overall efficacy of the proposed approach through a case study involving a production application. Overall, the project works to determine whether added synchronization through accurate clocks enables significant improvements to HPC computations in terms of how efficiently they use computational resources.This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
几十年来,高性能计算 (HPC) 推动了整个科学和工程的进步,随着计算能力的不断创新,发现的步伐不断加快,但半导体技术现在面临着基本的物理限制,即使是在大规模系统中也是如此。正在达到仓库规模,新的方法对于实现计算资源的有效利用至关重要,特别是考虑到这种规模的差异,HPC 系统必然变得更加分布式和异步(从系统时钟异步的意义上来说),从而导致越来越多的变化。和不可预测的执行。虽然这些影响被认为是 HPC 性能的关键障碍,但我们尚未完全了解其机制,但众所周知,许多 HPC 基础设施的任务是处理因异步、可变性和不可预测性而导致的低效率问题。项目团队的假设是,虽然每个堆栈元素都提供了本地解决方案,但它也可能加剧全局问题:复杂性导致了更多的可变性,而不是更少,并使确定其原因变得更加困难。该项目探讨了通过消除和简化支持层来扭转日益增加的复杂性趋势的可能性,但是,该策略可实现的收益仍然有限,而执行异步这一根本原因仍未得到解决。该方法首先利用了最近开发的技术。即使分布在全球范围内,时钟也能保持极其精确,这种精确的分布式时钟可以支撑良性循环,在这种循环中,同步建立了基线可预测性,从而降低了可预测性。这种方法的一个好处是各个步骤非常简单,并且可以直接应用于现有的软件系统。早期发现和实际演示了同步和可预测性对于提高 HPC 计算效率并从而改善大规模程序执行的重要性。第一个任务是通过增强现有 HPC 网络基础设施来演示精确时钟分配的可行性。是为了演示同步的应用建立一个能够简化软件/系统支持堆栈的良性循环,第三是使用所提出的方法设计建模、测量和验证系统的机制,第四是研究应用基于同步的良性循环的相对好处。第五是通过涉及生产应用程序的案例研究来证明所提出方法的整体功效。总体而言,该项目致力于确定通过精确时钟添加同步是否能够显着改进 HPC 计算。该奖项反映了 NSF 的法定使命,并通过使用基金会的智力价值和更广泛的影响审查标准进行评估,被认为值得支持。
项目成果
期刊论文数量(0)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)
数据更新时间:{{ journalArticles.updateTime }}
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
数据更新时间:{{ journalArticles.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ monograph.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ sciAawards.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ conferencePapers.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ patent.updateTime }}
Anthony Skjellum其他文献
A Holistic Analysis of Internet of Things (IoT) Security: Principles, Practices, and New Perspectives
物联网 (IoT) 安全的整体分析:原理、实践和新视角
- DOI:
10.3390/fi16020040 - 发表时间:
2024-01-24 - 期刊:
- 影响因子:3.4
- 作者:
M. Hossain;Golam Kayas;Ragib Hasan;Anthony Skjellum;S. Noor;S. Islam - 通讯作者:
S. Islam
SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC Applications
SmartFuse:可重新配置的智能交换机,以加速 HPC 应用中的融合集合
- DOI:
10.1145/3650200.3656616 - 发表时间:
2024-05-30 - 期刊:
- 影响因子:0
- 作者:
Pouya Haghi;Cheng Tan;Anqi Guo;Chunshu Wu;Dongfang Liu;Ang Li;Anthony Skjellum;Tong Geng;Martin C. Herbordt - 通讯作者:
Martin C. Herbordt
Extending the message passing interface (MPI)
扩展消息传递接口 (MPI)
- DOI:
10.1109/splc.1994.376998 - 发表时间:
1994-10-12 - 期刊:
- 影响因子:0
- 作者:
Anthony Skjellum;Nathan E. Doss;Kishore Viswanathan;Aswini Chowdappa;P. Bangalore - 通讯作者:
P. Bangalore
10 Über Message-Passing hinaus
10 Über 消息传递 hinaus
- DOI:
10.1524/9783486841008.259 - 发表时间:
2007-01-31 - 期刊:
- 影响因子:0
- 作者:
William Gropp;Ewing L. Lusk;Anthony Skjellum - 通讯作者:
Anthony Skjellum
Understanding GPU Triggering APIs for MPI+X Communication
了解用于 MPI X 通信的 GPU 触发 API
- DOI:
- 发表时间:
2024 - 期刊:
- 影响因子:0
- 作者:
Patrick G. Bridges;Anthony Skjellum;E. Suggs;Derek Schafer;P. Bangalore - 通讯作者:
P. Bangalore
Anthony Skjellum的其他文献
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
{{ truncateString('Anthony Skjellum', 18)}}的其他基金
Beginnings: Creating and Sustaining a Diverse Community of Expertise in Quantum Information Science (EQUIS) Across the Southeastern United States
起点:在美国东南部创建并维持一个多元化的量子信息科学 (EQUIS) 专业社区
- 批准号:
2414461 - 财政年份:2023
- 资助金额:
$ 7.45万 - 项目类别:
Cooperative Agreement
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
SPX:协作研究:促进超大规模计算的智能通信结构
- 批准号:
2412182 - 财政年份:2023
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability
合作研究:EAGER:实时策略和同步时间分配机制,以增强百亿亿次性能-可移植性和可预测性
- 批准号:
2151020 - 财政年份:2022
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
CC* Compute: A Cost-Effective, 2,048 Core InfiniBand Cluster at UTC for Campus Research and Education
CC* 计算:UTC 的具有成本效益的 2,048 核心 InfiniBand 集群,用于校园研究和教育
- 批准号:
1925603 - 财政年份:2019
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
CC* Networking Infrastructure: Advancing High-speed Networking at UTC for Research and Education
CC* 网络基础设施:推进 UTC 的研究和教育高速网络
- 批准号:
1925598 - 财政年份:2019
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
CC* Compute: A Cost-Effective, 2,048 Core InfiniBand Cluster at UTC for Campus Research and Education
CC* 计算:UTC 的具有成本效益的 2,048 核心 InfiniBand 集群,用于校园研究和教育
- 批准号:
1925603 - 财政年份:2019
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
Collaborative Research: Software Engineering Workforce Development in High Performance Computing for Digital Twins
协作研究:数字孪生高性能计算中的软件工程劳动力开发
- 批准号:
1935628 - 财政年份:2019
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
SPX:协作研究:促进超大规模计算的智能通信结构
- 批准号:
1918987 - 财政年份:2019
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
CC* Networking Infrastructure: Advancing High-speed Networking at UTC for Research and Education
CC* 网络基础设施:推进 UTC 的研究和教育高速网络
- 批准号:
1925598 - 财政年份:2019
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
Collaborative Research: CICI: Regional: SouthEast SciEntific Cybersecurity for University Research (SouthEast SECURE)
合作研究:CICI:区域:东南大学研究科学网络安全 (SouthEast SECURE)
- 批准号:
1812404 - 财政年份:2017
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
相似国自然基金
基于肿瘤病理图片的靶向药物敏感生物标志物识别及统计算法的研究
- 批准号:82304250
- 批准年份:2023
- 资助金额:30 万元
- 项目类别:青年科学基金项目
肠道普拉梭菌代谢物丁酸抑制心室肌铁死亡改善老龄性心功能不全的机制研究
- 批准号:82300430
- 批准年份:2023
- 资助金额:30 万元
- 项目类别:青年科学基金项目
社会网络关系对公司现金持有决策影响——基于共御风险的作用机制研究
- 批准号:72302067
- 批准年份:2023
- 资助金额:30 万元
- 项目类别:青年科学基金项目
面向图像目标检测的新型弱监督学习方法研究
- 批准号:62371157
- 批准年份:2023
- 资助金额:50 万元
- 项目类别:面上项目
面向开放域对话系统信息获取的准确性研究
- 批准号:62376067
- 批准年份:2023
- 资助金额:51 万元
- 项目类别:面上项目
相似海外基金
Collaborative Research: EAGER: Designing Nanomaterials to Reveal the Mechanism of Single Nanoparticle Photoemission Intermittency
合作研究:EAGER:设计纳米材料揭示单纳米粒子光电发射间歇性机制
- 批准号:
2345582 - 财政年份:2024
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
EAGER/Collaborative Research: An LLM-Powered Framework for G-Code Comprehension and Retrieval
EAGER/协作研究:LLM 支持的 G 代码理解和检索框架
- 批准号:
2347623 - 财政年份:2024
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
EAGER/Collaborative Research: An LLM-Powered Framework for G-Code Comprehension and Retrieval
EAGER/协作研究:LLM 支持的 G 代码理解和检索框架
- 批准号:
2347624 - 财政年份:2024
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
Collaborative Research: EAGER: IMPRESS-U: Groundwater Resilience Assessment through iNtegrated Data Exploration for Ukraine (GRANDE-U)
合作研究:EAGER:IMPRESS-U:通过乌克兰综合数据探索进行地下水恢复力评估 (GRANDE-U)
- 批准号:
2409395 - 财政年份:2024
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant
Collaborative Research: EAGER: The next crisis for coral reefs is how to study vanishing coral species; AUVs equipped with AI may be the only tool for the job
合作研究:EAGER:珊瑚礁的下一个危机是如何研究正在消失的珊瑚物种;
- 批准号:
2333604 - 财政年份:2024
- 资助金额:
$ 7.45万 - 项目类别:
Standard Grant