Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability
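Basic information
- Award number: 2405142
- Principal investigator:
- Amount: $74,500
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2023
- Funding country: United States
- Project period: 2023-10-01 to 2024-05-31
- Project status: Completed
- Source:
- Keywords:
Project abstract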
Advances throughout science and engineering have for several decades been driven by High Performance Computing (HPC), with the pace of discovery accelerating in concert with continued innovation in computing capabilities. But as semiconductor technology now faces fundamental physical limits, even while large-scale systems are reaching warehouse scales, new approaches are becoming essential to achieving efficient use of computing resources. In particular, given this divergence of scales, HPC systems have necessarily become more distributed and asynchronous (in the sense that system clocks are asynchronous), resulting in increasingly variable and unpredictable execution. While these effects are recognized as critical hindrances to HPC performance, the mechanisms are not yet fully understood. What is known, however, is that much HPC infrastructure is tasked with dealing with inefficiency derived from asynchrony, variability, and unpredictability, leading to a deep and complex hardware/software support stack. The project team's hypothesis is that while each stack element provides a local solution, it may also exacerbate the global problem: that complexity has resulted in more variability, not less, and made determining its causes more difficult. This project explores the possibility of reversing the trend of ever-increasing complexity by removing and simplifying support layers. This strategy’s achievable gains remain limited, however, while the underlying cause, execution asynchrony, remains unaddressed. The approach begins by leveraging recently developed technology that enables clocks to remain extremely accurate even when distributed on a planetary scale. Such accurate, distributed clocks serve to underpin a virtuous cycle where synchrony establishes baseline predictability, which, in turn, reduces variability, and at each stage of the cycle enables reduction in the complexity of the support stack. 
A benefit of this approach is that the individual steps are largely simple and can be applied directly to existing software systems. This one-year project aims to obtain early findings and practical demonstrations of the importance of synchrony and predictability for increasing HPC compute efficiency and thereby improving large-scale program execution. Five tasks are conducted. The first is to demonstrate the feasibility of accurate clock distribution by augmenting existing HPC network infrastructure. The second is to demonstrate the application of synchrony in establishing a virtuous cycle that enables simplifications to the software/system support stack. The third is to devise mechanisms to model, measure, and validate systems using the proposed methods. The fourth is to investigate the relative benefits of applying the synchrony-based virtuous cycle with respect to various application classes. The fifth is to demonstrate the overall efficacy of the proposed approach through a case study involving a production application. Overall, the project works to determine whether added synchronization through accurate clocks enables significant improvements to HPC computations in terms of how efficiently they use computational resources. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Anthony Skjellum
Understanding GPU Triggering APIs for MPI+X Communication
- DOI:
- Publication year: 2024
- Journal:
- Impact factor: 0
- Authors: Patrick G. Bridges; Anthony Skjellum; E. Suggs; Derek Schafer; P. Bangalore
- Corresponding author: P. Bangalore
Other grants by Anthony Skjellum
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
- Award number: 2412182
- Fiscal year: 2023
- Amount: $74,500
- Award type: Standard Grant
Beginnings: Creating and Sustaining a Diverse Community of Expertise in Quantum Information Science (EQUIS) Across the Southeastern United States
- Award number: 2414461
- Fiscal year: 2023
- Amount: $74,500
- Award type: Cooperative Agreement
Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability
- Award number: 2151020
- Fiscal year: 2022
- Amount: $74,500
- Award type: Standard Grant
CC* Networking Infrastructure: Advancing High-speed Networking at UTC for Research and Education
- Award number: 1925598
- Fiscal year: 2019
- Amount: $74,500
- Award type: Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
- Award number: 1918987
- Fiscal year: 2019
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: Software Engineering Workforce Development in High Performance Computing for Digital Twins
- Award number: 1935628
- Fiscal year: 2019
- Amount: $74,500
- Award type: Standard Grant
CC* Compute: A Cost-Effective, 2,048 Core InfiniBand Cluster at UTC for Campus Research and Education
- Award number: 1925603
- Fiscal year: 2019
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: CICI: Regional: SouthEast SciEntific Cybersecurity for University Research (SouthEast SECURE)
- Award number: 1812404
- Fiscal year: 2017
- Amount: $74,500
- Award type: Standard Grant
SHF: Medium: Collaborative Research: Next-Generation Message Passing for Parallel Programming: Resiliency, Time-to-Solution, Performance-Portability, Scalability, and QoS
- Award number: 1822191
- Fiscal year: 2017
- Amount: $74,500
- Award type: Continuing Grant
SHF: Small: Collaborative Research: Coupling Computation and Communication in FPGA-Enhanced Clouds and Clusters
- Award number: 1821431
- Fiscal year: 2017
- Amount: $74,500
- Award type: Standard Grant
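Similar grants from the National Natural Science Foundation of China (NSFC)
Research on highly integrated microwave/millimeter-wave antennas supporting two-dimensional millimeter-wave beam scanning
- Award number: 62371263
- Approval year: 2023
- Amount: CNY 520,000
- Award type: General Program
Study of tandem Heck/denitrogenative rearrangement reactions of hydrazones
- Award number: 22301211
- Approval year: 2023
- Amount: CNY 300,000
- Award type: Young Scientists Fund
Study of synergistic performance regulation and dendrite suppression mechanisms in aqueous zinc-ion batteries
- Award number: 52364038
- Approval year: 2023
- Amount: CNY 330,000
- Award type: Regional Science Fund Program
Pathogenic role and mechanism of TSPYL1 mutations in sudden infant death syndrome, studied with a human serotonergic neuron reporter system
- Award number: 82371176
- Approval year: 2023
- Amount: CNY 490,000
- Award type: General Program
Mechanism of FOXO3 m6A methylation-induced trophoblast senescence in kidney-tonifying therapy for spontaneous abortion
- Award number: 82305286
- Approval year: 2023
- Amount: CNY 300,000
- Award type: Young Scientists Fund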
Similar overseas grants
Collaborative Research: EAGER: IMPRESS-U: Groundwater Resilience Assessment through iNtegrated Data Exploration for Ukraine (GRANDE-U)
- Award number: 2409395
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
EAGER/Collaborative Research: An LLM-Powered Framework for G-Code Comprehension and Retrieval
- Award number: 2347624
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
EAGER/Collaborative Research: Revealing the Physical Mechanisms Underlying the Extraordinary Stability of Flying Insects
- Award number: 2344215
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: EAGER: Designing Nanomaterials to Reveal the Mechanism of Single Nanoparticle Photoemission Intermittency
- Award number: 2345581
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: EAGER: Designing Nanomaterials to Reveal the Mechanism of Single Nanoparticle Photoemission Intermittency
- Award number: 2345582
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant