SHF: Medium: Collaborative Research: ANACIN-X: Analysis and modeling of Nondeterminism and Associated Costs in eXtreme scale applications
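Basic Information
- Grant number: 1900888
- Amount: $915,700
- Host institution country: United States
- Project category: Continuing Grant
- Fiscal year: 2019
- Funding country: United States
- Project period: 2019-08-01 to 2025-07-31
- Project status: Ongoing
Project Abstract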
Nondeterminism (i.e., the tendency of a scientific application to exhibit different numerical results and execution patterns across multiple executions) is an increasingly entrenched property of high performance computing (HPC) applications as the scientific community moves its simulations to larger and highly heterogeneous computing systems. Nondeterminism can drastically increase the cost of scientific reproducibility in terms of developer time and computational resources, of debugging applications when moving from a smaller to a larger scale or from one platform to another, and of ensuring fault tolerance when an execution may need to recover from a system fault. These three challenges can ultimately compromise the amount and quality of scientific discovery through computer simulations. Tools for addressing aspects of nondeterminism have emerged, including record-and-replay (R&R) techniques, which monitor and record changes in program state over one execution of an application (the recorded execution) and then reproduce those changes, and thus the behavior of the application, during a subsequent execution (the replayed execution). However, these tools impose overheads on the underlying application and thus present HPC users with the problem of balancing tool utility against tool overhead. HPC users may opt not to use a tool at all rather than deal with unpredictable overheads. This project supports HPC users by modeling the relationship between application nondeterminism and variability in tool overhead, and uses this knowledge to identify hot spots of tool cost as well as regions of executions that trigger nondeterministic behavior in the applications. The aim of the project is to model nondeterministic executions by identifying points of nondeterminism (motifs) in executions of HPC applications, and to combine this motif modeling with R&R techniques to study the cost that particular motifs impose on those techniques. The outcome of this project impacts four communities: application developers, through the identification and management of sources of unintended nondeterminism; the HPC research community working on fault tolerance, resilience, and reproducibility at exascale; data center administrators who use evaluation tools for and with application developers; and educators and trainers in resource-constrained environments, who can promote HPC without needing access to high-end, expensive computers.
This project advances the study of nondeterministic HPC applications by studying the recording costs of R&R tools and by defining strategies so that these tools can scale to the exascale domain. In addition to the more commonly studied factors of time and memory overhead, the project integrates power usage into the modeling. The project relies on graph theory to develop expressive and scalable graph-based representations of the dependencies between events in a program, and develops algorithms to identify motifs in the graph that indicate points of nondeterminism. These motifs are applied to quantify the associated costs of nondeterminism, including developing metrics to measure dissimilarities between different executions, modeling the costs of recording executions, and assessing the overhead of recordings. Based on these motifs, the project generates "fingerprints" (i.e., holistic characterizations of how and where nondeterminism manifests during application executions) of real-world HPC applications, including (1) N-body problems (e.g., simulating particle, atomic, and planetary interactions); (2) graph analytics (e.g., the Graph500 benchmark); (3) bioinformatics (e.g., mpiBLAST); and (4) task-based data analysis applications (e.g., WordCount, Join, and Octree Clustering on top of MapReduce-over-MPI frameworks). The fingerprints illuminate previously overlooked similarities between the nondeterminism that manifests across multiple classes of applications and allow users to probe the relationship between process communication patterns, the motifs of the actual resulting executions, and the regions of those executions in which tool overhead accumulates for nondeterministic HPC applications.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
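To make the abstract's idea of graph-based event representations and execution dissimilarity metrics concrete, the following is a minimal sketch under stated assumptions, not the project's actual method. It assumes each execution is summarized as an event graph whose nodes are (rank, local event index) pairs, with program-order edges within a rank and message edges from a send to its matching receive. The helper names `build_event_graph` and `jaccard_distance`, and the two example runs, are hypothetical. The Jaccard distance over edge sets is a deliberately simple stand-in for whatever graph comparison the project uses.

```python
from typing import List, Set, Tuple

Event = Tuple[int, int]      # (process rank, local event index)
Edge = Tuple[Event, Event]   # directed dependency between two events


def build_event_graph(num_events: List[int], messages: List[Edge]) -> Set[Edge]:
    """Return the edge set of a (hypothetical) event graph for one execution.

    num_events[r] is the number of events recorded on rank r.
    messages lists (send_event, recv_event) pairs observed in that execution.
    """
    edges: Set[Edge] = set()
    for rank, count in enumerate(num_events):
        # Program-order edges link consecutive events on the same rank.
        for i in range(count - 1):
            edges.add(((rank, i), (rank, i + 1)))
    # Message edges link each send to the receive that matched it.
    edges.update(messages)
    return edges


def jaccard_distance(a: Set[Edge], b: Set[Edge]) -> float:
    """Dissimilarity in [0, 1]: 0 for identical edge sets, 1 for disjoint ones."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two assumed runs of the same program: ranks 1 and 2 each send one message
# to rank 0, which posts two wildcard receives (its events 0 and 1).
# The runs differ only in which sender is matched first.
events_per_rank = [2, 1, 1]
run_a = build_event_graph(events_per_rank, [((1, 0), (0, 0)), ((2, 0), (0, 1))])
run_b = build_event_graph(events_per_rank, [((2, 0), (0, 0)), ((1, 0), (0, 1))])

print(jaccard_distance(run_a, run_a))  # a run compared with itself
print(jaccard_distance(run_a, run_b))  # runs with different message matching
```

In this toy setup the two runs share only the program-order edge on rank 0, so the distance is driven entirely by the differing message matches. A set-overlap metric like this ignores graph structure beyond individual edges, which is one reason richer graph comparison methods may be preferable in practice.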
Project Outputs
Journal articles (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
A Research-Based Course Module to Study Non-determinism in High Performance Applications
- DOI: 10.1109/ipdpsw55747.2022.00067
- Published: 2022
- Impact factor: 0
- Authors: Bell, Patrick; Suarez, Kae; Fossum, Barbara; Chapp, Dylan; Bhowmick, Sanjukta; Taufer, Michela
- Corresponding author: Taufer, Michela
ANACIN-X: A software framework for studying non-determinism in MPI applications
- DOI: 10.1016/j.simpa.2021.100151
- Published: 2021
- Impact factor: 0
- Authors: Bell, Patrick; Suarez, Kae; Chapp, Dylan; Tan, Nigel; Bhowmick, Sanjukta; Taufer, Michela
- Corresponding author: Taufer, Michela
A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance Computing
- DOI: 10.1177/10943420231166610
- Published: 2023-04-05
- Impact factor: 3.1
- Authors: Bhowmick, Sanjukta; Bell, Patrick; Taufer, Michela
- Corresponding author: Taufer, Michela
Identifying Degree and Sources of Non-Determinism in MPI Applications Via Graph Kernels
- DOI: 10.1109/tpds.2021.3081530
- Published: 2021
- Impact factor: 5.3
- Authors: Chapp, Dylan; Tan, Nigel; Bhowmick, Sanjukta; Taufer, Michela
- Corresponding author: Taufer, Michela
Other publications by Michela Taufer
Enhancing Scientific Research with FAIR Digital Objects in the National Science Data Fabric
- Published: 2023
- Impact factor: 0
- Authors: Michela Taufer; Heberth Martinez; Jakob Luettgau; Lauren Whitnah; G. Scorzelli; P. Newell; Aashish Panta; P. Bremer; Douglas Fils; Christine R. Kirkpatrick; V. Pascucci; Kathryn Mohror; J. Shalf
- Corresponding author: J. Shalf
Integrating FAIR Digital Objects (FDOs) into the National Science Data Fabric (NSDF) to Revolutionize Dataflows for Scientific Discovery
- Impact factor: 0
- Authors: Michela Taufer; Heberth Martinez; Jakob Luettgau; Lauren Whitnah; Giorgio Scorzelli; Pania Newel; Aashish Panta; Timo Bremer; Doug Fils; Christine R. Kirkpatrick; Nina McCurdy; V. Pascucci
Other grants by Michela Taufer
EAGER: A Comprehensive Approach for Generating, Sharing, Searching, and Using High-Resolution Terrain Parameters
- Grant number: 2334945
- Fiscal year: 2023
- Funding amount: $915,700
- Project category: Standard Grant
Collaborative Research: SHF: Small: Model-driven Design and Optimization of Dataflows for Scientific Applications
- Grant number: 2331152
- Fiscal year: 2023
- Funding amount: $915,700
- Project category: Standard Grant
SHF: Small: Methods, Workflows, and Data Commons for Reducing Training Costs in Neural Architecture Search on High-Performance Computing Platforms
- Grant number: 2223704
- Fiscal year: 2022
- Funding amount: $915,700
- Project category: Standard Grant
Collaborative Research: Elements: SENSORY: Software Ecosystem for kNowledge diScOveRY - a data-driven framework for soil moisture applications
- Grant number: 2103845
- Fiscal year: 2021
- Funding amount: $915,700
- Project category: Standard Grant
Collaborative Research: PPoSS: Planning: Performance Scalability, Trust, and Reproducibility: A Community Roadmap to Robust Science in High-throughput Applications
- Grant number: 2028923
- Fiscal year: 2020
- Funding amount: $915,700
- Project category: Standard Grant
Collaborative Research: EAGER: Advancing Reproducibility in Multi-Messenger Astrophysics
- Grant number: 2041977
- Fiscal year: 2020
- Funding amount: $915,700
- Project category: Standard Grant
Collaborative: EAGER: Exploring and Advancing the State of the Art in Robust Science in Gravitational Wave Physics
- Grant number: 1841399
- Fiscal year: 2018
- Funding amount: $915,700
- Project category: Standard Grant
Collaborative: EAGER: Exploring and Advancing the State of the Art in Robust Science in Gravitational Wave Physics
- Grant number: 1823372
- Fiscal year: 2018
- Funding amount: $915,700
- Project category: Standard Grant
SHF: Medium: Collaborative Research: A comprehensive methodology to pursue reproducible accuracy in ensemble scientific simulations on multi- and many-core platforms
- Grant number: 1841552
- Fiscal year: 2018
- Funding amount: $915,700
- Project category: Standard Grant
BIGDATA: IA: Collaborative Research: In Situ Data Analytics for Next Generation Molecular Dynamics Workflows
- Grant number: 1841758
- Fiscal year: 2018
- Funding amount: $915,700
- Project category: Standard Grant
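Similar grants from the National Natural Science Foundation of China
Study of plasmon-enhanced optical response in composite low-dimensional topological materials
- Grant number: 12374288
- Approval year: 2023
- Funding amount: CNY 520,000
- Project category: General Program
Physical mechanisms of rapid intensification of asymmetric tropical cyclones under moderate vertical wind shear
- Grant number: 42305004
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Development of a source apportionment method for atmospheric semi-/intermediate-volatility organic compounds based on volatility distribution and oxidation correction
- Grant number: 42377095
- Approval year: 2023
- Funding amount: CNY 490,000
- Project category: General Program
Quantum surface plasmons of medium-sized metal nanoparticles studied with machine learning and classical electrodynamics
- Grant number: 22373002
- Approval year: 2023
- Funding amount: CNY 500,000
- Project category: General Program
Multiscale algorithms and numerical simulation of plasma in tokamak divertors
- Grant number: 12371432
- Approval year: 2023
- Funding amount: CNY 435,000
- Project category: General Program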
Similar overseas grants
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
- Grant number: 2403134
- Fiscal year: 2024
- Funding amount: $915,700
- Project category: Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Grant number: 2402804
- Fiscal year: 2024
- Funding amount: $915,700
- Project category: Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
- Grant number: 2403408
- Fiscal year: 2024
- Funding amount: $915,700
- Project category: Standard Grant
Collaborative Research: SHF: Medium: Toward Understandability and Interpretability for Neural Language Models of Source Code
- Grant number: 2423813
- Fiscal year: 2024
- Funding amount: $915,700
- Project category: Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Grant number: 2402806
- Fiscal year: 2024
- Funding amount: $915,700
- Project category: Standard Grant