SHF: Medium: Collaborative Research: ANACIN-X: Analysis and modeling of Nondeterminism and Associated Costs in eXtreme scale applications

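Basic Information

  • Award number:
    1900888
  • Principal investigator:
  • Amount:
    approx. $915,700
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2019
  • Funding country:
    United States
  • Period:
    2019-08-01 to 2025-07-31
  • Project status:
    Not yet closed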

Project Abstract

Nondeterminism (i.e., the tendency of a scientific application to exhibit different numerical results and execution patterns across multiple executions) is an increasingly entrenched property of high performance computing (HPC) applications as the scientific community moves its simulations onto larger and highly heterogeneous computing systems. Nondeterminism can drastically increase the cost, in developer time and computational resources, of three tasks: achieving scientific reproducibility; debugging applications when moving from a smaller to a larger scale or from one platform to another; and ensuring fault tolerance when executions may need to recover from a system fault. These three challenges can ultimately compromise the amount and quality of scientific discovery through computer simulations. Tools for addressing aspects of the nondeterminism problem have emerged, including record-and-replay (R&R) techniques, which monitor and record changes in program state over one execution of an application (the recorded execution) and then reproduce those changes, and thus the behavior of the application, during a subsequent execution (the replayed execution). However, these tools impose overheads on the underlying application and thus present HPC users with the problem of balancing tool utility against tool overhead. HPC users may opt not to use a tool at all rather than deal with unpredictable overheads. This project supports HPC users by modeling the relationship between application nondeterminism and variability in tool overhead, and uses this knowledge to identify hot spots in tool cost as well as the regions of an execution that trigger nondeterministic behavior in the application. The aim of the project is to model nondeterministic executions by determining points (motifs) of nondeterminism in executions of HPC applications, and to apply this motif modeling to R&R techniques in order to study the cost that certain motifs impose on those techniques.
The outcome of this project impacts four communities: application developers, through the identification and management of sources of unintended nondeterminism; the HPC research community working on fault tolerance, resilience, and reproducibility at exascale; data center administrators who use evaluation tools for and with application developers; and educators and trainers in resource-constrained environments, who can promote HPC without needing access to high-end, expensive computers.
This project advances the study of nondeterministic HPC applications by studying the recording costs of R&R tools and by defining strategies so that these tools can scale to the exascale domain. In addition to the more commonly studied factors of time and memory overhead, the project integrates power usage into the modeling. The project relies on graph theory to develop expressive and scalable graph-based representations of the dependencies between events in a program, and develops algorithms to identify motifs in the graph that indicate points of nondeterminism. These motifs are applied to quantify the associated costs of nondeterminism, including developing metrics to measure dissimilarities between different executions, modeling the costs of recording executions, and assessing the overhead of recordings. Based on these motifs, work on this project generates "fingerprints" (i.e., a holistic characterization of how and where nondeterminism manifests during application executions) of real-world HPC applications, including (1) N-body problems (e.g., simulating particle, atomic, and planetary interactions); (2) graph analytics (e.g., the Graph500 benchmark); (3) bioinformatics (e.g., mpiBLAST); and (4) task-based data analysis applications (e.g., WordCount, Join, and Octree Clustering on top of MapReduce-over-MPI frameworks). The fingerprints illuminate previously overlooked similarities between the nondeterminism that manifests across multiple classes of applications and allow users to probe the relationship between process communication patterns, the motifs of the actual resulting executions, and the regions of those executions in which tool overhead accumulates for nondeterministic HPC applications.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
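
The abstract's idea of representing executions as graphs of dependent events and measuring dissimilarity between executions can be illustrated with a small sketch. This is not the project's actual method or the ANACIN-X API: the function names (`event_graph`, `jaccard_distance`, `divergent_receives`), the single-receiver scenario, and the choice of Jaccard distance over edge sets are all assumptions made for illustration. Each run is described only by the order in which one receiving rank matched messages from its senders, which is where message-arrival nondeterminism would show up.

```python
# Hypothetical sketch: compare two executions of a message-passing program
# by building a small event graph for each run and measuring how much the
# graphs differ. All names here are illustrative assumptions.

def event_graph(recv_order, rank=0):
    """Return the edge set of an event graph for one run.

    recv_order lists the sender ranks in the order the receiving rank
    matched their messages; each sender is assumed to send once.
    """
    edges = set()
    for i, sender in enumerate(recv_order):
        recv = ("recv", rank, i)
        # Message edge: the sender's send event precedes the matched receive.
        edges.add((("send", sender, 0), recv))
        # Program-order edge: consecutive receives on the same rank.
        if i > 0:
            edges.add((("recv", rank, i - 1), recv))
    return edges


def jaccard_distance(g1, g2):
    """Dissimilarity in [0, 1]: 0 for identical edge sets, 1 for disjoint."""
    union = g1 | g2
    if not union:
        return 0.0
    return 1.0 - len(g1 & g2) / len(union)


def divergent_receives(order_a, order_b):
    """Indices of receives that matched different senders in the two runs."""
    return [i for i, (a, b) in enumerate(zip(order_a, order_b)) if a != b]


run_a = [1, 2, 3]  # rank 0 received from ranks 1, 2, 3 in this order
run_b = [1, 3, 2]  # same program, different arrival order

distance = jaccard_distance(event_graph(run_a), event_graph(run_b))
hot_spots = divergent_receives(run_a, run_b)
```

In this toy case the two runs share three of seven distinct edges, so `distance` is 4/7, and `hot_spots` is `[1, 2]`, the receives where the runs diverge. A production tool would need richer graph comparison (the publications listed below mention graph kernels) and far more scalable data structures.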

Project Outputs

Journal articles (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
A Research-Based Course Module to Study Non-determinism in High Performance Applications
ANACIN-X: A software framework for studying non-determinism in MPI applications
  • DOI:
    10.1016/j.simpa.2021.100151
  • Year published:
    2021
  • Journal:
  • Impact factor:
    0
  • Authors:
    Bell, Patrick;Suarez, Kae;Chapp, Dylan;Tan, Nigel;Bhowmick, Sanjukta;Taufer, Michela
  • Corresponding author:
    Taufer, Michela
A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance Computing
Identifying Degree and Sources of Non-Determinism in MPI Applications Via Graph Kernels

Other publications by Michela Taufer

Enhancing Scientific Research with FAIR Digital Objects in the National Science Data Fabric
  • DOI:
  • Year published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Michela Taufer;Heberth Martinez;Jakob Luettgau;Lauren Whitnah;G. Scorzelli;P. Newell;Aashish Panta;P. Bremer;Douglas Fils;Christine R. Kirkpatrick;V. Pascucci;Kathryn Mohror;J. Shalf
  • Corresponding author:
    J. Shalf
Integrating FAIR Digital Objects (FDOs) into the National Science Data Fabric (NSDF) to Revolutionize Dataflows for Scientific Discovery
  • DOI:
  • Year published:
  • Journal:
  • Impact factor:
    0
  • Authors:
    Michela Taufer;Heberth Martinez;Jakob Luettgau;Lauren Whitnah;Giorgio Scorzelli;Pania Newel;Aashish Panta;Timo Bremer;Doug Fils;Christine R. Kirkpatrick;Nina McCurdy;V. Pascucci
  • Corresponding author:


Other grants by Michela Taufer

EAGER: A Comprehensive Approach for Generating, Sharing, Searching, and Using High-Resolution Terrain Parameters
  • Award number:
    2334945
  • Fiscal year:
    2023
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
Collaborative Research: SHF: Small: Model-driven Design and Optimization of Dataflows for Scientific Applications
  • Award number:
    2331152
  • Fiscal year:
    2023
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
SHF: Small: Methods, Workflows, and Data Commons for Reducing Training Costs in Neural Architecture Search on High-Performance Computing Platforms
  • Award number:
    2223704
  • Fiscal year:
    2022
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
Collaborative Research: Elements: SENSORY: Software Ecosystem for kNowledge diScOveRY - a data-driven framework for soil moisture applications
  • Award number:
    2103845
  • Fiscal year:
    2021
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
Collaborative Research: PPoSS: Planning: Performance Scalability, Trust, and Reproducibility: A Community Roadmap to Robust Science in High-throughput Applications
  • Award number:
    2028923
  • Fiscal year:
    2020
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
Collaborative Research: EAGER: Advancing Reproducibility in Multi-Messenger Astrophysics
  • Award number:
    2041977
  • Fiscal year:
    2020
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
Collaborative: EAGER: Exploring and Advancing the State of the Art in Robust Science in Gravitational Wave Physics
  • Award number:
    1841399
  • Fiscal year:
    2018
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
Collaborative: EAGER: Exploring and Advancing the State of the Art in Robust Science in Gravitational Wave Physics
  • Award number:
    1823372
  • Fiscal year:
    2018
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
SHF:Medium:Collaborative Research:A comprehensive methodology to pursue reproducible accuracy in ensemble scientific simulations on multi- and many-core platforms
  • Award number:
    1841552
  • Fiscal year:
    2018
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
BIGDATA: IA: Collaborative Research: In Situ Data Analytics for Next Generation Molecular Dynamics Workflows
  • Award number:
    1841758
  • Fiscal year:
    2018
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant

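Similar NSFC (National Natural Science Foundation of China) grants

Plasmon-enhanced optical response in composite low-dimensional topological materials
  • Award number:
    12374288
  • Year approved:
    2023
  • Award amount:
    CNY 520,000
  • Project type:
    General Program
The disappearing medium-sized enterprise from the perspective of managing markets and intervening in the division of labor: stylized facts, underlying mechanisms, and optimization paths
  • Award number:
    72374217
  • Year approved:
    2023
  • Award amount:
    CNY 410,000
  • Project type:
    General Program
Multiscale algorithms and numerical simulation of plasma in tokamak divertors
  • Award number:
    12371432
  • Year approved:
    2023
  • Award amount:
    CNY 435,000
  • Project type:
    General Program
Dark matter distribution near intermediate-mass black holes and detection of gravitational-wave echoes from IMRI systems
  • Award number:
    12365008
  • Year approved:
    2023
  • Award amount:
    CNY 320,000
  • Project type:
    Regional Science Fund Program
Physical mechanisms of rapid intensification of asymmetric tropical cyclones under moderate vertical wind shear
  • Award number:
    42305004
  • Year approved:
    2023
  • Award amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund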

Similar overseas grants

Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
  • Award number:
    2403134
  • Fiscal year:
    2024
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Award number:
    2402804
  • Fiscal year:
    2024
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
  • Award number:
    2403408
  • Fiscal year:
    2024
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Toward Understandability and Interpretability for Neural Language Models of Source Code
  • Award number:
    2423813
  • Fiscal year:
    2024
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Award number:
    2402806
  • Fiscal year:
    2024
  • Award amount:
    approx. $915,700
  • Project type:
    Standard Grant