SHF: Medium: Collaborative Research: ANACIN-X: Analysis and modeling of Nondeterminism and Associated Costs in eXtreme scale applications

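Basic information

  • Award number:
    1900765
  • Principal investigator:
  • Amount:
    $316,200
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2019
  • Funding country:
    United States
  • Duration:
    2019-08-01 to 2025-07-31
  • Project status:
    Ongoing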

Project abstract

Nondeterminism (i.e., the tendency of a scientific application to exhibit different numerical results and execution patterns across multiple executions) is an increasingly entrenched property of high performance computing (HPC) applications as the scientific community moves its simulations onto larger and highly heterogeneous computing systems. Nondeterminism can drastically increase the cost of scientific reproducibility in terms of developer time and computational resources, of debugging applications when moving from a smaller to a larger scale or from one platform to another, and of ensuring fault tolerance when executions may need to recover from a system fault. These three challenges can ultimately compromise the amount and quality of scientific discovery through computer simulations. Tools for addressing aspects of the nondeterminism problem have emerged, including record-and-replay (R&R) techniques, which monitor and record changes in program state during one execution of an application (the recorded execution) and reproduce those changes, and thus the behavior of the application, during a subsequent execution (the replayed execution). However, these tools impose overheads on the underlying application and thus force HPC users to balance tool utility against tool overhead. HPC users may opt not to use a tool at all rather than deal with unpredictable overheads. This project supports HPC users by modeling the relationship between application nondeterminism and variability in tool overhead, and uses this knowledge to identify hot spots of tool cost as well as regions of executions that trigger nondeterministic behavior in the applications. The aim of the project is to model nondeterministic executions by determining points (motifs) of nondeterminism in executions of HPC applications, and to apply the motif modeling to R&R techniques in order to study the cost that certain motifs impose on those techniques.
The outcome of this project impacts four communities: application developers, through the identification and management of sources of unintended nondeterminism; the HPC research community working on fault tolerance, resilience, and reproducibility at exascale; data center administrators who use evaluation tools for and with application developers; and educators and trainers in resource-constrained environments, who can promote HPC without needing access to high-end, expensive computers.
This project advances the study of nondeterministic HPC applications by studying the recording costs of R&R tools and by defining strategies so that these tools can scale to the exascale domain. In addition to the more commonly studied factors of time and memory overhead, the project integrates power usage into the modeling. The project relies on graph theory to develop expressive and scalable graph-based representations of the dependencies between events in a program, and develops algorithms to identify motifs in the graph that indicate points of nondeterminism. These motifs are applied to quantify the associated costs of nondeterminism, including developing metrics to measure dissimilarities between different executions, modeling the costs of recording executions, and assessing the overhead of recordings. Based on these motifs, the project generates 'fingerprints' (i.e., holistic characterizations of how and where nondeterminism manifests during application executions) of real-world HPC applications, including (1) N-body problems (e.g., simulating particle, atomic, and planetary interactions); (2) graph analytics (e.g., the Graph500 benchmark); (3) bioinformatics (e.g., mpiBLAST); and (4) task-based data analysis applications (e.g., WordCount, Join, and Octree Clustering on top of MapReduce-over-MPI frameworks). The fingerprints illuminate previously overlooked similarities between the nondeterminism that manifests across multiple classes of applications and allow users to probe the relationship between process communication patterns, the motifs of the actual resulting executions, and the regions of those executions in which tool overhead accumulates for nondeterministic HPC applications.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
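As a rough illustration of the graph-based representation and the execution-dissimilarity metrics mentioned in the abstract, the Python sketch below builds a toy event graph from the order in which one receiver matches incoming messages and compares two runs with a Jaccard distance over their edge sets. This is not the project's method: the function names `build_event_graph` and `jaccard_distance`, the input format, and the choice of Jaccard distance are all assumptions made for illustration, and the sketch assumes each sender sends at most one message to each receiver.

```python
def build_event_graph(recv_order):
    """Build a toy event graph as a set of directed edges.

    recv_order (hypothetical format) maps a receiver rank to the list of
    sender ranks in the order their messages were matched in one run.
    Assumes each sender sends at most one message per receiver.
    """
    edges = set()
    for receiver, senders in recv_order.items():
        prev = None
        for slot, sender in enumerate(senders):
            recv_node = ("recv", receiver, slot)
            send_node = ("send", sender, receiver)
            # Message edge: the send happens before the receive that matches it.
            edges.add((send_node, recv_node))
            # Program-order edge: consecutive receives on the same rank.
            if prev is not None:
                edges.add((prev, recv_node))
            prev = recv_node
    return edges


def jaccard_distance(edges_a, edges_b):
    """Return 1 - |A & B| / |A | B|, or 0.0 when both edge sets are empty."""
    union = edges_a | edges_b
    if not union:
        return 0.0
    return 1.0 - len(edges_a & edges_b) / len(union)


if __name__ == "__main__":
    # Two runs of the same program; rank 0 matches messages in a different order.
    run_a = build_event_graph({0: [1, 2, 3]})
    run_b = build_event_graph({0: [1, 3, 2]})
    print(jaccard_distance(run_a, run_a))
    print(jaccard_distance(run_a, run_b))
```

In this toy setup, two runs with the same matching order have distance 0, and swapping the arrival order of two messages changes only the message edges, so the distance grows with the number of receives matched differently.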

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Sanjukta Bhowmick

Collaborative Research: CCRI: Planning: A Multilayer Network (MLN) Community Infrastructure for Data, Interaction, Visualization, and softwarE (MLN-DIVE)
  • Award number:
    2120414
  • Fiscal year:
    2021
  • Funding amount:
    $316,200
  • Project type:
    Standard Grant
Collaborative Research: Framework Implementations: CSSI: CANDY: Cyberinfrastructure for Accelerating Innovation in Network Dynamics
  • Award number:
    2104076
  • Fiscal year:
    2021
  • Funding amount:
    $316,200
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: NetSplicer: Scalable Decoupling-based Algorithms for Multilayer Network Analysis
  • Award number:
    1956373
  • Fiscal year:
    2020
  • Funding amount:
    $316,200
  • Project type:
    Standard Grant
XPS: EXPL: FP: Collaborative Research: SPANDAN: Scalable Parallel Algorithms for Network Dynamics Analysis
  • Award number:
    1924486
  • Fiscal year:
    2018
  • Funding amount:
    $316,200
  • Project type:
    Standard Grant
SPX: Collaborative Research: SANDY: Sparsification-Based Approach for Analyzing Network Dynamics
  • Award number:
    1916084
  • Fiscal year:
    2018
  • Funding amount:
    $316,200
  • Project type:
    Continuing Grant
SPX: Collaborative Research: SANDY: Sparsification-Based Approach for Analyzing Network Dynamics
  • Award number:
    1725566
  • Fiscal year:
    2017
  • Funding amount:
    $316,200
  • Project type:
    Continuing Grant
XPS: EXPL: FP: Collaborative Research: SPANDAN: Scalable Parallel Algorithms for Network Dynamics Analysis
  • Award number:
    1533881
  • Fiscal year:
    2015
  • Funding amount:
    $316,200
  • Project type:
    Standard Grant

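Similar NSFC (National Natural Science Foundation of China) grants

Plasmon-enhanced optical response in composite low-dimensional topological materials
  • Award number:
    12374288
  • Approval year:
    2023
  • Funding amount:
    520,000 CNY
  • Project type:
    General Program
Physical mechanisms of rapid intensification of asymmetric tropical cyclones under moderate vertical wind shear
  • Award number:
    42305004
  • Approval year:
    2023
  • Funding amount:
    300,000 CNY
  • Project type:
    Young Scientists Fund
Development of a source apportionment method for atmospheric semi-/intermediate-volatility organic compounds based on volatility distribution and oxidation correction
  • Award number:
    42377095
  • Approval year:
    2023
  • Funding amount:
    490,000 CNY
  • Project type:
    General Program
Quantum surface plasmons of medium-sized metal nanoparticles studied with machine learning and classical electrodynamics
  • Award number:
    22373002
  • Approval year:
    2023
  • Funding amount:
    500,000 CNY
  • Project type:
    General Program
Multiscale algorithms and numerical simulation of plasma in tokamak divertors
  • Award number:
    12371432
  • Approval year:
    2023
  • Funding amount:
    435,000 CNY
  • Project type:
    General Program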

Similar overseas grants

Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
  • Award number:
    2403134
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Award number:
    2402804
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
  • Award number:
    2403408
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Toward Understandability and Interpretability for Neural Language Models of Source Code
  • Award number:
    2423813
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Award number:
    2402806
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Project type:
    Standard Grant