Collaborative Research: SHF: Small: Learning Fault Tolerance at Scale

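Basic Information

  • Award number:
    2135309
  • Principal investigator:
  • Amount:
    $300,000
  • Host institution:
  • Host institution country:
    United States
  • Project category:
    Standard Grant
  • Fiscal year:
    2022
  • Funding country:
    United States
  • Project period:
    2022-01-01 to 2024-12-31
  • Project status:
    Completed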

Project Abstract

In computer-aided design and analysis of engineered systems such as automobiles or semiconductor chips, computational models are simulated on high-performance computers to characterize and evaluate key attributes. The sheer scale of such high-performance computing systems, e.g., over 20 billion transistors in Summit (one of the world's fastest supercomputers), increases the likelihood of transient hardware faults from events such as cosmic radiation or processor-chip voltage fluctuations. The likelihood of such errors and their negative impact are further increased because such simulations are typically long-running, and the corruption of a single data field or variable may require weeks to months of recomputation before critical decisions can be made. This project will develop automated approaches that make such applications tolerant of hardware faults. These applications are widely used not only across multiple industrial sectors but also to increase the predictive power of climate and weather models that aid critical decision making. Traditional fault-tolerant schemes are either application-specific, requiring significant programmer effort to redesign or customize large-scale software, or application-agnostic, where all or most data are redundantly stored at regular intervals to allow for recovery, which limits scalability because of the significant memory and processing overheads. This project seeks to address these limitations by providing a theoretical foundation for a new class of fault-tolerant schemes suitable for the broad array of applications based on iterative numerical simulations that evolve over time on discretized spatial domains.
This project is based on the premise that in such physics-based applications, the rate of change of the solution vector components across time steps (iterations) and spatial domains is a key metric for automatically identifying the critical computational variables, monitoring their evolution, and dynamically selecting the type of safeguarding techniques that should be applied. The investigators will pursue three key directions: (i) characterizing the intrinsic resiliency of the application by developing resiliency gradient metrics, (ii) developing and testing fault-tolerance schemes that adapt the level and type of protection to the resiliency gradient with the goal of reducing computational overheads and increasing scalability, and (iii) constructing an automatic online decision-based learning framework for adaptively selecting fault-tolerance methods in relation to the system's ability to use approximate computing and co-scheduling techniques. The investigators will also work closely with application and runtime system developers to seek broader use of this fault tolerance framework, develop specialized undergraduate and graduate curricula for student training, and offer research experiences to high school students. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outcomes

Journal papers (2)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Online Scheduling of Moldable Task Graphs under Common Speedup Models
Dynamic Selective Protection of Sparse Iterative Solvers via ML Prediction of Soft Error Impacts
  • DOI:
    10.1145/3624062.3624117
  • Publication year:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Chen, Zizhao; Verrecchia, Thomas; Sun, Hongyang; Booth, Joshua; Raghavan, Padma
  • Corresponding author:
    Raghavan, Padma

Other publications by Padma Raghavan

Multi-resource scheduling of moldable workflows

Other grants by Padma Raghavan

NSF I-Corps Hub (Track 1): Mid-South Region
  • Award number:
    2229521
  • Fiscal year:
    2023
  • Amount:
    $300,000
  • Project category:
    Cooperative Agreement
SHF: Small: Embedded Graph Software-Hardware Models and Maps for Scalable Sparse Computations
  • Award number:
    1719674
  • Fiscal year:
    2016
  • Amount:
    $300,000
  • Project category:
    Standard Grant
SHF: Small: Embedded Graph Software-Hardware Models and Maps for Scalable Sparse Computations
  • Award number:
    1319448
  • Fiscal year:
    2013
  • Amount:
    $300,000
  • Project category:
    Standard Grant
DC: Small: Adaptive Sparse Data Mining On Multicores
  • Award number:
    1017882
  • Fiscal year:
    2010
  • Amount:
    $300,000
  • Project category:
    Standard Grant
Toward a Linear Time Sparse Solver with Locality-Enhanced Scalable Parallelism
  • Award number:
    0830679
  • Fiscal year:
    2008
  • Amount:
    $300,000
  • Project category:
    Standard Grant
MRI: Acquisition of A Scalable Instrument for Discovery through Computing
  • Award number:
    0821527
  • Fiscal year:
    2008
  • Amount:
    $300,000
  • Project category:
    Standard Grant
CSR-SMA: Toward Model-Driven Multilevel Analysis and Optimization of Multicomponent Computer Systems
  • Award number:
    0720749
  • Fiscal year:
    2007
  • Amount:
    $300,000
  • Project category:
    Continuing Grant
Adaptive Software for Extreme-Scale Scientific Computing: Co-Managing Quality-Performance-Power Tradeoffs
  • Award number:
    0444345
  • Fiscal year:
    2004
  • Amount:
    $300,000
  • Project category:
    Standard Grant
Grant to Support Activities at the Eleventh SIAM Conference on Parallel Processing for Scientific Computing
  • Award number:
    0340869
  • Fiscal year:
    2003
  • Amount:
    $300,000
  • Project category:
    Standard Grant
Robust Limited Memory Hybrid Sparse Solvers
  • Award number:
    0102537
  • Fiscal year:
    2001
  • Amount:
    $300,000
  • Project category:
    Continuing Grant

Similar grants from the National Natural Science Foundation of China (NSFC)

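Research on highly integrated microwave/millimeter-wave antennas supporting two-dimensional millimeter-wave beam scanning
  • Award number:
    62371263
  • Approval year:
    2023
  • Amount:
    CNY 520,000
  • Project category:
    General Program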
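Study of tandem Heck/denitrogenative rearrangement reactions of hydrazones
  • Award number:
    22301211
  • Approval year:
    2023
  • Amount:
    CNY 300,000
  • Project category:
    Young Scientists Fund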
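Study of synergistic performance regulation and dendrite suppression mechanisms in aqueous zinc-ion batteries
  • Award number:
    52364038
  • Approval year:
    2023
  • Amount:
    CNY 330,000
  • Project category:
    Fund for Less Developed Regions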
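Pathogenic role and mechanism of TSPYL1 mutations in sudden infant death syndrome, studied with a human serotonergic neuron reporter system
  • Award number:
    82371176
  • Approval year:
    2023
  • Amount:
    CNY 490,000
  • Project category:
    General Program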
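Mechanism of FOXO3 m6A methylation-induced trophoblast senescence in the treatment of spontaneous abortion by the kidney-tonifying method
  • Award number:
    82305286
  • Approval year:
    2023
  • Amount:
    CNY 300,000
  • Project category:
    Young Scientists Fund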

Similar Overseas Grants

Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
  • Award number:
    2403134
  • Fiscal year:
    2024
  • Amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: SHF: Small: LEGAS: Learning Evolving Graphs At Scale
  • Award number:
    2331302
  • Fiscal year:
    2024
  • Amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: SHF: Small: LEGAS: Learning Evolving Graphs At Scale
  • Award number:
    2331301
  • Fiscal year:
    2024
  • Amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: SHF: Small: Efficient and Scalable Privacy-Preserving Neural Network Inference based on Ciphertext-Ciphertext Fully Homomorphic Encryption
  • Award number:
    2412357
  • Fiscal year:
    2024
  • Amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Award number:
    2402804
  • Fiscal year:
    2024
  • Amount:
    $300,000
  • Project category:
    Standard Grant