EAGER: Recomputation-Based Checkpointing for Sparse Matrices

Basic information

Project abstract

High-performance computing (HPC) is essential for maintaining the US international competitive edge and leadership in science, technology, engineering, and mathematics (STEM). Advances in HPC are vital to national interests because they provide infrastructure for scientific discovery that improves national health, prosperity, welfare, and defense. To solve large-scale scientific problems, HPC relies on an increasing number of nodes and components, which makes it more likely that a long-running computation will be interrupted by a failure before it completes. A critical technique for ensuring that computation completes is checkpointing. Checkpointing saves snapshots of the computation so that, when a failure occurs, the computation state can be restored from the last snapshot and execution can continue, rather than restarting from the beginning. The research in this project seeks to advance state-of-the-art checkpointing by making it significantly faster and lowering its cost. The project also plans to contribute to training the future workforce by exposing students to the mechanisms and inefficiencies of current checkpointing on NVMM and to the new in-place checkpointing approach. The project seeks to increase the participation of minority and under-represented groups and involves undergraduates in research.

Prior approaches to checkpointing rely on taking a snapshot of the system state (system-level checkpointing) or the application state (application-level checkpointing) and saving it to secondary non-volatile storage. With the advent of non-volatile main memory (NVMM), a new approach to checkpointing becomes possible. In contrast to traditional approaches, which store separate snapshots in separate secondary storage, the project uses a new approach in which checkpoints are constructed in place in the NVMM, using the working data structures of the application. This requires only minimal additional state beyond what the program already writes to memory, making checkpointing significantly faster and cheaper, which in turn enables further HPC scaling.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
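As a rough illustration of the traditional application-level checkpointing that the abstract contrasts with its in-place approach, the sketch below saves a snapshot to a file every few steps and resumes from the last snapshot after a failure. It is not the project's NVMM method; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter, and the toy running-sum workload are all assumptions made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the snapshot to a temp file, then atomically rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to stable storage before the rename
    os.replace(tmp_path, path)  # a reader sees the old or the new snapshot, never a partial one


def load_checkpoint(path):
    """Return the last saved snapshot, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Toy workload: sum 0..total_steps-1, checkpointing every `checkpoint_every` steps.

    `fail_at` (hypothetical, for demonstration) raises an error when that step is reached.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]  # one unit of work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` again with the same path after a failure redoes only the steps completed since the last snapshot. The cost this sketch pays on every checkpoint, serializing the full state and writing a separate copy to secondary storage, is the overhead the abstract aims to reduce.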

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Yan Solihin

Avoiding TLB Shootdowns Through Self-Invalidating TLB Entries
Analytically modeling the memory hierarchy performance of modern processor systems
  • Publication year: 2011
  • Impact factor: 0
  • Authors: Yan Solihin; Fang Liu
  • Corresponding author: Fang Liu
A study of personal authentication using pinna transfer functions and pinna images (耳介伝達関数および耳介画像を用いた個人認証についての検討)
  • Publication year: 2020
  • Impact factor: 0
  • Authors: Reem Elkhouly; Mohammad Alshboul; Akihiro Hayashi; Yan Solihin; Keiji Kimura; 井谷俊仁, 喜多俊輔, 梶川嘉延
  • Corresponding author: 井谷俊仁, 喜多俊輔, 梶川嘉延
Noise and background removal from handwriting images
Persistent Memory: Abstractions, Abstractions, and Abstractions
  • DOI: 10.1109/mm.2018.2885589
  • Publication year: 2019
  • Impact factor: 3.6
  • Authors: Yan Solihin
  • Corresponding author: Yan Solihin

Other grants of Yan Solihin

Collaborative Research: CSR: Medium: Scaling Secure Serverless Computing on Heterogeneous Datacenters
  • Award number: 2312206
  • Fiscal year: 2023
  • Funding amount: $236,200
  • Award type: Continuing Grant
Collaborative Research: CNS Core: Medium: Understanding and Strengthening Memory Security for Non-Volatile Memory
  • Award number: 2106629
  • Fiscal year: 2021
  • Funding amount: $236,200
  • Award type: Continuing Grant
Collaborative Research: PPoSS: Planning: Scaling Secure Serverless Computing on Heterogeneous Datacenters
  • Award number: 2028836
  • Fiscal year: 2020
  • Funding amount: $236,200
  • Award type: Standard Grant
SHF: Small: Collaborative Research: Efficient Memory Persistency for GPUs
  • Award number: 1908079
  • Fiscal year: 2019
  • Funding amount: $236,200
  • Award type: Standard Grant
CNS Core: Medium: Collaborative Research: Persistent memory objects for consistent sharing in Non-Volatile Main Memories
  • Award number: 1900724
  • Fiscal year: 2019
  • Funding amount: $236,200
  • Award type: Continuing Grant
EAGER: Recomputation-Based Checkpointing for Sparse Matrices
  • Award number: 1829142
  • Fiscal year: 2018
  • Funding amount: $236,200
  • Award type: Standard Grant
SI2-SSE: TLDS: Transactional Lock-Free Data Structures
  • Award number: 1740095
  • Fiscal year: 2017
  • Funding amount: $236,200
  • Award type: Standard Grant
SHF: Small: Towards a Versatile Analytical Modeling Toolset for Evaluating Memory Hierarchy Design
  • Award number: 1116540
  • Fiscal year: 2011
  • Funding amount: $236,200
  • Award type: Standard Grant
SHF: Small: Collaborative Research: Beyond Secure Processors - Securing Systems Against Hardware
  • Award number: 0915501
  • Fiscal year: 2009
  • Funding amount: $236,200
  • Award type: Standard Grant
CSR: Small: Efficient and Predictable Memory Hierarchies for High-Performance Embedded Systems
  • Award number: 0915503
  • Fiscal year: 2009
  • Funding amount: $236,200
  • Award type: Standard Grant

Similar overseas grants

EAGER: Recomputation-Based Checkpointing for Sparse Matrices
  • Award number: 1829142
  • Fiscal year: 2018
  • Funding amount: $236,200
  • Award type: Standard Grant