SHF: Small: Collaborative Research: A Parallel Graph-Based Paradigm for HPC Parallel File System Checkers
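Basic Information
- Award number: 1910747
- Principal investigator:
- Amount: $250,000
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2019
- Funding country: United States
- Period: 2019-07-15 to 2023-06-30
- Status: Completed
- Source:
- Keywords:
Project Abstract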
Modern high performance computing (HPC) platforms rely on large-scale parallel file systems to serve the data accesses of scientific applications. These parallel file systems often run on expensive hardware and are usually well maintained, but they may still experience failures and run into inconsistent states for various reasons (e.g., hardware faults, software bugs, configuration errors). When the state becomes inconsistent, a checking and repairing program called a checker is the last line of defense to bring the system back to consistency. Nevertheless, today's checkers are error-prone and time-consuming to run. As scale and complexity keep increasing, the situation will likely get worse. This project aims to enable scalable, high performance checking and repairing of widely used parallel file systems through a new parallel graph-based model. The success of this project will dramatically change how parallel file system checkers are used. Such an effort is a fundamental step towards building highly reliable future HPC parallel file systems for scientific discovery. In addition, this project integrates the research activities with education and outreach efforts to train a broadly inclusive and globally competitive science workforce. The project consists of three thrusts. The first task focuses on constructing a general graph-based metadata model to abstract key metadata and consistency rules; the second task focuses on efficiently retrieving metadata from real systems and instantiating metadata graphs; the third task focuses on building a graph-based consistency checking runtime engine that conducts the checking in parallel to gain scalable high performance. This includes constructing a generic graph structure for representing different file system metadata, extracting the consistency rules among metadata items for checking, and defining a set of interfaces to facilitate building the graph model for other file systems.
The project will explore compiling all consistency rules into a unified executable called a "blob", which can be run in parallel on all compute nodes, and optimizing the runtime graph engine to accommodate dependencies and achieve high performance. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
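As a rough illustration of the graph-based idea in the abstract, the following minimal Python sketch represents metadata as a graph, expresses consistency rules as functions over that graph, and runs the rules concurrently. It is not the project's design: the names MetadataGraph, rule_no_dangling_edges, rule_link_count, and check are hypothetical, the two rules are generic examples rather than rules taken from any real parallel file system, and a thread pool stands in for execution across compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class MetadataGraph:
    """Hypothetical metadata graph: nodes are metadata items, edges are references."""
    # node id -> attributes, e.g. {"kind": "dir"} or {"kind": "file", "nlink": 1}
    nodes: dict = field(default_factory=dict)
    # (parent id, child id) pairs, standing in for directory entries
    edges: list = field(default_factory=list)

    def in_degree(self, node_id):
        # Number of entries that reference this node.
        return sum(1 for _, child in self.edges if child == node_id)


def rule_no_dangling_edges(graph):
    """Assumed rule: every edge must connect two nodes that exist."""
    return [f"dangling entry {p}->{c}" for p, c in graph.edges
            if p not in graph.nodes or c not in graph.nodes]


def rule_link_count(graph):
    """Assumed rule: a file's recorded link count equals the entries naming it."""
    return [f"bad nlink on {n}" for n, attrs in graph.nodes.items()
            if attrs["kind"] == "file" and attrs["nlink"] != graph.in_degree(n)]


def check(graph, rules, workers=4):
    """Run independent, read-only rules concurrently and collect violations."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda rule: rule(graph), rules))
    return sorted(v for violations in results for v in violations)


# A tiny inconsistent graph: one entry points at a missing node,
# and file "a" records two links but is named only once.
graph = MetadataGraph(
    nodes={"root": {"kind": "dir"}, "a": {"kind": "file", "nlink": 2}},
    edges=[("root", "a"), ("root", "ghost")],
)
violations = check(graph, [rule_no_dangling_edges, rule_link_count])
print(violations)
```

Because each rule in this sketch only reads the graph, the rules can run in any order. Rules that depend on one another, which the abstract mentions as something the runtime engine must accommodate, would need explicit ordering that this sketch does not attempt.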
Project Outcomes
Journal articles (10)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Drill: Log-based Anomaly Detection for Large-scale Storage Systems Using Source Code Analysis
- DOI: 10.1109/ipdps54959.2023.00028
- Publication date: 2023-05
- Journal:
- Impact factor: 0
- Authors: Di Zhang; Chris Egersdoerfer; Tabassum Mahmud; Mai Zheng; Dong Dai
- Corresponding author: Di Zhang; Chris Egersdoerfer; Tabassum Mahmud; Mai Zheng; Dong Dai
On the Reproducibility of Bugs in File-System Aware Storage Applications
- DOI:
- Publication date: 2022
- Journal:
- Impact factor: 0
- Authors: Zhang, Duo; Mahmud, Tabassum; Gatla, Om Rameshwar; Han, Runzhou; Chen, Yong; Zheng, Mai
- Corresponding author: Zheng, Mai
A Study of Failure Recovery and Logging of High-Performance Parallel File Systems
- DOI: 10.1145/3483447
- Publication date: 2022-03
- Journal:
- Impact factor: 0
- Authors: Runzhou Han; Om Rameshwar Gatla; Mai Zheng; Jinrui Cao; Di Zhang; Dong Dai; Yong Chen; J. Cook
- Corresponding author: Runzhou Han; Om Rameshwar Gatla; Mai Zheng; Jinrui Cao; Di Zhang; Dong Dai; Yong Chen; J. Cook
PROV-IO: An I/O-Centric Provenance Framework for Scientific Data on HPC Systems
- DOI: 10.1145/3502181.3531477
- Publication date: 2022
- Journal:
- Impact factor: 0
- Authors: Han, Runzhou; Byna, Suren; Tang, Houjun; Dong, Bin; Zheng, Mai
- Corresponding author: Zheng, Mai
ConfD: Analyzing Configuration Dependencies of File Systems for Fun and Profit
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Mahmud, Tabassum; Gatla, Om R.; Zhang, Duo; Love, Carson; Bumann, Ryan; Zheng, Mai
- Corresponding author: Zheng, Mai
Other publications by Mai Zheng
On Failure Diagnosis of the Storage Stack
- DOI:
- Publication date: 2020
- Journal:
- Impact factor: 0
- Authors: Duo Zhang; Om Rameshwar Gatla; Runzhou Han; Mai Zheng
- Corresponding author: Mai Zheng
A command-level study of Linux kernel bugs
- DOI:
- Publication date: 2017
- Journal:
- Impact factor: 0
- Authors: Yiliang Shi; Danny Murillo; Simeng Wang; Jinrui Cao; Mai Zheng
- Corresponding author: Mai Zheng
A Cross-Layer Approach for Diagnosing Storage System Failures
- DOI:
- Publication date: 2020
- Journal:
- Impact factor: 0
- Authors: Duo Zhang; C. Gupta; Mai Zheng; A. Manzanares; F. Blagojevic; Cyril Guyot
- Corresponding author: Cyril Guyot
Emulating Realistic Flash Device Errors with High Fidelity
- DOI:
- Publication date: 2016
- Journal:
- Impact factor: 0
- Authors: Simeng Wang; Jinrui Cao; Danny V. Murillo; Yiliang Shi; Mai Zheng
- Corresponding author: Mai Zheng
Image Stitching of Scenes with Large Misregistration
- DOI: 10.1109/iccsit.2008.30
- Publication date: 2008
- Journal:
- Impact factor: 0
- Authors: Mai Zheng; Antai Guo; W. Zhong; Li Guo
- Corresponding author: Li Guo
Other grants by Mai Zheng
CAREER: Towards Full-Stack Crash Consistency
- Award number: 1943204
- Fiscal year: 2020
- Award amount: $250,000
- Award type: Continuing Grant
CRII: CSR: Towards Pinpointing the Root Causes of Failures in Flash-based Storage Systems
- Award number: 1855565
- Fiscal year: 2018
- Award amount: $250,000
- Award type: Standard Grant
SHF: Small: Collaborative Research: Uncovering Vulnerabilities in Parallel File Systems for Reliable High Performance Computing
- Award number: 1853714
- Fiscal year: 2018
- Award amount: $250,000
- Award type: Standard Grant
SHF: Small: Collaborative Research: Uncovering Vulnerabilities in Parallel File Systems for Reliable High Performance Computing
- Award number: 1717630
- Fiscal year: 2017
- Award amount: $250,000
- Award type: Standard Grant
CRII: CSR: Towards Pinpointing the Root Causes of Failures in Flash-based Storage Systems
- Award number: 1566554
- Fiscal year: 2016
- Award amount: $250,000
- Award type: Standard Grant
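Similar NSFC (National Natural Science Foundation of China) grants
Research and Application of Key Technologies for Cooperative Swarms of Small and Micro Unmanned Systems Based on Ultra-Wideband Technology
- Award number:
- Year approved: 2020
- Award amount: CNY 570,000
- Award type: General Program
Research on Interference Coordination Based on Cooperative Precoding in Heterogeneous Cloud Small-Cell Networks
- Award number: 61661005
- Year approved: 2016
- Award amount: CNY 300,000
- Award type: Regional Science Fund Project
Research on Novel Access Theory and Techniques for Dense Small-Cell Base Station Systems
- Award number: 61301143
- Year approved: 2013
- Award amount: CNY 240,000
- Award type: Young Scientists Fund
Experimental Study of ScFVCD3-9R Loaded with Bcl-6-Targeting Small Interfering RNA for the Treatment of EAMG
- Award number: 81072465
- Year approved: 2010
- Award amount: CNY 310,000
- Award type: General Program
Research on Sensor Networks Based on Small-World Networks
- Award number: 60472059
- Year approved: 2004
- Award amount: CNY 210,000
- Award type: General Program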
Similar overseas grants
Collaborative Research: SHF: Small: LEGAS: Learning Evolving Graphs At Scale
- Award number: 2331302
- Fiscal year: 2024
- Award amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: LEGAS: Learning Evolving Graphs At Scale
- Award number: 2331301
- Fiscal year: 2024
- Award amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: Efficient and Scalable Privacy-Preserving Neural Network Inference based on Ciphertext-Ciphertext Fully Homomorphic Encryption
- Award number: 2412357
- Fiscal year: 2024
- Award amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: Technical Debt Management in Dynamic and Distributed Systems
- Award number: 2232720
- Fiscal year: 2023
- Award amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: Quasi Weightless Neural Networks for Energy-Efficient Machine Learning on the Edge
- Award number: 2326895
- Fiscal year: 2023
- Award amount: $250,000
- Award type: Standard Grant