CAREER: Rethinking HPC Resilience in the Exascale Era
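Basic Information
- Award number: 1750503
- Principal investigator:
- Amount: $521,700
- Host institution:
- Host institution country: United States
- Project type: Continuing Grant
- Fiscal year: 2018
- Funding country: United States
- Duration: 2018-01-15 to 2019-11-30
- Project status: Completed
- Source:
- Keywords:
Project Abstract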
Resilience is one of the key exascale research challenges in high-performance computing (HPC). Because of much higher error rates, exascale supercomputers could make little progress in computations, or might generate incorrect results due to failures, rendering the exascale performance useless. The challenge is how to achieve complete HPC resilience at exascale without increasing the performance overhead, the power consumption, or the complexity of the underlying hardware. To this end, this research project designs and develops low-cost hardware/software cooperative techniques for HPC resilience in the exascale era.
This project involves four research goals: (1) low-cost soft error resilience for CPUs: intelligent compiler-architecture interaction can validate the absence of errors and perform fine-grained recovery, thus eliminating silent data corruption (SDC); (2) compiler-directed soft error resilience for commodity GPUs: it can remove the power-hungry error-correcting code (ECC) logic from the GPU register files without compromising their resilience; (3) lightweight nonvolatile memory (NVM) persistence: it can mitigate the overhead of traditional heavyweight HPC checkpointing and support whole-system persistence for applications without irrevocable operations; and (4) low-cost timing error resilience for aggressive voltage scaling, to maximize energy efficiency with a program correctness guarantee.
The resulting artifacts and technologies are expected to contribute to the nation's competitiveness by addressing the challenge of building reliable HPC systems. The research outcome impacts a broad range of disciplines that need correct computation results and therefore require reliable computing systems, ranging from embedded systems to the HPC cloud. Consequently, use of the proposed techniques will make the execution of current and emerging applications much more reliable, and therefore directly affect our way of life.
Four types of data will be generated from this research project: (1) algorithms and models, (2) software prototypes, (3) testing infrastructure, including simulators, evaluation benchmarks, and their traces, and (4) educational materials. All of our software tools will be open source and made available to the public, laboratories, and industry.
Project Outcomes
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
CommAnalyzer: Automated Estimation of Communication Cost and Scalability on HPC Clusters from Sequential Code
- DOI:
- Published: 2018
- Journal:
- Impact factor: 0
- Authors: Helal, Ahmed; Jung, Changhee; Feng, Wu-chun; Hanafy, Yasser
- Corresponding author: Hanafy, Yasser
Other publications by Changhee Jung
Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection
- DOI:
- Published: 2016
- Journal:
- Impact factor: 0
- Authors: Qingrui Liu; Changhee Jung; Dongyoon Lee; Devesh Tiwari
- Corresponding author: Devesh Tiwari
Adaptive execution techniques of parallel programs for multiprocessors
- DOI:
- Published: 2010
- Journal:
- Impact factor: 0
- Authors: Jaejin Lee; Jungho Park; Honggyu Kim; Changhee Jung; Daeseob Lim; Sang
- Corresponding author: Sang
CommAnalyzer: Automated Estimation of Communication Cost on HPC Clusters Using Sequential Code
- DOI:
- Published: 2017
- Journal:
- Impact factor: 0
- Authors: A. Helal; Changhee Jung; Wu; Y. Hanafy
- Corresponding author: Y. Hanafy
ProRace
- DOI: 10.1145/3093336.3037708
- Published: 2017
- Journal:
- Impact factor: 0
- Authors: Tong Zhang; Changhee Jung; Dongyoon Lee
- Corresponding author: Dongyoon Lee
Clover: Compiler Directed Lightweight Soft Error Resilience
- DOI: 10.1145/2670529.2754959
- Published: 2015
- Journal:
- Impact factor: 0
- Authors: Qingrui Liu; Changhee Jung; Dongyoon Lee; Devesh Tiwari
- Corresponding author: Devesh Tiwari
Other grants by Changhee Jung
Collaborative Research: CSR: Small: Caphammer: A New Security Exploit in Energy Harvesting Systems and its Countermeasures
- Award number: 2314681
- Fiscal year: 2023
- Funding amount: $521,700
- Project type: Continuing Grant
Collaborative Research: SHF: Small: Enabling Caches and GPUs for Energy Harvesting Systems
- Award number: 2153749
- Fiscal year: 2022
- Funding amount: $521,700
- Project type: Standard Grant
CAREER: Rethinking HPC Resilience in the Exascale Era
- Award number: 2001124
- Fiscal year: 2019
- Funding amount: $521,700
- Project type: Continuing Grant
SHF: Small: Compiler and Architectural Techniques for Soft Error Resilience
- Award number: 1527463
- Fiscal year: 2015
- Funding amount: $521,700
- Project type: Standard Grant
Similar overseas grants
PROTSENS Rethinking Alternative PROTein Extraction: Decoding SENsory-Protein Extraction Relationships
- Award number: EP/Z000785/1
- Fiscal year: 2024
- Funding amount: $521,700
- Project type: Fellowship
A Brave New World for Japanese Shakespeare Adaptations: Rethinking Shakespeare Studies through Adaptations
- Award number: 23K21920
- Fiscal year: 2024
- Funding amount: $521,700
- Project type: Grant-in-Aid for Scientific Research (B)
Care and Repair: Rethinking Contemporary Curation for Conditions of Crisis
- Award number: DP240102206
- Fiscal year: 2024
- Funding amount: $521,700
- Project type: Discovery Projects
Caring Communities 1800-present: Rethinking Children's Social Care
- Award number: MR/X034968/1
- Fiscal year: 2024
- Funding amount: $521,700
- Project type: Fellowship
High-rise landscapes: The afterlives of tower block 'failure' and rethinking urban futures
- Award number: MR/Y003586/1
- Fiscal year: 2024
- Funding amount: $521,700
- Project type: Fellowship