Collaborative Research: ITR/NGS: Deja Vu: Transparent Checkpointing and Migration of Parallel Codes Over Grid Infrastructures
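Basic information
- Award number: 0325182
- Principal investigator:
- Amount: $260,300
- Host institution:
- Host institution country: United States
- Project type: Continuing Grant
- Fiscal year: 2004
- Funding country: United States
- Period: 2004-04-15 to 2009-03-31
- Project status: Completed
- Source:
- Keywords: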
Project abstract
A daunting challenge is the evolution from today's computational Grid to a true cyberinfrastructure that seamlessly integrates resources ranging from small clusters in academic laboratories to the largest national supercomputing centers and provides ubiquitous access to high-performance computing, research instrumentation, data warehouses, and visualization. Realizing this future requires fundamental advances in transparent fault-recovery mechanisms that mask the component failures endemic to any large-scale computational resource. While previous generations of supercomputers engineered reliability into systems hardware, today's high-performance computing (HPC) environments are based on clusters of commercial off-the-shelf (COTS) components, with no systemic solution for the reliability of the resource as a whole. Engendering stability in ever-growing networked collections of cluster systems requires a software solution that provides reliable access to computing resources through transparent, efficient, and automatic checkpointing and recovery (CPR) mechanisms. This project aims to bring about this future through radically new approaches to longstanding problems in CPR and process migration by building an integrated system called Deja vu.
Deja vu provides (a) a transparent parallel checkpointing and recovery mechanism that recovers from any combination of systems failures without any modification to parallel applications, (b) a novel post-compiler analysis system that transparently captures application state, (c) a systems architecture that seamlessly integrates user-initiated and system-initiated checkpoints in a single framework, enabling the effective use of a wide variety of domain-specific knowledge, (d) novel runtime mechanisms for transparent incremental checkpointing that efficiently capture the least amount of state required to maintain global consistency, (e) a novel communications architecture that enables transparent migration of existing MPI/PVM codes without source-code modifications to either the application or the MPI/PVM libraries, (f) recoverable IO subsystems that can be tailored to specific storage environments, and (g) interfaces to and augmentation of the Globus Toolkit to effectively use the CPR and migration capabilities provided by this research.
The core CPR and migration facilities of Deja vu will be surrounded by management, security, and scheduling facilities that (a) integrate with local scheduling systems (e.g., OpenPBS) and accounting systems for site-specific accounting and refunding of lost compute cycles and (b) extend the Globus security architecture with fine-grained rights and dynamically created user accounts that allow the fluid resource control available under the Deja vu system to be fully exploited. The design goal of this project is not just to implement "point" solutions, but to build an integrated system that will constitute a fundamental component of both large-scale computing facilities and Grid infrastructures. Our research team (VT, PSC, ISR) has considerable experience in the design, development, deployment, and support of complete solutions.
Project outcomes
Journal articles: 0
Monographs: 0
Research awards: 0
Conference papers: 0
Patents: 0
Other publications by Nathan Stone
Mediation of Interleukin‐23 and Tumor Necrosis Factor–Driven Reactive Arthritis by Chlamydia‐Infected Macrophages in SKG Mice
- DOI: 10.1002/art.41653
- Publication year: 2021
- Journal:
- Impact factor: 13.3
- Authors: X. Romand; Xiao Liu; M. A. Rahman; Z. A. Bhuyan; C. Douillard; R. A. Kedia; Nathan Stone; D. Roest; Zi Huai Chew; A. Cameron; L. Rehaume; Aurélie Bozon; Mohammed Habib; C. Armitage; M. Nguyen; B. Favier; K. Beagley; M. Maurin; P. Gaudin; Ranjeny Thomas; T. Wells; A. Baillet
- Corresponding author: A. Baillet
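Similar NSFC (National Natural Science Foundation of China) grants
Differentiation, development, and function of IL-35-secreting suppressor cells (iTr35)
- Award number: 82173109
- Approval year: 2021
- Funding amount: CNY 550,000
- Project type: General Program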
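Molecular mechanism by which the wheat MATE transporter gene ITR regulates plant architecture
- Award number: 32001497
- Approval year: 2020
- Funding amount: CNY 240,000
- Project type: Young Scientists Fund
Mechanism by which iTr35 combined with Tr1 regulates inflammation and fibrosis in systemic sclerosis
- Award number: 82060300
- Approval year: 2020
- Funding amount: CNY 330,000
- Project type: Regional Science Fund
Molecular mechanism by which IL-35/iTr35 cells regulate asthma inflammatory subtypes
- Award number:
- Approval year: 2020
- Funding amount: CNY 550,000
- Project type: General Program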
Similar overseas grants
ITR Collaborative Research: Pervasively Secure Infrastructures (PSI): Integrating Smart Sensing, Data Mining, Pervasive Networking, and Community Computing
- Award number: 1404694
- Fiscal year: 2013
- Funding amount: $260,300
- Project type: Continuing Grant
ITR-SCOTUS: A Resource for Collaborative Research in Speech Technology, Linguistics, Decision Processes, and the Law
- Award number: 1139735
- Fiscal year: 2011
- Funding amount: $260,300
- Project type: Continuing Grant
ITR/NGS: Collaborative Research: DDDAS: Data Dynamic Simulation for Disaster Management
- Award number: 0963973
- Fiscal year: 2009
- Funding amount: $260,300
- Project type: Continuing Grant
ITR/NGS: Collaborative Research: DDDAS: Data Dynamic Simulation for Disaster Management
- Award number: 1018072
- Fiscal year: 2009
- Funding amount: $260,300
- Project type: Continuing Grant
ITR Collaborative Research: A Reusable, Extensible, Optimizing Back End
- Award number: 0838899
- Fiscal year: 2008
- Funding amount: $260,300
- Project type: Continuing Grant