Collaborative Research: ITR/NGS: Deja Vu: Transparent Checkpointing and Migration of Parallel Codes Over Grid Infrastructures

Basic Information

Project Abstract

A daunting challenge is the evolution from today's computational Grid to a true cyberinfrastructure that seamlessly integrates resources ranging from small clusters in academic laboratories to the largest national supercomputing centers and provides ubiquitous access to high performance computing, research instrumentation, data warehouses and visualization. Realizing this future requires fundamental advances in transparent fault recovery mechanisms to mask the component failures endemic to any large-scale computational resource. While previous generations of supercomputers engineered reliability into systems hardware, today's high performance computing (HPC) environments are based on clusters of COTS components, with no systemic solution for the reliability of the resource as a whole. Engendering stability in ever-growing networked collections of cluster systems requires a software solution that provides reliable access to computing resources through transparent, efficient, and automatic checkpointing and recovery (CPR) mechanisms.

This project aims to bring about this future through radically new approaches to longstanding problems in CPR and process migration by building an integrated system called Deja vu. Deja vu provides (a) a transparent parallel checkpointing and recovery mechanism that recovers from any combination of systems failures without any modification to parallel applications; (b) a novel post-compiler analysis system that transparently captures application state; (c) a systems architecture that seamlessly integrates user-initiated and system-initiated checkpoints in a single framework, enabling the effective use of a wide variety of domain-specific knowledge; (d) novel runtime mechanisms for transparent incremental checkpointing, to efficiently capture the least amount of state required to maintain global consistency; (e) a novel communications architecture that enables transparent migration of existing MPI/PVM codes without source-code modifications to either the application or the MPI/PVM libraries; (f) recoverable IO subsystems that can be tailored to specific storage environments; and (g) interfaces to and augmentation of the Globus Toolkit to effectively use the CPR and migration capabilities provided by this research.

The core CPR and migration facilities of Deja vu will be surrounded by management, security, and scheduling facilities that (a) integrate with local scheduling systems (e.g., OpenPBS) and accounting systems for site-specific accounting and refunding of lost compute cycles and (b) extend the Globus security architecture with fine-grained rights and dynamically created user accounts that allow the fluid resource control available under the Deja vu system to be fully exploited. The design goal of this project is to implement not just "point" solutions but an integrated system that will constitute a fundamental component of both large-scale computing facilities and Grid infrastructures. Our research team (VT, PSC, ISR) has considerable experience in the design, development, deployment and support of complete solutions.

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Srinidhi Varadarajan

Collaborative Research: ITR/NGS: Fast Wireless Network Simulation Using Spatio-Temporal Dilations
  • Grant number:
    0325410
  • Fiscal year:
    2004
  • Funding amount:
    --
  • Project type:
    Continuing Grant
Acquisition of a Large Scale Cluster for Research in Computational Sciences and Engineering
  • Grant number:
    0321066
  • Fiscal year:
    2003
  • Funding amount:
    --
  • Project type:
    Standard Grant
CAREER: Weaving a Code Tapestry: A Compiler Directed Framework for Scalable Network Emulation
  • Grant number:
    0133840
  • Fiscal year:
    2002
  • Funding amount:
    --
  • Project type:
    Continuing Grant

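Similar NSFC grants

Differentiation, development, and function of IL-35-secreting suppressor cells (iTr35)
  • Grant number:
  • Approval year:
    2021
  • Funding amount:
    CNY 550,000
  • Project type:
Mechanism by which iTr35 combined with Tr1 regulates inflammatory response and fibrotic lesions in systemic sclerosis
  • Grant number:
    82060300
  • Approval year:
    2020
  • Funding amount:
    CNY 330,000
  • Project type:
    Regional Science Fund Project
Molecular mechanism by which the wheat MATE transporter gene ITR participates in plant architecture regulation
  • Grant number:
    32001497
  • Approval year:
    2020
  • Funding amount:
    CNY 240,000
  • Project type:
    Young Scientists Fund Project
Molecular mechanism by which IL-35/iTr35 cells regulate asthma inflammatory subtypes
  • Grant number:
  • Approval year:
    2020
  • Funding amount:
    CNY 550,000
  • Project type:
    General Program
Molecular mechanism of iTR35-induced exhaustion of HBV-specific CTLs and its targeted intervention
  • Grant number:
    81672092
  • Approval year:
    2016
  • Funding amount:
    CNY 580,000
  • Project type:
    General Program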

Similar overseas grants

ITR Collaborative Research: Pervasively Secure Infrastructures (PSI): Integrating Smart Sensing, Data Mining, Pervasive Networking, and Community Computing
  • Grant number:
    1404694
  • Fiscal year:
    2013
  • Funding amount:
    --
  • Project type:
    Continuing Grant
ITR-SCOTUS: A Resource for Collaborative Research in Speech Technology, Linguistics, Decision Processes, and the Law
  • Grant number:
    1139735
  • Fiscal year:
    2011
  • Funding amount:
    --
  • Project type:
    Continuing Grant
ITR/NGS: Collaborative Research: DDDAS: Data Dynamic Simulation for Disaster Management
  • Grant number:
    1018072
  • Fiscal year:
    2009
  • Funding amount:
    --
  • Project type:
    Continuing Grant
ITR/NGS: Collaborative Research: DDDAS: Data Dynamic Simulation for Disaster Management
  • Grant number:
    0963973
  • Fiscal year:
    2009
  • Funding amount:
    --
  • Project type:
    Continuing Grant
ITR Collaborative Research: A Reusable, Extensible, Optimizing Back End
  • Grant number:
    0838899
  • Fiscal year:
    2008
  • Funding amount:
    --
  • Project type:
    Continuing Grant