Collaborative Research: Elements: VLCC-States: Versioned Lineage-Driven Checkpointing of Composable States
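Basic information
- Award number: 2411387
- Principal investigator:
- Amount: $300,000
- Host institution:
- Host institution country: United States
- Project category: Standard Grant
- Fiscal year: 2024
- Funding country: United States
- Duration: 2024-10-01 to 2027-09-30
- Project status: Ongoing
- Source:
- Keywords:
Project abstract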
Checkpointing is a fundamental pattern used by a variety of scientific applications at both small and large computing scales. Widely adopted for resilience purposes by long-running applications (i.e., checkpoint-restart), it has seen an explosion of additional use cases that directly help applications progress faster and reduce time-to-solution even in the absence of failures: adjoint computations (essential in financial modeling, weather prediction, computational fluid dynamics, seismic imaging, and control theory) need to capture a history of checkpoints in a forward pass, which are then revisited in a backward pass. Training artificial intelligence models, increasingly used by scientific applications, often results in trajectories that do not lead to convergence or may lead to undesirable patterns, prompting the need to backtrack to an earlier checkpoint of the learning model to try an alternative. Transfer learning and fine-tuning using a previous checkpoint of a learning model can be used to adapt the training more quickly, avoiding expensive training from scratch. Many other use cases are important in scientific computing: suspend-resume (e.g., to preempt a long-running job in favor of a higher priority job), migration (checkpoint on one machine, restart on another), debugging (replay a problematic code region to reproduce errors without starting from scratch), and reproducibility (checkpoint and compare intermediate data during repeated runs). Despite broad applicability, current state-of-the-art solutions lack the flexibility, performance, and scalability needed to address these scenarios efficiently. The Versioned Lineage-Driven Checkpointing of Composable States (VLCC-States) project aims to fill this gap. 
It will streamline the development and use of checkpointing patterns for scientific applications, which simplifies and improves the reusability of integration efforts across different communities, improves awareness of the multitude of checkpointing scenarios, reduces development effort and cost, and enables flexible customization to extract the best performance and scalability for the desired application scenario.
VLCC-States provides technical innovation in three areas. First, it introduces composable providers of intermediate states, which hide the complexity of capturing and assembling checkpoints of distributed data structures and their transformations across different modules and programming languages while optimizing their layout to eliminate redundancies, reduce sizes, and improve performance. Second, it provides multi-level co-optimized caching and prefetching techniques, which enable scalable management of the life cycle of checkpoints for interleavings of capture and reuse operations on heterogeneous storage stacks under concurrency. Third, it develops specialized checkpointing tools for large Artificial Intelligence models, with a focus on integration with PyTorch and DeepSpeed, to enable users to transparently take advantage of high-performance and scalable checkpointing using a familiar API. This project will engage partners in industry and national research laboratories to co-design VLCC-States, tune its capabilities, and evaluate its implementation. This project will undertake educational and broadening participation activities to improve community awareness and understanding of challenges in scientific data management.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles: 0
Monographs: 0
Research awards: 0
Conference papers: 0
Patents: 0
Other grants by M Mustafa Rafique
Collaborative Research: CNS Core: Medium:HardLambda: A new FaaS Abstraction for Cross-Stack Resource Management in Disaggregated Datacenters
- Award number: 2106635
- Fiscal year: 2021
- Award amount: $300,000
- Project category: Standard Grant
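Similar grants from the National Natural Science Foundation of China
Detailed study of the geochemical behavior of platinum-group elements during sulfide melt solidification in the Yangliuping super-large Cu-Ni-PGE deposit
- Award number: 42303019
- Year approved: 2023
- Award amount: CNY 300,000
- Project category: Young Scientists Fund
Mechanism of extraordinary rare-earth element enrichment in deep-sea sediments: a comparative study of rare-earth-rich sediments and water-rock experiments
- Award number: 42372116
- Year approved: 2023
- Award amount: CNY 530,000
- Project category: General Program
Mechanistic study of energy-metabolism regulation by the trace element vanadium for monitoring colorectal cancer treatment and suppressing metastasis
- Award number: 62305121
- Year approved: 2023
- Award amount: CNY 300,000
- Project category: Young Scientists Fund
Cloning, functional study, and breeding application of a novel major-effect QTL for magnesium accumulation in rice grain
- Award number: 32372095
- Year approved: 2023
- Award amount: CNY 500,000
- Project category: General Program
Key technologies for lunar-surface element detection mechanisms and payload implementation based on a controllable isotope neutron source
- Award number: 42374226
- Year approved: 2023
- Award amount: CNY 530,000
- Project category: General Program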
Similar overseas grants
Collaborative Research: Elements: Linking geochemical proxy records to crustal stratigraphic context via community-interactive cyberinfrastructure
- Award number: 2311092
- Fiscal year: 2023
- Award amount: $300,000
- Project category: Standard Grant
Collaborative Research: Elements: Lattice QCD software for nuclear physics on heterogeneous architectures
- Award number: 2311430
- Fiscal year: 2023
- Award amount: $300,000
- Project category: Standard Grant
Collaborative Research: Elements: ProDM: Developing A Unified Progressive Data Management Library for Exascale Computational Science
- Award number: 2311757
- Fiscal year: 2023
- Award amount: $300,000
- Project category: Standard Grant
Collaborative Research: FuSe: Monolithic 3D Integration (M3D) of 2D Materials-Based CFET Logic Elements towards Advanced Microelectronics
- Award number: 2329189
- Fiscal year: 2023
- Award amount: $300,000
- Project category: Standard Grant
Collaborative Research: Experimental and computational constraints on the isotope fractionation of Mossbauer-inactive elements in mantle minerals
- Award number: 2246686
- Fiscal year: 2023
- Award amount: $300,000
- Project category: Standard Grant