Collaborative Research: Elements: VLCC-States: Versioned Lineage-Driven Checkpointing of Composable States
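Basic information
- Award number: 2411387
- Principal investigator:
- Amount: $300,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2024
- Funding country: United States
- Period: 2024-10-01 to 2027-09-30
- Project status: Ongoing
- Source:
- Keywords:
Project abstract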
Checkpointing is a fundamental pattern used by a variety of scientific applications at both small and large computing scales. Widely adopted for resilience purposes by long-running applications (i.e., checkpoint-restart), it has seen an explosion of additional use cases that directly help applications progress faster and reduce time-to-solution even in the absence of failures: adjoint computations (essential in financial modeling, weather prediction, computational fluid dynamics, seismic imaging, and control theory) need to capture a history of checkpoints in a forward pass, which are then revisited in a backward pass. Training artificial intelligence models, increasingly used by scientific applications, often results in trajectories that do not lead to convergence or may lead to undesirable patterns, prompting the need to backtrack to an earlier checkpoint of the learning model to try an alternative. Transfer learning and fine-tuning using a previous checkpoint of a learning model can be used to adapt the training more quickly, avoiding expensive training from scratch. Many other use cases are important in scientific computing: suspend-resume (e.g., to preempt a long-running job in favor of a higher priority job), migration (checkpoint on one machine, restart on another), debugging (replay a problematic code region to reproduce errors without starting from scratch), and reproducibility (checkpoint and compare intermediate data during repeated runs). Despite broad applicability, current state-of-the-art solutions lack the flexibility, performance, and scalability needed to address these scenarios efficiently. The Versioned Lineage-Driven Checkpointing of Composable States (VLCC-States) project aims to fill this gap. 
It will streamline the development and use of checkpointing patterns for scientific applications, which simplifies and improves the reusability of integration efforts across different communities, improves awareness of the multitude of checkpointing scenarios, reduces development effort and cost, and enables flexible customization to extract the best performance and scalability for the desired application scenario.
VLCC-States provides technical innovation in three areas. First, it introduces composable providers of intermediate states, which hide the complexity of capturing and assembling checkpoints of distributed data structures and their transformations across different modules and programming languages while optimizing their layout to eliminate redundancies, reduce sizes, and improve performance. Second, it provides multi-level co-optimized caching and prefetching techniques, which enable scalable management of the life cycle of checkpoints for interleavings of capture and reuse operations on heterogeneous storage stacks under concurrency. Third, it develops specialized checkpointing tools for large Artificial Intelligence models, with a focus on integration with PyTorch and DeepSpeed, to enable users to transparently take advantage of high-performance and scalable checkpointing using a familiar API. This project will engage partners in industry and national research laboratories to co-design VLCC-States, tune its capabilities, and evaluate its implementation. This project will undertake educational and broadening participation activities to improve community awareness and understanding of challenges in scientific data management.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by M Mustafa Rafique
Optimization of data-intensive workflows in stream-based data processing models
- DOI: 10.1007/s11227-017-1991-0
- Published: 2017-03-08
- Journal:
- Impact factor: 0
- Authors: Saima Gulzar Ahmad; Chee Sun Liew; M Mustafa Rafique; Ehsan Ullah Munir
- Corresponding author: S. G. Ahmad
Other grants of M Mustafa Rafique
Collaborative Research: CNS Core: Medium: HardLambda: A new FaaS Abstraction for Cross-Stack Resource Management in Disaggregated Datacenters
- Award number: 2106635
- Fiscal year: 2021
- Amount: $300,000
- Project type: Standard Grant
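Similar grants from the National Natural Science Foundation of China
Reconstruction of stalagmite carbon isotope and trace element records for the MIS 2 stage from Remi Cave, western Hunan, based on modern monitoring
- Award number: 42371164
- Year approved: 2023
- Amount: CNY 510,000
- Project type: General Program
Construction of micro-nano structures in dual-phase Mg-Li alloys based on element segregation, and the mechanisms of strengthening and plasticization
- Award number: 52371093
- Year approved: 2023
- Amount: CNY 510,000
- Project type: General Program
Bonding characteristics of 4f electrons in lanthanide boron-based clusters and the mechanism of anomalous valence states of lanthanide elements
- Award number: 12304296
- Year approved: 2023
- Amount: CNY 300,000
- Project type: Young Scientists Fund
Mechanisms of fracturing-fluid-induced migration and evolution of chemical elements in shale, and heavy metal adsorption remediation
- Award number: 42307202
- Year approved: 2023
- Amount: CNY 300,000
- Project type: Young Scientists Fund
Quantitative retrieval of lunar iron and titanium based on Chang'e-5 samples
- Award number: 42303040
- Year approved: 2023
- Amount: CNY 300,000
- Project type: Young Scientists Fund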
Similar overseas grants
Collaborative Research: GEO-CM: The occurrences of the rare earth elements in highly weathered sedimentary rocks, Georgia kaolins.
- Award number: 2327659
- Fiscal year: 2023
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: Elucidating the roles of biogenic exudates in the cycling and uptake of rare earth elements
- Award number: 2221883
- Fiscal year: 2023
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: Elements: A Cyberlaboratory for Randomized Numerical Linear Algebra
- Award number: 2309446
- Fiscal year: 2023
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: Elements: FaaSr: Enabling Cloud-native Event-driven Function-as-a-Service Computing Workflows in R
- Award number: 2311123
- Fiscal year: 2023
- Amount: $300,000
- Project type: Standard Grant
American Urological Association 2023 Early Career Investigator Workshop
- Award number: 10754314
- Fiscal year: 2023
- Amount: $300,000
- Project type: