Collaborative Research: CNS Core: Medium: HardLambda: A new FaaS Abstraction for Cross-Stack Resource Management in Disaggregated Datacenters

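Basic Information

Award number:
2106635
Principal investigator:
M Mustafa Rafique
Amount:
$180,000
Institution:
Rochester Institute of Tech
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2021
Funding country:
United States
Period:
2021-06-01 to 2025-05-31
Project status:
Not yet concluded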

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2106635&HistoricalAwards=false
Keywords:
Collaborative Research CNS Core Medium

Project Abstract

Datacenters use computer servers that are no longer able to address the performance and scaling demands of emerging applications such as those in healthcare, smart infrastructure design, and high-speed physics. There is a fundamental mismatch between the capabilities of traditionally designed servers and the dynamic requirements of modern applications. This mismatch leads to poor utilization and significant waste of resources. A new way to design datacenters, called the disaggregated approach, can address this problem by creating a need-based on-demand model for computing. Here, servers are specialized to perform specific functions, and applications use only those specialized servers that best perform the functions needed by each application. While the disaggregated approach improves utilization and makes datacenters easier to manage, it comes at a performance cost: disaggregation requires applications to access critical resources spread across a set of specialized servers over the datacenter network. To mitigate such challenges of resource disaggregation, this project designs HardLambda, a new Function-as-a-Service (FaaS) abstraction that brings the functional and hardware requirements of an application together in a unified fashion. HardLambda enables datacenters to allocate resources in ways that best meet application needs while retaining the resource utilization and management flexibility of disaggregated hardware. The designed algorithms and system software will enable scalable control and sharing of disaggregated resources, and create new approaches to adaptive resource allocation. HardLambda will make disaggregated datacenters a viable and sustainable option for numerous applications in science and industry. The project especially targets machine and deep learning (ML/DL) applications due to their increasingly crucial role in many aspects of modern computing-powered life. 
At the same time, HardLambda will improve the sustainability of large-scale datacenters, where high utilization, efficiency, and continuous adaptation to application requirements are all essential factors. The research will create new knowledge on hardware and software co-designed FaaS systems and services, and yield insights for efficiently supporting ML/DL applications at extremely large scales. The project will engage with partners in industry and national research laboratories to deploy HardLambda in real systems and will undertake educational and broadening participation activities to improve community awareness and understanding of the scaling and sustainability challenges of large-scale computing infrastructure. Special emphasis will be given to engaging students from underrepresented groups in the research and educational activities. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outcomes

Journal articles (30)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

TIFF: Tokenized Incentive for Federated Learning.

DOI:
10.1109/cloud55607.2022.00064
Published:
2022-07
Journal:
Proceedings of the IEEE International Conference on Cloud Computing (CLOUD
Impact factor:
0
Authors:
Han, Jingoo;Khan, Ahmad Faraz;Zawad, Syed;Anwar, Ali;Angel, Nathalie Baracaldo;Zhou, Yi;Yan, Feng;Butt, Ali R.
Corresponding author:
Butt, Ali R.

SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training

DOI:
Published:
2023-02
Journal:
21st USENIX Conference on File and Storage Technologies (FAST
Impact factor:
0
Authors:
Khan, Redwan Ibne Seraj;Yazdani, Ahmad H.;Fu, Yuqi;Paul, Arnab K.;Ji, Bo;Jian, Xun;Cheng, Yue;Butt, Ali R.
Corresponding author:
Butt, Ali R.

Application-Attuned Memory Management for Containerized HPC Workflows

DOI:
Published:
2024-05
Journal:
IEEE International Parallel & Distributed Processing Symposium (IPDPS
Impact factor:
0
Authors:
Arif, Moiz;Maurya, Avinash;Rafique, M. Mustafa;Nikolopoulos, Dimitrios S.;Butt, Ali R.
Corresponding author:
Butt, Ali R.

Towards Efficient Python Interpreter for Tiered Memory Systems

DOI:
Published:
2024-02
Journal:
Poster and Work-in-Progress in Proceedings of the 21st USENIX Conference on File and Storage Technologies (FAST
Impact factor:
0
Authors:
Li, Yuze;Yao, Shunyu;Mobin, Jaiaid;Rafique, M. Mustafa;Nikolopoulos, Dimitrios;Sundararajah, Kirshanthan;Li, Huaicheng;Butt, Ali R.
Corresponding author:
Butt, Ali R.

Optimizing the Training of Co-Located Deep Learning Models Using Cache-Aware Staggering

DOI:
Published:
2023-12
Journal:
and Analytics (HiPC
Impact factor:
0
Authors:
Assogba, Kevin;Nicolae, Bogdan;Rafique, M. Mustafa
Corresponding author:
Rafique, M. Mustafa

Other publications by M Mustafa Rafique

Optimization of data-intensive workflows in stream-based data processing models

DOI:
10.1007/s11227-017-1991-0
Published:
2017-03-08
Journal:
The Journal of Supercomputing
Impact factor:
0
Authors:
Saima Gulzar Ahmad;Chee Sun Liew;M Mustafa Rafique;Ehsan Ullah Munir
Corresponding author:
S. G. Ahmad