Collaborative Research: CNS Core: Medium: HardLambda: A New FaaS Abstraction for Cross-Stack Resource Management in Disaggregated Datacenters

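Basic Information

Award number:
2106634
Principal investigator:
Ali Butt
Amount:
$420,000
Host institution:
Virginia Polytechnic Institute and State University
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2021
Funding country:
United States
Period of performance:
2021-06-01 to 2025-05-31
Project status:
Ongoing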

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2106634&HistoricalAwards=false
Keywords:
Collaborative Research CNS Core Medium

Project Abstract

Datacenters use computer servers that are no longer able to address the performance and scaling demands of emerging applications such as those in healthcare, smart infrastructure design, and high-speed physics. There is a fundamental mismatch between the capabilities of traditionally designed servers and the dynamic requirements of modern applications. This mismatch leads to poor utilization and significant waste of resources. A new way to design datacenters, called the disaggregated approach, can address this problem by creating a need-based on-demand model for computing. Here, servers are specialized to perform specific functions, and applications use only those specialized servers that best perform the functions needed by each application. While the disaggregated approach improves utilization and makes datacenters easier to manage, it comes at a performance cost: disaggregation requires applications to access critical resources spread across a set of specialized servers over the datacenter network. To mitigate such challenges of resource disaggregation, this project designs HardLambda, a new Function-as-a-Service (FaaS) abstraction that brings the functional and hardware requirements of an application together in a unified fashion. HardLambda enables datacenters to allocate resources in ways that best meet application needs while retaining the resource utilization and management flexibility of disaggregated hardware. The designed algorithms and system software will enable scalable control and sharing of disaggregated resources, and create new approaches to adaptive resource allocation. HardLambda will make disaggregated datacenters a viable and sustainable option for numerous applications in science and industry. The project especially targets machine and deep learning (ML/DL) applications due to their increasingly crucial role in many aspects of modern computing-powered life. 
At the same time, HardLambda will improve the sustainability of large-scale datacenters, where high utilization, efficiency, and continuous adaptation to application requirements are all essential factors. The research will create new knowledge on hardware and software co-designed FaaS systems and services, and yield insights for efficiently supporting ML/DL applications at extremely large scales. The project will engage with partners in industry and national research laboratories to deploy HardLambda in real systems and will undertake educational and broadening participation activities to improve community awareness and understanding of the scaling and sustainability challenges of large-scale computing infrastructure. Special emphasis will be given to engaging students from underrepresented groups in the research and educational activities. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outcomes

Journal papers (23)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Tokenized Incentive for Federated Learning.

DOI:
Publication date:
2022
Journal:
Verifiable and Auditable Federated Learning (FL-AAAI-22)
Impact factor:
0
Authors:
Han, Jingoo;Khan, Ahmad Faraz;Zawad, Syed;Anwar, Ali;Angel, Nathalie Baracaldo;Zhou, Yi;Butt, Ali R.
Corresponding author:
Butt, Ali R.

COLTI: Towards Concurrent and Co-located DNN Training and Inference

DOI:
10.1145/3588195.3595940
Publication date:
2023
Journal:
ACM
Impact factor:
0
Authors:
Mobin, Jaiaid;Maurya, Avinash;Rafique, M. Mustafa
Corresponding author:
Rafique, M. Mustafa

AI-driven Storage Resource Provisioning and Operations: Revisiting Old Assumptions and Meeting New Expectations.

DOI:
Publication date:
2022
Journal:
Proceedings of the ASCR Workshop on the Management and Storage of Scientific Data
Impact factor:
0
Authors:
Anantharaj, Valentine;da Silva, Rafael Ferreira;Butt, Ali R.;Oral, Sarp;Tiwari, Devesh
Corresponding author:
Tiwari, Devesh

Translation-optimized Memory Compression for Capacity

DOI:
10.1109/micro56248.2022.00073
Publication date:
2022-10
Journal:
2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)
Impact factor:
0
Authors:
Gagandeep Panwar;Muhammad Laghari;D. Bears;Yuqing Liu;Chandler Jearls;Esha Choukse;K. Cameron;A. Butt;Xun Jian
Corresponding author:
Gagandeep Panwar;Muhammad Laghari;D. Bears;Yuqing Liu;Chandler Jearls;Esha Choukse;K. Cameron;A. Butt;Xun Jian

SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training

DOI:
Publication date:
2023
Journal:
Impact factor:
0
Authors:
Redwan Ibne Seraj Khan;Ahmad Hossein Yazdani;Yuqi Fu;Arnab K. Paul;Bo Ji;Xun Jian;Yue Cheng;A. R. Butt
Corresponding author:
Redwan Ibne Seraj Khan;Ahmad Hossein Yazdani;Yuqi Fu;Arnab K. Paul;Bo Ji;Xun Jian;Yue Cheng;A. R. Butt