Collaborative Research: PPoSS: LARGE: ScaleStuds: Foundations for Correctness Checkability and Performance Predictability of Systems at Scale

Basic Information

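Award number:
2119348
Principal investigator:
Cindy Rubio Gonzalez
Amount:
$625,000
Host institution:
University of California-Davis
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2021
Funding country:
United States
Project period:
2021-10-01 to 2026-09-30
Project status:
Ongoing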

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2119348&HistoricalAwards=false
Keywords:
Collaborative Research PPoSS LARGE ScaleStuds

Project Abstract

In light of the limits of Moore's Law and Dennard scaling and the ever increasing computing demand, the last decade has seen unprecedented deployment scales; Google is known to run clusters with thousands of machines each, Apple deploys a total of 100,000 database machines, and Netflix runs tens of database clusters with 500 nodes each. This era of extreme-scale distributed systems has given birth to a new class of faults, "scalability faults" -- complex latent faults that are scale-dependent, whose symptoms surface in large-scale deployments but not necessarily in small/medium-scale deployments. Many fundamental research questions are not answerable today. On correctness: How to detect bugs that only manifest under large scale through program analysis? How to test and reproduce various dimensions of system scales efficiently on one machine? How to prevent and fix scalability-related faults? On performance: How to reason about software performance on various heterogeneous devices? How to accurately predict performance of fine-grained tasks to reduce inaccuracies at the aggregate level and project performance to future architectures? Finally, in combination: How to answer all these questions for the larger connected ecosystem -- not just the individual software and hardware components -- and to eventually build future-generation systems that are reproducible and verifiable by construction with respect to correctness and performance at scale? The ScaleStuds project involves a team of ten researchers to develop the foundations of correctness checkability (CC) and performance predictability (PP) of systems at scale. The key principle of this project is to "check large with large" -- check large-scale systems with a large fleet of data, analysis, tests, learning, models, and proofs. The vision is to build an ecosystem of distributed "CC+PP-certified" software-software and -hardware interactions. 
The project is paving the vision one "floor" at a time, creating composable building blocks ("the studs"). The project first builds new mechanisms such as a scale-testing platform and a unified database of software program properties and hardware performance profiles exposing clear APIs. These studs then enable multi-dimensional automated scalability tests and program analysis and performance learning and prediction at various levels of the software/hardware stack. Ultimately all of these experiences are intended to lead to correct and performant cross-layer/service interactions and future design principles including reproducible- and verified-by-construction development methods. The project novelties include the advancement of debugging, testing, learning, and prediction methods to ensure correctness checkability and performance predictability of extreme-scale systems and applications both on classical hardware platforms and emerging ones; a unified data ecosystem of software/hardware properties and profiles that facilitates automated analyses via clear APIs; a multi-dimensional scale-testing framework that empowers the development of new large-scale unit-tests and program analysis; detailed device profiling and observation to enable large-scale performance learning/prediction and deliver lessons for learning/predicting the behavior of other devices and layers in an end-to-end hardware/software stack; and ultimately a clear definition of CC+PP-certifiability for today's systems and future verifiable/reproducible-by-construction development methods.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
