Collaborative Research: PPoSS: LARGE: ScaleStuds: Foundations for Correctness Checkability and Performance Predictability of Systems at Scale

Basic information

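  • Award number:
    2119348
  • Principal investigator:
  • Amount:
    $625,000
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Continuing Grant
  • Fiscal year:
    2021
  • Funding country:
    United States
  • Project period:
    2021-10-01 to 2026-09-30
  • Project status:
    Ongoing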

Project abstract

In light of the limits of Moore's Law and Dennard scaling and the ever-increasing computing demand, the last decade has seen unprecedented deployment scales; Google is known to run clusters with thousands of machines each, Apple deploys a total of 100,000 database machines, and Netflix runs tens of database clusters with 500 nodes each. This era of extreme-scale distributed systems has given birth to a new class of faults, "scalability faults" -- complex latent faults that are scale-dependent, whose symptoms surface in large-scale deployments but not necessarily in small/medium-scale deployments. Many fundamental research questions are not answerable today. On correctness: How to detect bugs that only manifest under large scale through program analysis? How to test and reproduce various dimensions of system scales efficiently on one machine? How to prevent and fix scalability-related faults? On performance: How to reason about software performance on various heterogeneous devices? How to accurately predict performance of fine-grained tasks to reduce inaccuracies at the aggregate level and project performance to future architectures? Finally, in combination: How to answer all these questions for the larger connected ecosystem -- not just the individual software and hardware components -- and to eventually build future-generation systems that are reproducible and verifiable by construction with respect to correctness and performance at scale? The ScaleStuds project involves a team of ten researchers to develop the foundations of correctness checkability (CC) and performance predictability (PP) of systems at scale. The key principle of this project is to "check large with large" -- check large-scale systems with a large fleet of data, analysis, tests, learning, models, and proofs. The vision is to build an ecosystem of distributed "CC+PP-certified" software-software and software-hardware interactions.
The project is paving the vision one "floor" at a time, creating composable building blocks ("the studs"). The project first builds new mechanisms such as a scale-testing platform and a unified database of software program properties and hardware performance profiles exposing clear APIs. These studs then enable multi-dimensional automated scalability tests, program analysis, and performance learning and prediction at various levels of the software/hardware stack. Ultimately all of these experiences are intended to lead to correct and performant cross-layer/service interactions and future design principles including reproducible- and verified-by-construction development methods. The project novelties include the advancement of debugging, testing, learning, and prediction methods to ensure correctness checkability and performance predictability of extreme-scale systems and applications both on classical hardware platforms and emerging ones; a unified data ecosystem of software/hardware properties and profiles that facilitates automated analyses via clear APIs; a multi-dimensional scale-testing framework that empowers the development of new large-scale unit tests and program analysis; detailed device profiling and observation to enable large-scale performance learning/prediction and deliver lessons for learning/predicting the behavior of other devices and layers in an end-to-end hardware/software stack; and ultimately a clear definition of CC+PP-certifiability for today's systems and future verifiable/reproducible-by-construction development methods. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Cindy Rubio Gonzalez

Collaborative Research: DOE/NSF Workshop on Correctness in Scientific Computing
  • Award number:
    2319663
  • Fiscal year:
    2023
  • Funding amount:
    $625,000
  • Award type:
    Standard Grant
CCRI: ENS: BugSwarm: Enhancing an Infrastructure and Dataset to Support the Software Engineering Research Community
  • Award number:
    2016735
  • Fiscal year:
    2020
  • Funding amount:
    $625,000
  • Award type:
    Standard Grant
CAREER: Understanding and Combating Numerical Bugs for Reliable and Efficient Software Systems
  • Award number:
    1750983
  • Fiscal year:
    2018
  • Funding amount:
    $625,000
  • Award type:
    Continuing Grant
CI-New: BugSwarm: A Large-Scale Repository of Replicable Defects, Tests, and Patches to Support the Software Engineering Research Community
  • Award number:
    1629976
  • Fiscal year:
    2016
  • Funding amount:
    $625,000
  • Award type:
    Standard Grant
CRII: SHF: Automatic Extraction of Error-Handling Specifications in Systems Software
  • Award number:
    1464439
  • Fiscal year:
    2015
  • Funding amount:
    $625,000
  • Award type:
    Standard Grant

Similar NSFC (National Natural Science Foundation of China) grants

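Seepage-stress-chemical coupling mechanism of ion-adsorption rare earths and optimization of leaching mining
  • Award number:
    52364012
  • Approval year:
    2023
  • Funding amount:
    CNY 320,000
  • Award type:
    Regional Science Fund Project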
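Molecular mechanisms by which cyclophilins regulate crop-aphid interactions
  • Award number:
    32301770
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund Project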
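Interface regulation and electromagnetic response mechanisms of multiphase microwave absorbers derived from metal-polyphenol networks
  • Award number:
    52302362
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund Project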
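Outcomes and feedback effects of workplace cyberloafing: an integrated study from the actor and observer perspectives
  • Award number:
    72302108
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund Project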
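Molecular mechanism by which EIF6 negatively regulates Dicer activity to promote EV71 replication
  • Award number:
    32300133
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund Project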

Similar overseas grants

Collaborative Research: PPoSS: Large: A Full-stack Approach to Declarative Analytics at Scale
  • Award number:
    2316161
  • Fiscal year:
    2023
  • Funding amount:
    $625,000
  • Award type:
    Continuing Grant
Collaborative Research: PPoSS: LARGE: Research into the Use and iNtegration of Data Movement Accelerators (RUN-DMX)
  • Award number:
    2316176
  • Fiscal year:
    2023
  • Funding amount:
    $625,000
  • Award type:
    Continuing Grant
Collaborative Research: PPoSS: Large: A Full-stack Approach to Declarative Analytics at Scale
  • Award number:
    2316158
  • Fiscal year:
    2023
  • Funding amount:
    $625,000
  • Award type:
    Continuing Grant
Collaborative Research: PPoSS: LARGE: Cross-layer Coordination and Optimization for Scalable and Sparse Tensor Networks (CROSS)
  • Award number:
    2316201
  • Fiscal year:
    2023
  • Funding amount:
    $625,000
  • Award type:
    Standard Grant
Collaborative Research: PPoSS: LARGE: Cross-layer Coordination and Optimization for Scalable and Sparse Tensor Networks (CROSS)
  • Award number:
    2316203
  • Fiscal year:
    2023
  • Funding amount:
    $625,000
  • Award type:
    Continuing Grant