Collaborative Research: CISE: Large: Cross-Layer Resilience to Silent Data Corruption
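Basic Information
- Award number: 2321489
- Principal investigator:
- Amount: $2.1875 million
- Host institution:
- Host institution country: United States
- Project type: Continuing Grant
- Fiscal year: 2023
- Funding country: United States
- Project period: 2023-10-01 to 2028-09-30
- Project status: Ongoing
- Source:
- Keywords:
Project Abstract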
Hyperscalers (i.e., large cloud service providers) are reporting frequent silent data corruptions (or SDCs) within their datacenter infrastructures. SDCs are software errors for which the only symptom is an incorrect result. Remarkably, SDCs at-scale exhibit error occurrence rates on the order of one thousand faults per one million devices. Meanwhile, hardware manufacturers strive to achieve one hundred and close to zero defective parts per million for the commercial and automotive domains, respectively. This discrepancy between manufacturers’ goals and hyperscalers’ observations suggests that SDCs are a real threat to the reliability of all modern computing systems, and by extension their security and sustainability. This project explores whether it is possible to cooperatively design testing, detection, and mitigation approaches for SDCs that minimize performance impact on software applications, as well as additional carbon footprint expenditures associated with manufacturing and running computing systems. The project’s key novelties include: (1) leveraging reoccurring computational primitives in software (e.g., matrix multiplication in popular machine learning applications) and modern special-purpose hardware (e.g., Artificial Intelligence processors) to design domain-specific SDC solutions; (2) exploiting the fact that SDC testing can be performed throughout a device’s lifetime in the datacenter rather than for a few seconds to minutes — a strict limitation on the manufacturing test floor; (3) considering sustainability and carbon footprint as a core design metric. This project’s core impact will be a critical improvement in reliability and security for the countless applications to which we entrust computing systems today. A secondary core impact is an improvement in the longevity of computing devices, which has significant positive implications for sustainable computing. The research team will also train students and work with industry partners. 
To address the SDC challenge, the research team pursues four synergistic research thrusts that cut across diverse domains: Silicon Devices, Computer Architecture, Software, and Algorithms. Within each thrust, the team will study the SDC challenge through the lenses of: Testing, Detection, Mitigation, and Security implications. Thrust 1 explores device-level testing through novel test pattern metrics and continuous scan test deployment. Thrust 2 studies system-level testing (improving error detection latency and test coverage and adapting tests to be more representative of datacenter workloads), core-specific testing, defect characterization, hardware support for testing and mitigation, and system security implications. Thrust 3 investigates software detection and mitigation through (partial) redundancy, appropriate scan and system-level test scheduling, test-application fusion (where applications test themselves), and software security hardening against defect-induced vulnerabilities. Thrust 4 pursues algorithmic detection and mitigations with a particular emphasis on enabling robust non-linear computation for important datacenter workloads, like neural networks. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Caroline Trippel
Concurrency and Security Verification in Heterogeneous Parallel Systems
- DOI:
- Year published: 2019
- Journal:
- Impact factor: 7.9
- Authors: Caroline Trippel
- Corresponding author: Caroline Trippel
TransForm: Formally Specifying Transistency Models and Synthesizing Enhanced Litmus Tests
- DOI: 10.1109/isca45697.2020.00076
- Year published: 2020
- Journal:
- Impact factor: 0
- Authors: Naorin Hossain; Caroline Trippel; M. Martonosi
- Corresponding author: M. Martonosi
NL2FOL: Translating Natural Language to First-Order Logic for Logical Fallacy Detection
- DOI: 10.48550/arxiv.2405.02318
- Year published: 2024
- Journal:
- Impact factor: 0
- Authors: Abhinav Lalwani; Lovish Chopra; Christopher Hahn; Caroline Trippel; Zhijing Jin; Mrinmaya Sachan
- Corresponding author: Mrinmaya Sachan
Exploring the Trisection of Software, Hardware, and ISA in Memory Model Design
- DOI:
- Year published: 2016
- Journal:
- Impact factor: 0
- Authors: Caroline Trippel; Yatin A. Manerkar; Daniel Lustig; Michael Pellauer; M. Martonosi
- Corresponding author: M. Martonosi
TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA
- DOI: 10.1145/3037697.3037719
- Year published: 2016
- Journal:
- Impact factor: 0
- Authors: Caroline Trippel; Yatin A. Manerkar; Daniel Lustig; Michael Pellauer; M. Martonosi
- Corresponding author: M. Martonosi
Other grants of Caroline Trippel
CAREER: Scalable Assurance via Verifiable Hardware-Software Contracts
- Award number: 2236855
- Fiscal year: 2023
- Funding amount: $2.1875 million
- Project type: Continuing Grant
Collaborative Research: SaTC: CORE: Medium: Systematic Detection Of and Defenses Against Next-Generation Microarchitectural Attacks
- Award number: 2153936
- Fiscal year: 2022
- Funding amount: $2.1875 million
- Project type: Continuing Grant
FMitF: Track II: Scaling Formal Hardware Security Verification with CheckMate from Research to Practice
- Award number: 2017863
- Fiscal year: 2020
- Funding amount: $2.1875 million
- Project type: Standard Grant
Similar NSFC (National Natural Science Foundation of China) grants
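Research on Highly Integrated Microwave/Millimeter-Wave Antennas Supporting Two-Dimensional Millimeter-Wave Beam Scanning
- Award number: 62371263
- Approval year: 2023
- Funding amount: CNY 520,000
- Project type: General Program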
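Study of Tandem Heck/Denitrogenative Rearrangement Reactions of Hydrazones
- Award number: 22301211
- Approval year: 2023
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund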
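Study of Synergistic Performance Regulation and Dendrite Suppression Mechanisms in Aqueous Zinc-Ion Batteries
- Award number: 52364038
- Approval year: 2023
- Funding amount: CNY 330,000
- Project type: Regional Science Fund Program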
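Pathogenic Role and Mechanism of TSPYL1 Mutations in Sudden Infant Death Syndrome, Studied Using a Human Serotonergic Neuron Reporter System
- Award number: 82371176
- Approval year: 2023
- Funding amount: CNY 490,000
- Project type: General Program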
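Mechanism of FOXO3 m6A Methylation-Induced Trophoblast Senescence in the Kidney-Tonifying Treatment of Spontaneous Abortion
- Award number: 82305286
- Approval year: 2023
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund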
Similar overseas grants
Collaborative Research: CISE: Large: Cross-Layer Resilience to Silent Data Corruption
- Award number: 2321492
- Fiscal year: 2023
- Funding amount: $2.1875 million
- Project type: Continuing Grant
Collaborative Research: CISE: Large: Integrated Networking, Edge System and AI Support for Resilient and Safety-Critical Tele-Operations of Autonomous Vehicles
- Award number: 2321531
- Fiscal year: 2023
- Funding amount: $2.1875 million
- Project type: Continuing Grant
Collaborative Research: Conference: 2023 CISE Education and Workforce PI and Community Meeting
- Award number: 2318593
- Fiscal year: 2023
- Funding amount: $2.1875 million
- Project type: Standard Grant
Collaborative Research: Conference: 2023 CISE Education and Workforce PI and Community Meeting
- Award number: 2318592
- Fiscal year: 2023
- Funding amount: $2.1875 million
- Project type: Standard Grant
Collaborative Research: CISE-MSI: RCBP-ED: CCRI: TechHouse Partnership to Increase the Computer Engineering Research Expansion at Morehouse College
- Award number: 2318703
- Fiscal year: 2023
- Funding amount: $2.1875 million
- Project type: Standard Grant