CAREER: Towards Gray-Fault Tolerant Cloud through Harnessing and Enhancing System Observability

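Basic information

  • Award number:
    1942794
  • Principal investigator:
  • Amount:
    $609,500
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2020
  • Funding country:
    United States
  • Period:
    2020-07-01 to 2023-04-30
  • Project status:
    Completed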

Project abstract

Cloud systems are crucial infrastructure for many of today's services. Ensuring that cloud software runs continuously without disruption is both vital and challenging. Decades of research have produced mature techniques to detect and mask faults in distributed systems, but these techniques often use a simple model that assumes a system component either works or completely stops. Numerous real-world cloud incidents, however, suggest that production cloud systems frequently experience gray failures: a degraded operational mode in which a system component appears to be working but is in fact severely impaired. Current solutions cannot deal with gray failures effectively. The overall objective of this proposal is to develop a holistic approach to detect, pinpoint, and diagnose gray failures in production cloud systems. To realize the objective, four synergistic research activities are proposed. Specifically, the project conducts a study of real-world gray failure cases in popular distributed systems, and measures and characterizes the observability of existing systems. The project then designs a novel hybrid analysis that automatically inserts report-generation hooks across the whole system stack to harness observability for detecting gray failures. To pinpoint the culprit component, the project further proposes algorithms to infer causality from the collected observations. Lastly, the project designs a runtime checking framework for increasing observability and diagnosing gray failures online. Gray failures are a common cause of cloud service outages, resulting in significant financial loss. This project can improve our understanding of gray failures and help detect and debug them, reducing their impact on the ubiquitous cloud infrastructure.
Software is becoming more distributed, with increasingly subtle failure modes. Observability, fault detection, and localization are critical skills for this paradigm shift but are rarely covered in existing curricula. This project addresses this educational gap through curriculum development and student training. It also promotes computer science education among underrepresented Baltimore high school students by organizing workshops for local high school students, in partnership with a non-profit organization, Code in the Schools, to showcase cloud and system failure concepts. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal papers (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Demystifying and Checking Silent Semantic Violations in Large Distributed Systems
Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
  • DOI:
  • Publication date:
    2020-12-31
  • Journal:
  • Impact factor:
    0
  • Authors:
    Sebastien Levy;Randolph Yao;Youjiang Wu;Yingnong Dang;Peng Huang;Zheng Mu;Pu Zhao;Tarun Ramani;N. Govindaraju;Xukun Li;Qingwei Lin;Gil Lapid Shafriri;Murali Chintalapati
  • Corresponding author:
    Murali Chintalapati
Understanding and dealing with hard faults in persistent memory systems
Understanding, Detecting and Localizing Partial Failures in Large System Software
RESIN: A Holistic Service for Dealing with Memory Leaks in Production Cloud Infrastructure
  • DOI:
    10.1016/j.jelechem.2020.114884
  • Publication date:
    2024-09-13
  • Journal:
  • Impact factor:
    4.5
  • Authors:
    Chang Lou;Congwei Chen;Peng Huang;Yingnong Dang;Si Qin;Xinsheng Yang;Xukun Li;Qingwei Lin;Murali Chintalapati;Microsoft Azure
  • Corresponding author:
    Microsoft Azure

Other publications by Peng Huang

Characterization and expression of HLysG2, a basic goose-type lysozyme from the human eye and testis.
  • DOI:
    10.1016/j.molimm.2010.10.008
  • Publication date:
    2024-09-14
  • Journal:
  • Impact factor:
    3.6
  • Authors:
    Peng Huang;Wen;Jun Xie;Xian;D. Jiang;Song;Long Yu
  • Corresponding author:
    Long Yu
Vibration Characteristics of Corn Combine Harvester with the Time-Varying Mass System under Non-Stationary Random Vibration
  • DOI:
    10.3390/agriculture12111963
  • Publication date:
    2022-11-21
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yanchun Yao;XiaoKe Li;Zihan Yang;Liang Li;Duanyang Geng;Peng Huang;Yongsheng Li;Zhenghe Song
  • Corresponding author:
    Zhenghe Song
NET Institute* www.NETinst.org
  • DOI:
  • Publication date:
  • Journal:
  • Impact factor:
    0
  • Authors:
    Peng Huang;M. Ceccagnoli;Chris Forman;D. J. Wu
  • Corresponding author:
    D. J. Wu
Metabolic reprogramming and redox adaptation in sorafenib-resistant leukemia cells: detected by untargeted metabolomics and stable isotope tracing analysis
  • DOI:
    10.1186/s40880-019-0362-z
  • Publication date:
    2019-04-04
  • Journal:
  • Impact factor:
    16.2
  • Authors:
    Xin You;Weiye Jiang;Wen;Hui Zhang;Tiantian Yu;Jingyu Tian;S. Wen;G. Garcia;Peng Huang;Yumin Hu
  • Corresponding author:
    Yumin Hu
[Antitumor activity of lycorine in renal cell carcinoma ACHN cell line and its mechanism].

Other grants by Peng Huang

CAREER: Towards Gray-Fault Tolerant Cloud through Harnessing and Enhancing System Observability
  • Award number:
    2317751
  • Fiscal year:
    2023
  • Funding amount:
    $609,500
  • Project type:
    Continuing Grant
CNS Core: Small: Intelligent Fault Injection to Expose and Reproduce Production-Grade Bugs in Cloud Systems
  • Award number:
    2317698
  • Fiscal year:
    2023
  • Funding amount:
    $609,500
  • Project type:
    Standard Grant
FMitF: Track I: Synthesizing Semantic Checkers for Runtime Verification of Production Distributed Systems
  • Award number:
    2318937
  • Fiscal year:
    2023
  • Funding amount:
    $609,500
  • Project type:
    Standard Grant
CNS Core: Small: Intelligent Fault Injection to Expose and Reproduce Production-Grade Bugs in Cloud Systems
  • Award number:
    2149664
  • Fiscal year:
    2021
  • Funding amount:
    $609,500
  • Project type:
    Standard Grant
CRII: CSR: Toward Understanding and Automatically Detecting Specious Configuration in Large Systems
  • Award number:
    1755737
  • Fiscal year:
    2018
  • Funding amount:
    $609,500
  • Project type:
    Standard Grant

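Similar NSFC grants

Mechanism by which fibroblast-secreted TGFβ1 suppresses epithelium-directed infiltration of CD8+ T lymphocytes in the malignant transformation of oral leukoplakia, and targeted intervention
  • Award number:
    82301095
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Role and mechanism of fluid shear stress in the progression of thoracic aortic aneurysm to thoracic aortic dissection
  • Award number:
    12372315
  • Approval year:
    2023
  • Funding amount:
    CNY 530,000
  • Project type:
    General Program
Mechanism by which TEA domain transcription factor 2 regulates the transition of stem cells from a metastable state to ground-state pluripotency
  • Award number:
    32300466
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Mechanism of low-carbon, directed conversion of syngas to C4~C16 linear α-olefins over hydrophobic FexC-based catalysts
  • Award number:
    22302149
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Role and mechanism of the scaffold protein RanBP9 in promoting AKI-to-CKD transition by regulating cell-cycle arrest and SASP acquisition to mediate stress-induced senescence
  • Award number:
    82300777
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund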

Similar overseas grants

CAREER: Towards Gray-Fault Tolerant Cloud through Harnessing and Enhancing System Observability
  • Award number:
    2317751
  • Fiscal year:
    2023
  • Funding amount:
    $609,500
  • Project type:
    Continuing Grant
Towards Generating a Multimodal and Multivariate Classification Model from Imaging and Non-Imaging Measures for Accurate Diagnosis and Monitoring of Dementia in Parkinsons disease.
  • Award number:
    10241526
  • Fiscal year:
    2020
  • Funding amount:
    $609,500
  • Project type:
Comparison of tau-PET tracers: Progress towards a universal measure
  • Award number:
    10169910
  • Fiscal year:
    2020
  • Funding amount:
    $609,500
  • Project type:
Towards Generating a Multimodal and Multivariate Classification Model from Imaging and Non-Imaging Measures for Accurate Diagnosis and Monitoring of Dementia in Parkinsons disease.
  • Award number:
    10028103
  • Fiscal year:
    2020
  • Funding amount:
    $609,500
  • Project type:
Visceral fat, systemic inflammation and brain-tissue health: towards early detection and prevention of Alzheimer’s disease.
  • Award number:
    10410576
  • Fiscal year:
    2018
  • Funding amount:
    $609,500
  • Project type: