CCRI: ENS: Collaborative Research: Open Computer System Usage Repository and Analytics Engine

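Basic information

  • Award number:
    2016704
  • Principal investigator:
  • Amount:
    $1.1839 million
  • Institution:
  • Institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2020
  • Funding country:
    United States
  • Project period:
    2020-10-01 to 2023-09-30
  • Project status:
    Completed

Project abstract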

In science and engineering research, large-scale, centrally managed computing clusters, or “supercomputers,” have been instrumental in enabling the resource-intensive simulations, analyses, and visualizations used in computer-aided drug discovery, high-strength materials design for cars and jet engines, and disease vector analysis, to name a few. Such clusters are complex systems composed of several hundred to several thousand computer servers with fast network connections between them, various data storage resources, and highly optimized scientific software, all shared among several hundred researchers from diverse domains. Consequently, the overall dependability of such systems relies on the dependability of these individual, highly interconnected elements as well as on the characteristics of cascading failures. While computer systems researchers and practitioners have been at the forefront of designing and deploying dependable computing cluster systems, this task has been hampered by the lack of publicly available, real-world failure data from supercomputers currently in operation. Prior practice has largely involved tedious, manual collection and curation of small data sets for use in specific analyses. This project will establish seamless, automated pipelines for acquiring, processing, and curating continuous, detailed system usage, monitoring, and failure data from large computing clusters at two organizations, Purdue University and the University of Texas at Austin. This data will be disseminated through a publicly accessible portal and complemented by a suite of in-situ analytics capabilities that will support and spur research in dependable computing systems. The data acquisition pipeline and analytics software will be made open-source and designed for ease of federation, extension, and adoption on cluster systems operated by other organizations.

Cluster computing systems are a key resource in time-sensitive, computationally intensive research such as virus structure modeling and drug discovery, and have been at the forefront of efforts to tackle global pandemics. Both unanticipated system downtime and a lack of actionable feedback to researchers on computational failures can adversely affect research timeliness and efficiency. This project will allow the practitioners and administrators of these systems to develop data-backed best practices for ensuring high availability and utilization of their clusters. The resulting large, public data repository, consisting of data from clusters with diverse workloads spanning traditional high-performance computing, modern accelerator-based computing (for example, on graphics processing units (GPUs)), and cloud-style applications, will allow the systems research community to consider forward-looking research questions based on real system data. The project will train a cadre of students in data analysis on live production systems, giving them a unique learning experience in which they interface with a variety of stakeholders.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal articles (9)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
ORION and the Three Rights: Sizing, Bundling, and Prewarming for Serverless DAGs
Root Cause Analysis of Failures in Microservices through Causal Discovery
  • DOI:
  • Publication date:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    Azam Ikram;Sarthak Chakraborty;Subrata Mitra;S. Saini;S. Bagchi;Murat Kocaoglu
  • Corresponding author:
    Azam Ikram;Sarthak Chakraborty;Subrata Mitra;S. Saini;S. Bagchi;Murat Kocaoglu
AutoForecast: Automatic Time-Series Forecasting Model Selection
Closing-the-Loop: A Data-Driven Framework for Effective Video Summarization
  • DOI:
    10.1109/ism.2020.00042
  • Publication date:
    2020-12
  • Journal:
  • Impact factor:
    0
  • Authors:
    Ran Xu;Haoliang Wang;Stefano Petrangeli;Viswanathan Swaminathan;S. Bagchi
  • Corresponding author:
    Ran Xu;Haoliang Wang;Stefano Petrangeli;Viswanathan Swaminathan;S. Bagchi
SONIC: Application-aware Data Passing for Chained Serverless Applications
  • DOI:
  • Publication date:
    2021
  • Journal:
  • Impact factor:
    0
  • Authors:
    Ashraf Y. Mahgoub;K. Shankar;S. Mitra;Ana Klimovic;S. Chaterji;S. Bagchi
  • Corresponding author:
    Ashraf Y. Mahgoub;K. Shankar;S. Mitra;Ana Klimovic;S. Chaterji;S. Bagchi

Other publications by Saurabh Bagchi

A Survey Article on Wormhole Attack Detection and Security in Wireless Sensor Networks
  • DOI:
    10.5120/ijca2017915666
  • Publication date:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    Gaurav Tejpal;Sonal Sharma;Khalil;Issa;Saurabh Bagchi;N. Shroff;S. Krishnamurthy
  • Corresponding author:
    S. Krishnamurthy

Other grants held by Saurabh Bagchi

NSF Workshop on State-of-the-Art and Challenges in Resilience
  • Award number:
    2140139
  • Fiscal year:
    2021
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
NSF Workshop on State-of-the-Art and Challenges in Resilience
  • Award number:
    1845192
  • Fiscal year:
    2018
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
CI-NEW: Collaborative Research: Computer System Failure Data Repository to Enable Data-Driven Dependability
  • Award number:
    1513197
  • Fiscal year:
    2015
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
CSR: Small: Diagnosing Performance and Correctness Errors in Parallel Applications at Large Scales
  • Award number:
    1527262
  • Fiscal year:
    2015
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
CI-P: Computer System Failure Data Repository to Enable Data-Driven Dependability Research
  • Award number:
    1405906
  • Fiscal year:
    2014
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
NeTS: Medium: Collaborative Research: Tango: Performance and Fault Management in Cellular Networks through Device-Network Cooperation
  • Award number:
    1409506
  • Fiscal year:
    2014
  • Amount:
    $1.1839 million
  • Award type:
    Continuing Grant
Travel Grants for Attending the 29th IEEE Symposium on Reliable Distributed Systems (SRDS)
  • Award number:
    1047647
  • Fiscal year:
    2010
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
CSR: Small: Monitoring for Error Detection in Today's High Throughput Applications
  • Award number:
    0916337
  • Fiscal year:
    2009
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
NeTS-NOSS: Robust Sensor Network Architecture through Neighborhood Monitoring and Isolation
  • Award number:
    0626830
  • Fiscal year:
    2006
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
Sensors: Smart RF Antennas for Reliable and Real-Time Sensor Networks
  • Award number:
    0330016
  • Fiscal year:
    2003
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant

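Similar National Natural Science Foundation of China (NSFC) grants

Molecular mechanism by which the rice EnS150 gene regulates seed dormancy and germination
  • Award number:
    32301853
  • Approval year:
    2023
  • Amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund
Mechanism by which Clostridium sporogenes regulates ENPC autophagy through the "IPA-AHR-mTOR" axis in diabetic ENS reconstruction
  • Award number:
    82300616
  • Approval year:
    2023
  • Amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund
Mechanism by which lycopene improves intestinal motility through regulation of the gut microbiota/5-HT/ENS
  • Award number:
    32101968
  • Approval year:
    2021
  • Amount:
    CNY 240,000
  • Award type:
    Young Scientists Fund
Role and mechanism of MSC extracellular vesicles regulating the SETD2/H3K36 axis of ENPCs in diabetic ENS reconstruction
  • Award number:
    82100569
  • Approval year:
    2021
  • Amount:
    CNY 240,000
  • Award type:
    Young Scientists Fund
Mechanism by which lycopene improves intestinal motility through regulation of the gut microbiota/5-HT/ENS
  • Award number:
  • Approval year:
    2021
  • Amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund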

Similar overseas grants

Collaborative Research: Research Infrastructure: CCRI: ENS: Enhanced Open Networked Airborne Computing Platform
  • Award number:
    2235160
  • Fiscal year:
    2023
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
Collaborative Research: Research Infrastructure: CCRI: ENS: Enhanced Open Networked Airborne Computing Platform
  • Award number:
    2235157
  • Fiscal year:
    2023
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
Collaborative Research: Research Infrastructure: CCRI: ENS: Enhanced Open Networked Airborne Computing Platform
  • Award number:
    2235158
  • Fiscal year:
    2023
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
Collaborative Research: Research Infrastructure: CCRI: ENS: Enhanced Open Networked Airborne Computing Platform
  • Award number:
    2235159
  • Fiscal year:
    2023
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant
Collaborative Research: CCRI: ENS: Boa 2.0: Enhancing Infrastructure for Studying Software and its Evolution at a Large Scale
  • Award number:
    2120448
  • Fiscal year:
    2021
  • Amount:
    $1.1839 million
  • Award type:
    Standard Grant