CAREER: Avoiding Achilles' Heel in Exascale Computing with Distributed File Systems

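Basic Information

Award number:
1054974
Principal investigator:
Ioan Raicu
Amount:
$450,000
Awardee institution:
Illinois Institute of Technology
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2011
Funding country:
United States
Project period:
2011-01-01 to 2018-06-30
Project status:
Completed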

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1054974&HistoricalAwards=false
Keywords:
CAREER Avoiding Achilles Heel Exascale

Project Abstract

Exascale (i.e., 10^18 operations/sec) computers will enable the unraveling of significant scientific mysteries across many domains (e.g., weather modeling, national security, energy, and drug discovery). Predictions are that exascale will be reached in 2019, with millions of compute nodes and billions of threads of execution. The current state-of-the-art storage in high-end computing (HEC), in which storage is segregated from compute nodes and connected by a network (e.g., parallel filesystems), will not scale with the expected exponential growth in concurrency. At exascale, basic functionality (e.g., booting, checkpointing, metadata/data access) will perform poorly at high concurrency levels, and, combined with a system mean-time-to-failure measured in hours, this will lead to a performance collapse. The investigator envisions future HEC systems designed with non-volatile memory on every compute node, with every node actively participating in metadata and data management. This work aims to: 1) design, analyze, and implement a distributed data structure (D3) optimized for HEC, to be used for distributed metadata management; 2) design, analyze, and implement a distributed filesystem (FDFS) optimized for a subset of important high-performance computing (HPC) as well as many-task computing (MTC) workloads, and scalable to millions of nodes; and 3) evaluate the work with real workloads, applications, and simulations up to exascale. The results of this work have the potential to make exascale computing more tractable, touching virtually all disciplines in HEC and fueling scientific discovery and economic development at the national level. The HEC knowledge base will extend into commodity systems, as the fastest machines generally become mainstream systems in five to seven years. This work can also open doors for research in radical parallel programming paradigms (e.g., MTC) that rely on scalable storage infrastructure.
