Collaborative Research: OAC Core: Enabling Extremely Fine-grained Parallelism on Modern Many-core Architectures

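Basic Information

Award number:
2107283
Principal investigator:
Kyle Chard
Amount:
$166,300 (approx.)
Host institution:
University of Chicago
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2021
Funding country:
United States
Period:
2021-07-01 to 2024-06-30
Project status:
Completed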

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2107283&HistoricalAwards=false
Keywords:
Collaborative Research OAC Core Enabling

Project Abstract

Computer systems are becoming increasingly complex: multisocket systems with many-core processors and general-purpose graphics processors have the potential to address the needs of demanding applications at the node level. Programmability and efficiency are often hard to achieve together, because hardware parallelism has grown by several orders of magnitude, to thousands of computing units on a chip. Task parallelism is an important type of parallelism in which computation is broken down into a set of inter-dependent tasks that can be executed concurrently on various computing units. To achieve strong scaling and high levels of effective parallelism, today's parallel languages increasingly need to support over-decomposition (many more tasks than cores) in order to improve performance, hide latency caused by blocking operations, and otherwise achieve maximum speedup. By enabling efficient support of fine-grained parallelism across the growing range of scales seen in modern and future hardware, the project is expected to enhance the productivity of parallel programmers. Trends suggest that most of the Top500 high-performance computing systems will likely employ hardware that this work directly targets.

The project aims to conduct a high-impact education program in distributed parallel programming with broad reach, encouraging student internships grounded in real-world challenges, and paving the way for technology transfer from research to open-source projects. Special emphasis is placed on engaging women and underrepresented minorities. This education facet will create a new and more accessible foundation for fluency in parallel computing for scientists and engineers.

This work explores novel data structures and algorithms that allow for scalable runtime and execution models for fine-grained parallelism at sub-microsecond timescales. Preliminary work by the PIs at the language and runtime levels suggests a path to achieving this. The project objectives are: 1) a unifying runtime that enables task granularities measured in cycles: design, analysis, and implementation of building blocks for efficient fine-grained computing on diverse node hardware; 2) evaluating the performance of these building blocks in the context of real parallel systems and application kernels on a range of computer architectures; 3) measuring the performance and scalability impact of the runtime on benchmark kernels and real applications; and 4) integrating this research with education programs from undergraduate to graduate levels through new course material on parallel computing.

This high-risk/high-reward research is geared towards yielding transformative improvements in the ease and efficiency of programming parallel machines at every scale. The contributions lie in the realization of productive, implicitly parallel high-level languages optimized for single-node deployments with many-core architectures to support fine-grained parallelism measured in cycles, enabling an entirely new class of many-task computing applications. The dataflow architecture makes implicit parallelism tractable with a programming model whose impact could rival that of MATLAB, R, and Python, with the added benefit that the same code could also run in a distributed system or on large-scale HPC systems. Thus, the scientist would be able to write a program once, run it at any suitable scale, and have it seamlessly use the most appropriate granularity for each component of the hardware. This work's innovations in dataflow architecture will be broadly applicable to a number of existing parallel programming systems such as OpenMP, Swift/Parsl, and CUDA/OpenCL, in terms of both efficiency in executing fine-grained parallelism and adding support for implicit parallelism where possible. Target hardware includes Intel/AMD x86, ThunderX/2 ARM, IBM Power9, and NVIDIA/AMD GPUs.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
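
As a rough illustration of the over-decomposition idea mentioned in the abstract (many more tasks than cores), the sketch below splits a reduction into several small tasks per worker and lets a pool schedule them. It is not the project's runtime: the names `chunk_sum`, `over_decomposed_sum`, and `tasks_per_worker` are hypothetical, and Python threads are used only to show the task structure, not to demonstrate speedup at the cycle-level granularities the project targets.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def chunk_sum(values, start, stop):
    """One fine-grained task: sum a small slice of the input."""
    return sum(values[start:stop])


def over_decomposed_sum(values, workers=None, tasks_per_worker=8):
    """Split the work into many more tasks than workers, then combine.

    Returns (total, number_of_tasks). The factor tasks_per_worker is an
    illustrative assumption, not a value taken from the project.
    """
    workers = workers or os.cpu_count() or 1
    n_tasks = max(1, workers * tasks_per_worker)
    # Ceiling division so that every element falls into some chunk.
    chunk = max(1, -(-len(values) // n_tasks))
    bounds = [(i, min(i + chunk, len(values)))
              for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The pool hands out chunks to idle workers as they finish earlier ones.
        partials = pool.map(lambda b: chunk_sum(values, *b), bounds)
        return sum(partials), len(bounds)


if __name__ == "__main__":
    total, n_tasks = over_decomposed_sum(list(range(1_000_000)), workers=4)
    print(total, n_tasks)
```

Because there are more chunks than workers, a worker that finishes early simply picks up the next chunk, which is the load-balancing and latency-hiding effect the abstract attributes to over-decomposition.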

Project Outcomes

Journal articles (1)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Enabling Extremely Fine-grained Parallelism via Scalable Concurrent Queues on Modern Many-core Architectures

DOI:
10.1109/mascots53633.2021.9614292
Published:
2021-11
Journal:
and Simulation of Computer and Telecommunication Systems (MASCOTS '21
Impact factor:
0
Authors:
Nookala, Poornima;Dinda, Peter;Hale, Kyle C.;Chard, Kyle;Raicu, Ioan
Corresponding author:
Raicu, Ioan

Other publications by Kyle Chard

A Distributed Economic Meta-scheduler for the Grid

DOI:
10.1109/ccgrid.2008.48
Published:
2008-05-19
Journal:
2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
Impact factor:
0
Authors:
Kyle Chard;K. Bubendorfer
Corresponding author:
K. Bubendorfer

QoS-aware edge AI placement and scheduling with multiple implementations in FaaS-based edge computing

DOI:
10.1016/j.future.2024.03.035
Published:
2024-03-01
Journal:
Future Gener. Comput. Syst.
Impact factor:
0
Authors:
Nathaniel Hudson;Hana Khamfroush;Matt Baughman;D. Lucani;Kyle Chard;Ian T. Foster
Corresponding author:
Ian T. Foster

SECRE: Surrogate-Based Error-Controlled Lossy Compression Ratio Estimation Framework

DOI:
Published:
2023
Journal:
International Conference on High Performance Computing
Impact factor:
0
Authors:
Arham Khan;S. Di;Kai Zhao;Jinyang Liu;Kyle Chard;Ian T. Foster;Franck Cappello
Corresponding author:
Franck Cappello

Regulating Traffic in a Crowded Cache: Overcoming the Container Explosion Problem

DOI:
Published:
2021
Journal:
Impact factor:
0
Authors:
Kevin Gao;Tim Shaffer;Kyle Chard
Corresponding author:
Kyle Chard

Using Facebook as a Cloud Platform for Solving Numerical Optimization Problem

DOI:
10.5120/8739-3197
Published:
2012-10-20
Journal:
International Journal of Computer Applications
Impact factor:
0
Authors:
M. R. Islam;S. Mahi;Abu Sina;Mohammad Raju Chowdhury;Kyle Chard;Simon Caton;Omer Rana;K. Bubendorfer;O. Mengshoel;David E. Goldberg
Corresponding author:
David E. Goldberg