SHF: Small: Empirical Autotuning of Parallel Computation for Scalable Hybrid Systems
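Basic information
- Award number: 1527706
- Principal investigator:
- Amount: $450,000
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2015
- Funding country: United States
- Period: 2015-07-15 to 2019-06-30
- Status: Completed
- Source:
- Keywords:
Project abstract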
Today, scientific and engineering computing is synonymous with parallel computing. Applications such as climate modeling, drug design, and aircraft design use very large supercomputer installations, with power consumption measured in megawatts and the cost of electricity measured in millions of dollars. At the same time, every parallel application requires some level of tuning to ensure that the software is mapped appropriately to the hardware. Otherwise, suboptimal performance leads to lost cycles, kilowatt-hours, and, ultimately, dollars. Tuning the application by making repeated runs is also wasteful at very large scale.
The DARE project addresses this problem by tuning the application through modeling and simulation of its behavior at very large scale, rather than actually running it. The resources required for tuning are therefore marginal compared to those consumed in production runs. DARE is based on the observation that the same approach that replaces a wind tunnel with a computer simulation of the airfoil can be applied to the software itself. Two aspects of today's high-end computing landscape make the DARE work unique: 1) the prevalence of hardware accelerators, such as Graphics Processing Units and Xeon Phi co-processors, and 2) the adoption of task-based, dynamic work scheduling systems as an alternative to traditional, lock-step parallel programming models.
In particular, DARE combines three components into a refinement loop: a hardware analysis component, a kernel modeling component, and a workload simulation component. The role of the hardware analysis component is to extract the basic hardware information, such as processing power and data link speed. The role of the kernel modeling component is to provide performance models of the serial kernels that constitute the building blocks of the parallel program.
Finally, the role of the simulation component is to simulate large-scale parallel workloads.
The hardware analysis component gathers the basic knowledge about the system, such as: the number of CPU sockets per shared memory node, the number of CPU cores in each socket, the cache hierarchy, the existence of hyper-threading, the number of NUMA nodes and the proximity of CPUs to NUMA nodes, the number of GPU accelerators or Xeon Phi co-processors and the capacities of their device memories, and the topology and bandwidth of data links, both within each node (buses) and between nodes (network switches). Part of this knowledge can be gathered by using appropriate query APIs, such as hwloc, netloc, PAPI, and those provided in the CUDA SDK, OpenCL SDK, and Xeon Phi SDK. Synthetic tests can be used for parameters that cannot be established in this manner.
Kernels are essentially the serial building blocks of parallel problems. Although kernels are usually characterized by serial control flow, most of the time they already rely on a high degree of data parallelism. Today's CPUs get most of their performance from SIMD parallelism, and GPUs get their performance from massive SIMT parallelism. The role of the kernel modeling component is two-fold: 1) to tune kernels for maximum performance at a given granularity, and 2) to provide the kernel performance model as a function of granularity, which changes to accommodate parallel execution.
DARE turns to a stochastic time-stepping simulation to predict the performance of a dynamic runtime scheduler for two fundamental reasons: 1) Building good performance models by benchmarking actual parallel runs requires a significant number of runs with significant problem sizes, which is simply too time consuming. 2) The impact of many tuning parameters is too complex to be modeled by sparsely sampling the tuning space and fitting simple curves or surfaces to the sample points.
The answer to the problem is to replace the run with a time-stepping simulation. A given task-based scheduler is still used for assigning tasks to cores, but instead of invoking actual kernel tasks, control is passed to a progress-tracking simulation system, which relies on kernel performance models to simulate the execution of the tasks and produce a virtual trace of the simulated execution. The performance advantage is twofold: 1) simulating a single run is much faster than actually making that run, and 2) many simulations can be run in parallel, allowing for fast sweeps through a large parameter search space.
DARE replaces the standard waterfall autotuning process with a process that is incremental and iterative in nature. The power of the DARE approach lies in the mutual refinement loop, where each of the three phases is capable of massively pruning the search space for the other two. As a result, very high quality models can be built for a particular workload, since time is spent refining the model for the conditions that actually apply, rather than sampling the search space in areas never touched at runtime.
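To make the idea of tuning by simulation concrete, the following is a minimal Python sketch, not the DARE implementation. It assumes a hypothetical kernel model (`kernel_time`, with made-up peak rate, overhead, and efficiency curve), treats the tile updates of a matrix product as independent tasks, and uses a simple event-driven greedy scheduler with random noise in place of the stochastic time-stepping simulator the abstract describes. All function names and parameters are illustrative assumptions.

```python
import heapq
import random


def kernel_time(tile, peak_gflops=10.0, overhead_s=5e-6, half_tile=64):
    """Hypothetical kernel model: seconds for one tile x tile x tile update.

    Small tiles run below peak (efficiency = tile / (tile + half_tile)),
    and every task pays a fixed scheduling overhead. All constants are
    illustrative assumptions, not measured values.
    """
    flops = 2.0 * tile ** 3
    efficiency = tile / (tile + half_tile)
    return overhead_s + flops / (peak_gflops * 1e9 * efficiency)


def simulate(n, tile, workers, seed=0, noise=0.05):
    """Simulate a greedy scheduler and return the virtual makespan in seconds.

    No kernel is executed: each task only advances a worker's virtual clock
    by the modeled duration, perturbed by uniform random noise.
    Assumes n is a multiple of tile.
    """
    rng = random.Random(seed)
    tiles_per_dim = n // tile
    num_tasks = tiles_per_dim ** 3          # independent tile updates
    free_at = [0.0] * workers               # virtual time each worker is free
    heapq.heapify(free_at)
    base = kernel_time(tile)
    makespan = 0.0
    for _ in range(num_tasks):
        start = heapq.heappop(free_at)      # earliest available worker
        finish = start + base * (1.0 + rng.uniform(-noise, noise))
        heapq.heappush(free_at, finish)
        makespan = max(makespan, finish)
    return makespan


def sweep(n, workers, tiles):
    """Simulate every candidate tile size and return (best_tile, all_results)."""
    results = {tile: simulate(n, tile, workers) for tile in tiles}
    best = min(results, key=results.get)
    return best, results


if __name__ == "__main__":
    best, results = sweep(1024, 8, [32, 64, 128, 256, 512, 1024])
    for tile, seconds in sorted(results.items()):
        print(f"tile={tile:5d}  simulated makespan={seconds:.4f} s")
    print("best tile:", best)
```

The sweep shows the granularity trade-off in miniature: very small tiles lose time to overhead and low kernel efficiency, while very large tiles leave workers idle. Because each simulated run only updates virtual clocks, many configurations can be evaluated for a small fraction of the cost of real runs.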
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Jack Dongarra
hipMAGMA v1.0
- DOI:
- Published: 2020
- Journal:
- Impact factor: 0
- Authors: Cade Brown; Ahmad Abdelfattah; Stanimire Tomov; Jack Dongarra
- Corresponding author: Jack Dongarra
The eigenvalue problem for Hermitian matrices with time reversal symmetry
- DOI: 10.1016/0024-3795(84)90068-5
- Published: 1984
- Journal:
- Impact factor: 1.1
- Authors: Jack Dongarra; J. R. Gabriel; D. D. Koelling; James Hardy Wilkinson
- Corresponding author: James Hardy Wilkinson
Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU clusters
- DOI:
- Published: 2018
- Journal:
- Impact factor: 0
- Authors: Ichitaro Yamazaki; Ahmad Abdelfattah; Akihiro Ida; Satoshi Ohshima; Stanimire Tomov; Rio Yokota; Jack Dongarra
- Corresponding author: Jack Dongarra
Other grants by Jack Dongarra
Travel: Workshop on Clusters, Clouds, and Data Analytics for Scientific Computing 2024
- Award number: 2336813
- Fiscal year: 2023
- Amount: $450,000
- Award type: Standard Grant
Workshop on Clusters, Clouds, and Data Analytics for Scientific Computing
- Award number: 2001329
- Fiscal year: 2020
- Amount: $450,000
- Award type: Standard Grant
Workshop on Clusters, Clouds, and Data Analytics in Scientific Computing
- Award number: 1800946
- Fiscal year: 2018
- Amount: $450,000
- Award type: Standard Grant
Toward a common digital continuum platform for big data and extreme-scale computing (BDEC2)
- Award number: 1849625
- Fiscal year: 2018
- Amount: $450,000
- Award type: Standard Grant
Collaborative Research: ACI-CDS&E: Highly Parallel Algorithms and Architectures for Convex Optimization for Realtime Embedded Systems (CORES)
- Award number: 1709069
- Fiscal year: 2017
- Amount: $450,000
- Award type: Standard Grant
Workshop on Clusters, Clouds and Data Analytics in Scientific Computing
- Award number: 1606551
- Fiscal year: 2016
- Amount: $450,000
- Award type: Standard Grant
Collaborative Research: EMBRACE: Evolvable Methods for Benchmarking Realism through Application and Community Engagement
- Award number: 1535025
- Fiscal year: 2015
- Amount: $450,000
- Award type: Standard Grant
SI2-SSI: Collaborative Proposal: Performance Application Programming Interface for Extreme-Scale Environments (PAPI-EX)
- Award number: 1450429
- Fiscal year: 2015
- Amount: $450,000
- Award type: Standard Grant
CSR: Medium: Collaborative Research: SparseKaffe: high-performance, auto-tuned, energy-aware algorithms for sparse direct methods on modern heterogeneous architectures
- Award number: 1514286
- Fiscal year: 2015
- Amount: $450,000
- Award type: Continuing Grant
EAGER: Collaborative Research: Memristive Accelerator for Extreme Scale Linear Solvers
- Award number: 1548093
- Fiscal year: 2015
- Amount: $450,000
- Award type: Standard Grant
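Similar grants from the National Natural Science Foundation of China (NSFC)
Business risks of small and micro enterprises and evaluation of public policy effects during the COVID-19 pandemic: empirical evidence from catering enterprises
- Award number:
- Year approved: 2022
- Amount: CNY 300,000
- Award type: Young Scientists Fund
Research on a multi-directional, multi-modal acoustic emission time-frequency localization method for leaks in fluid pipe networks based on the empirical wavelet transform
- Award number: 61703066
- Year approved: 2017
- Amount: CNY 180,000
- Award type: Young Scientists Fund
Research on quantitative monitoring of fatigue cracks in steel bridge decks based on improved empirical wavelet analysis of acoustic emission signals
- Award number: 51708164
- Year approved: 2017
- Amount: CNY 230,000
- Award type: Young Scientists Fund
Empirical wavelet transform theory and its application to mechanical fault diagnosis
- Award number: 51505002
- Year approved: 2015
- Amount: CNY 200,000
- Award type: Young Scientists Fund
Research on small area estimation methods in sample surveys
- Award number: 11301514
- Year approved: 2013
- Amount: CNY 220,000
- Award type: Young Scientists Fund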
Similar overseas grants
Empirical Research on Formation of new HR-Practices in German Firms
- Award number: 22K01719
- Fiscal year: 2022
- Amount: $450,000
- Award type: Grant-in-Aid for Scientific Research (C)
The Empirical Study of Gender (EGEN) Research Network: Small Research Prizes to Graduate Students and Early Career Faculty
- Award number: 2215500
- Fiscal year: 2022
- Amount: $450,000
- Award type: Standard Grant
An Attempt to Improve Empirical Research in Economics Focusing on Statistical Hypothesis Testing
- Award number: 22K18530
- Fiscal year: 2022
- Amount: $450,000
- Award type: Grant-in-Aid for Challenging Research (Exploratory)
Empirical Studies on Inclusiveness and Exclusiveness of Sharing of Technologies in East African Small and Medium-sized Manufacturers
- Award number: 21H03706
- Fiscal year: 2021
- Amount: $450,000
- Award type: Grant-in-Aid for Scientific Research (B)
Comparative Empirical Research on the Economic Effects of the Lehman Brothers Collapse and COVID-19 Pandemic
- Award number: 21K01590
- Fiscal year: 2021
- Amount: $450,000
- Award type: Grant-in-Aid for Scientific Research (C)