SHF:Small: Solving the Problem of Scalable Multi-Precision Matrix Arithmetic on GPUs

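Basic Information

Award Number:
1217590
Principal Investigator:
Charles Weems
Amount:
$450,000
Institution:
University of Massachusetts Amherst
Institution Country:
United States
Award Type:
Standard Grant
Fiscal Year:
2012
Funding Country:
United States
Project Period:
2012-06-01 to 2016-05-31
Project Status:
Completed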

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1217590&HistoricalAwards=false
Keywords:
SHF Small Solving Problem Scalable

Project Abstract

Computers directly support arithmetic that is typically limited to 64 bits (about 19 decimal digits) of precision. Applications that need more precision must implement arithmetic through computationally expensive software. Beyond about 256 bits of precision, such calculations become quite costly. The RSA encryption algorithm, for example, can require arithmetic with up to 4096 bits of precision. Applications in areas such as experimental mathematics and number theory can require millions of bits of precision. One multiplication with 10 million bits of precision can take a tenth of a second to compute on a modern processor, which means that matrix arithmetic using such large values can take days to weeks to execute. In previous work the investigators have shown that it is possible to obtain a factor of 20 improvement in performance by utilizing the parallel processing capabilities of a commodity graphics processing unit (GPU) in place of the traditional CPU. However, programming a GPU to achieve this level of performance is quite difficult, and the resulting code requires considerable hand-tuning to move it to new generations of GPU and gain the advantage of their performance, which is scaling up at a rate that exceeds CPU performance scaling.
This project is working to develop a framework that automatically generates and tunes multi-precision arithmetic libraries to execute on successive generations of GPUs. The libraries include both scalar and basic matrix arithmetic routines. They support scaling in precision as well as matrix size. The problem is challenging because different parallel algorithms must be automatically selected for different levels of precision, which must be balanced with the exploitation of the alternate dimension of parallelism inherent in matrix arithmetic.
In addition, the work seeks to employ distributed parallelism across a cluster of computers enhanced with GPUs, so that the libraries can be used on a new generation of GPU-based supercomputers that is beginning to be deployed at national laboratories. The work is significant because it enables easier exploitation of low-cost commodity graphics processors to achieve more than an order of magnitude increase in performance for multi-precision scalar and matrix arithmetic. One important application is enhancing performance of RSA encryption to support longer, more secure keys, at greater data rates, so that it becomes feasible to encrypt greater volumes of internet traffic. Another important use is experimental mathematics, where computationally expensive functions (e.g., integrals, infinite series) are computed at high precision and compared to other functions and high precision constants to help identify more efficient closed-form solutions. Results from experimental mathematics have found applications in particle physics, chaos theory, and calculation of fundamental constants. The resulting software framework offers a significant performance enhancement for multi-precision arithmetic to systems that range from individual researcher workstations to large supercomputers.
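The abstract's point that different algorithms must be selected for different levels of precision can be illustrated with a small CPU-side sketch. This is not the project's framework or GPU code. It is a minimal illustration, assuming 32-bit limbs, a schoolbook multiply for small operands, and a switch to Karatsuba above a made-up threshold. The names `multiply`, `schoolbook_mul`, and `KARATSUBA_THRESHOLD_LIMBS` are hypothetical, and the threshold value is arbitrary rather than tuned.

```python
LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1
# Hypothetical crossover point; a real library would tune this per device.
KARATSUBA_THRESHOLD_LIMBS = 32


def to_limbs(n):
    """Split a non-negative int into little-endian 32-bit limbs."""
    limbs = []
    while n:
        limbs.append(n & LIMB_MASK)
        n >>= LIMB_BITS
    return limbs or [0]


def from_limbs(limbs):
    """Reassemble little-endian limbs into a Python int."""
    n = 0
    for limb in reversed(limbs):
        n = (n << LIMB_BITS) | limb
    return n


def schoolbook_mul(a, b):
    """Quadratic-time product of two limb arrays, with carry propagation."""
    out = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            t = out[i + j] + x * y + carry
            out[i + j] = t & LIMB_MASK
            carry = t >> LIMB_BITS
        out[i + len(b)] = carry
    return out


def multiply(x, y):
    """Choose an algorithm from the operand size, then multiply.

    Python's built-in ints handle the shifts and additions here; only the
    choice between the two multiplication strategies is being illustrated.
    """
    bits = max(x.bit_length(), y.bit_length())
    n_limbs = max(1, (bits + LIMB_BITS - 1) // LIMB_BITS)
    if n_limbs < KARATSUBA_THRESHOLD_LIMBS:
        return from_limbs(schoolbook_mul(to_limbs(x), to_limbs(y)))
    # Karatsuba: three half-size products instead of four.
    half = (n_limbs // 2) * LIMB_BITS
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    lo = multiply(x_lo, y_lo)
    hi = multiply(x_hi, y_hi)
    mid = multiply(x_lo + x_hi, y_lo + y_hi) - lo - hi
    return (hi << (2 * half)) + (mid << half) + lo
```

With these assumptions, a 4096-bit operand (128 limbs, the RSA size mentioned above) takes the Karatsuba path, while a 64-bit operand stays on the schoolbook path. A GPU library would face the same kind of choice, with the added question of how to divide parallel work between the limbs of one number and the elements of a matrix.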