Today's large-scale scientific applications running on high-performance computing (HPC) systems generate vast data volumes. Thus, data compression is becoming a critical technique to mitigate the storage burden and data-movement cost. However, existing lossy compressors for scientific data cannot achieve a high compression ratio and high throughput simultaneously, hindering their adoption in many applications that require fast compression, such as in-memory compression. To this end, in this work, we develop a fast and high-ratio error-bounded lossy compressor on GPUs for scientific data (called FZ-GPU). Specifically, we first design a new compression pipeline that consists of fully parallelized quantization, bitshuffle, and our newly designed fast encoding. Then, we propose a series of deep architectural optimizations for each kernel in the pipeline to take full advantage of CUDA architectures. Specifically, we apply a warp-level optimization to avoid data conflicts for bit-wise operations in bitshuffle, maximize shared memory utilization, and eliminate unnecessary data movements by fusing different compression kernels. Finally, we evaluate FZ-GPU on two NVIDIA GPUs (i.e., A100 and RTX A4000) using six representative scientific datasets from SDRBench. Results on the A100 GPU show that FZ-GPU achieves an average speedup of 4.2× over cuSZ and an average speedup of 37.0× over a multi-threaded CPU implementation of our algorithm under the same error bound. FZ-GPU also achieves an average speedup of 2.3× and an average compression ratio improvement of 2.0× over cuZFP under the same data distortion.
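The first two stages of the pipeline described above (error-bounded quantization followed by bitshuffle) can be illustrated with a minimal, sequential Python sketch. This is an assumption-laden illustration, not FZ-GPU's CUDA implementation: the previous-value predictor, the function names, and the 32-bit code width are all illustrative choices, and the fully parallelized GPU kernels, warp-level bit operations, and fused kernels of the actual system are not modeled here.

```python
def quantize(data, eb):
    # Illustrative linear-scaling quantization with a simple previous-value
    # predictor (an assumption; FZ-GPU's real predictor/kernels differ).
    # Guarantees |original - reconstructed| <= eb per element.
    codes = []
    prev = 0.0
    for x in data:
        q = round((x - prev) / (2 * eb))  # quantize the prediction residual
        prev = prev + q * 2 * eb          # value the decompressor reconstructs
        codes.append(q)
    return codes

def dequantize(codes, eb):
    # Inverse of quantize(): replay the predictor and add back the residuals.
    out = []
    prev = 0.0
    for q in codes:
        prev = prev + q * 2 * eb
        out.append(prev)
    return out

def bitshuffle(codes, width=32):
    # Regroup the i-th bit of every code into one "bit plane". For smooth
    # scientific data most high-order planes become all zeros, which a
    # downstream encoder can collapse cheaply. Sequential sketch only.
    mask = (1 << width) - 1
    planes = []
    for bit in range(width):
        plane = 0
        for j, q in enumerate(codes):
            plane |= (((q & mask) >> bit) & 1) << j
        planes.append(plane)
    return planes
```

For example, quantizing a smooth sequence with `eb = 0.01` and dequantizing it back yields values within `0.01` of the originals, while bitshuffling the resulting small integer codes produces mostly-zero planes that are easy to encode.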