Managing memory between the CPU and GPU is a major challenge in GPU computing. Nvidia recently introduced a programming model, Unified Memory Access (UMA), that aims to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate both its performance and its programming-model simplifications based on our experimental results. We find that, beyond on-demand data transfers to the CPU, the GPU can also request on demand only the subsets of data it requires. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict the flexibility to add future optimizations.
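To illustrate the programming-model simplification the abstract refers to, the following is a minimal sketch contrasting the conventional explicit-transfer CUDA pattern with a managed (UMA) allocation. The kernel, array size, and launch configuration are illustrative choices, not taken from the paper; only `cudaMalloc`, `cudaMemcpy`, `cudaMallocManaged`, and `cudaDeviceSynchronize` are standard CUDA runtime API calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial illustrative kernel: double each element in place.
__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;

    // Explicit model: the programmer stages every host<->device transfer.
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    free(h);

    // UMA model: a single managed allocation visible to both processors;
    // data migrates on demand, so no explicit cudaMemcpy calls are needed.
    float *m;
    cudaMallocManaged(&m, n * sizeof(float));
    for (int i = 0; i < n; ++i) m[i] = 1.0f;  // CPU touches the data
    scale<<<(n + 255) / 256, 256>>>(m, n);    // GPU pulls the data it needs
    cudaDeviceSynchronize();                  // required before the CPU reads m
    printf("%f\n", m[0]);
    cudaFree(m);
    return 0;
}
```

The managed version removes the staging boilerplate, but, as the abstract notes, this convenience also removes the programmer's control over when and how much data moves, which is where the observed overheads and the lost optimization flexibility come from.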