FPGAs are a promising platform for accelerating Deep Learning (DL) applications due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more efficiently support the computational demands of DL workloads. However, the two most prominent AI-optimized FPGAs, the AMD/Xilinx Versal ACAP and the Intel Stratix 10 NX, employ significantly different architectural approaches. This paper presents novel systematic frameworks to optimize the performance of General Matrix Multiplication (GEMM), a fundamental operation in DL workloads, by exploiting the distinct architectural characteristics of each FPGA. Our evaluation on int8 GEMM workloads shows throughput of up to 77 TOPs on the Versal VC1902 and 68 TOPs on the Stratix 10 NX, with energy efficiency of up to 0.94 and 1.35 TOPs/W, respectively. This work provides insights and guidelines for optimizing GEMM-based applications on both platforms, while also delving into their programmability trade-offs and associated challenges.
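For reference, the computation being accelerated is the standard GEMM loop nest sketched below. This is a minimal illustrative sketch, not code from the paper: the function name gemm_int8 and the dimensions M, N, K are hypothetical, and it assumes int8 operands with int32 accumulation, as is typical for quantized DL inference. Throughput figures such as the TOPs numbers above count the 2·M·N·K multiply and add operations performed by this kernel.

```cpp
#include <cstdint>
#include <vector>

// C[M][N] += A[M][K] * B[K][N], with int8 inputs accumulated in int32
// to avoid overflow. Matrices are stored in row-major order.
void gemm_int8(const std::vector<int8_t>& A,
               const std::vector<int8_t>& B,
               std::vector<int32_t>& C,
               int M, int N, int K) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k) {
                // Widen each int8 operand before multiplying.
                acc += static_cast<int32_t>(A[m * K + k]) *
                       static_cast<int32_t>(B[k * N + n]);
            }
            C[m * N + n] += acc;
        }
    }
}
```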