Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method that adopts a novel loss function and a proper number of negative samples. However, no prior work has optimized SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and analysis of existing SimpleX implementations and identify their performance bottlenecks, including (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system (called HEAT) that fully exploits the multi-level caching and multi-threading capabilities of modern CPUs. Specifically, the optimization of HEAT is threefold: (1) it tiles the embedding matrix to increase data locality and reduce cache misses (and thus read latency); (2) it optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular the similarity computation therein, to avoid memory copies for matrix data preparation; and (3) it aggressively reuses intermediate results from the forward phase in the backward phase to alleviate redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to 45.2× speedup over the existing CPU solution, and 4.5× speedup and 7.9× cost reduction in the cloud over the existing GPU solution with an NVIDIA V100 GPU.