
HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs

Basic Information

DOI:
10.1145/3577193.3593717
Publication date:
2023-04
Venue:
Proceedings of the 37th International Conference on Supercomputing
Impact factor:
--
Corresponding authors:
Chengming Zhang; Shaden Smith; Baixi Sun; Jiannan Tian; Jon Soifer; Xiaodong Yu; S. Song; Yuxiong He
CAS journal tier:
Other
Document type:
--
Authors: Chengming Zhang; Shaden Smith; Baixi Sun; Jiannan Tian; Jon Soifer; Xiaodong Yu; S. Song; Yuxiong He
Research area: --
MeSH terms: --
Keywords: --
Source link: PubMed detail page address

Abstract

Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method that adopts a novel loss function and a proper number of negative samples. However, no existing work optimizes SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and analysis of existing SimpleX implementations and identify their performance bottlenecks, including (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system (called HEAT) that fully exploits the multi-level caching and multi-threading capabilities of modern CPUs. Specifically, the optimization of HEAT is threefold: (1) it tiles the embedding matrix to increase data locality and reduce cache misses (thus reducing read latency); (2) it optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular the similarity computation therein, to avoid memory copies for matrix data preparation; and (3) it aggressively reuses intermediate results from the forward phase in the backward phase to alleviate redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to a 45.2× speedup over the existing CPU solution, and a 4.5× speedup and 7.9× cost reduction in the cloud over the existing GPU solution on an NVIDIA V100 GPU.
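
To make the abstract's three optimizations concrete, below is a minimal C++ sketch; it is not the authors' code, and the structure names, the embedding dimension, and the simple squared-error loss are illustrative assumptions (HEAT itself trains SimpleX's cosine contrastive loss with negative sampling). It shows similarity computed as a per-pair vector dot product rather than a gathered matrix-matrix multiplication, with the forward-phase score reused in the backward phase.

// Minimal sketch (illustrative, not HEAT's actual code) of the three ideas in
// the abstract: contiguous embedding rows for locality, per-pair vector
// products instead of a gathered GEMM, and reuse of the forward-phase score
// in the backward phase. DIM, the MSE-style loss, and all names are
// assumptions; SimpleX/HEAT use a cosine contrastive loss with sampling.
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t DIM = 64;  // embedding dimension (assumed)

struct EmbeddingTable {
    std::vector<float> data;  // num_rows * DIM, rows stored contiguously
    explicit EmbeddingTable(std::size_t rows) : data(rows * DIM, 0.01f) {}
    float* row(std::size_t r) { return data.data() + r * DIM; }
};

// Similarity as a plain dot product per (user, item) pair: no per-batch
// matrix has to be gathered and copied before a matrix-matrix multiply.
float dot(const float* a, const float* b) {
    float s = 0.0f;
    for (std::size_t k = 0; k < DIM; ++k) s += a[k] * b[k];
    return s;
}

// One SGD step for a single pair under a squared-error surrogate loss.
void sgd_step(float* u, float* v, float label, float lr) {
    float score = dot(u, v);              // forward: similarity score
    float grad = 2.0f * (score - label);  // backward reuses `score` directly
    for (std::size_t k = 0; k < DIM; ++k) {
        float gu = grad * v[k];
        float gv = grad * u[k];
        u[k] -= lr * gu;
        v[k] -= lr * gv;
    }
}

// Independent pairs are processed by independent threads. Note this naive
// parallel loop can race when one embedding row appears in two concurrent
// pairs; HEAT's tiling and scheduling of the embedding matrix is what would
// keep hot rows in cache and manage such conflicts.
void train_batch(EmbeddingTable& users, EmbeddingTable& items,
                 const std::vector<std::pair<std::size_t, std::size_t>>& pairs,
                 float lr) {
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)pairs.size(); ++i)
        sgd_step(users.row(pairs[i].first), items.row(pairs[i].second),
                 1.0f, lr);
}
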
References (40)
Cited by (0)

