Optimizing Full-Spectrum Matrix Multiplications on ARMv8 Multi-Core CPUs

Weiling Yang, Jianbin Fang, Dezun Dong, Xing Su, Zheng Wang

IEEE Transactions on Parallel and Distributed Systems (2024)

Abstract

General Matrix Multiplication (GEMM) is a key subroutine in high-performance computing. While the mainstream Basic Linear Algebra Subprograms (BLAS) libraries can deliver good performance on large and regular-shaped GEMMs, they are inadequate for optimizing small and irregular-shaped GEMMs, which are commonly seen in emerging HPC applications. Recent research has focused on improving GEMM performance on GPUs, but there is still significant room for improvement on emerging HPC hardware based on multi-core CPUs. We present LibShalom2, an open-source library to optimize full-spectrum GEMMs, covering small, irregular-shaped, and large-scale regular-shaped matrices. LibShalom2 explicitly targets the ARMv8 architecture, which is becoming common in HPC systems. LibShalom2 is designed to minimize the expensive memory access overhead for data packing and processing small matrices. It uses analytic methods to determine GEMM kernel optimization parameters, enhancing the computation and parallelization efficiency of the GEMM kernels. We evaluate LibShalom2 by applying it to three ARMv8 multi-core architectures and comparing it against five mainstream linear algebra libraries. Experimental results show that LibShalom2 consistently outperforms existing solutions across full-spectrum GEMM workloads and hardware architectures. We also show that LibShalom2 delivers an average speedup of 2.2x for real-life neural network workloads.
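
To make the data-packing overhead mentioned in the abstract concrete, the following is a minimal sketch in C. It is not taken from LibShalom2; the function names (gemm_direct, pack_b, gemm_packed, packing_cost_per_flop), the row-major layout, and the column-contiguous packed format are all assumptions chosen for illustration. It contrasts a GEMM that reads B in place with one that first copies B into a contiguous buffer, and estimates how many elements are copied per floating-point operation if both A and B are packed once.

```c
#include <assert.h>
#include <stddef.h>

/* Reference GEMM without packing: C (MxN) += A (MxK) * B (KxN), row-major. */
void gemm_direct(size_t M, size_t N, size_t K,
                 const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < M; ++i)
        for (size_t k = 0; k < K; ++k) {
            double a = A[i * K + k];
            for (size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
}

/* Hypothetical packing step: copy row-major B (KxN) into Bp so that each
 * column of B is contiguous. This costs K*N extra loads and stores. */
void pack_b(size_t N, size_t K, const double *B, double *Bp)
{
    for (size_t j = 0; j < N; ++j)
        for (size_t k = 0; k < K; ++k)
            Bp[j * K + k] = B[k * N + j];
}

/* GEMM over the packed copy: both operands are read with unit stride. */
void gemm_packed(size_t M, size_t N, size_t K,
                 const double *A, const double *Bp, double *C)
{
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < K; ++k)
                sum += A[i * K + k] * Bp[j * K + k];
            C[i * N + j] += sum;
        }
}

/* Elements copied per floating-point operation, assuming A and B are each
 * packed exactly once: (M*K + K*N) / (2*M*N*K). */
double packing_cost_per_flop(size_t M, size_t N, size_t K)
{
    assert(M > 0 && N > 0 && K > 0);
    double copied = (double)M * (double)K + (double)K * (double)N;
    double flops = 2.0 * (double)M * (double)N * (double)K;
    return copied / flops;
}
```

Under these assumptions the ratio simplifies to 1/(2N) + 1/(2M), so it shrinks as the matrices grow: the copy is amortized over many operations for large GEMMs but remains a noticeable fraction of the work when M or N is small, which is consistent with the abstract's motivation for reducing packing overhead on small matrices.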

Keywords

Full-spectrum GEMMs, ARMv8 architectures, optimization
