Efficient dense matrix-vector multiplication on GPU.

Concurrency and Computation: Practice and Experience (2018)

Abstract
Given that dense matrix-vector multiplication (Ax or A^T x) is of great importance in scientific computations, this paper investigates how to accelerate it on the graphics processing unit (GPU). We present a warp-based implementation of Ax on the GPU, called GEMV-Adaptive, and a thread-based implementation of A^T x on the GPU, called GEMV-T-Adaptive. The proposed GEMV-Adaptive and GEMV-T-Adaptive offer the following novelties: (1) an adaptive warp allocation strategy for GEMV-Adaptive that assigns the optimal number of warps to each matrix row, (2) an adaptive thread allocation strategy for GEMV-T-Adaptive that assigns the optimal number of threads to each matrix row, and (3) several optimization schemes. Experimental results show that the proposed GEMV-Adaptive and GEMV-T-Adaptive mitigate the performance fluctuations of the cuBLAS implementations, consistently achieve high performance, and outperform the most recently proposed GEMV and GEMV-T kernels of Gao et al., respectively, for all test matrices.
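To make the warp-based approach concrete, below is a minimal CUDA sketch of the general warp-per-row GEMV pattern (y = Ax) that GEMV-Adaptive builds on. This is an illustrative sketch, not the authors' kernel: it fixes one warp per row, whereas GEMV-Adaptive adaptively chooses the warp count per row; the kernel name, launch configuration, and row-major storage here are assumptions.

// Illustrative warp-per-row GEMV sketch (y = A*x), not the paper's
// GEMV-Adaptive kernel: one warp per row is hard-coded here, while the
// paper assigns an adaptive number of warps to each row.
#include <cuda_runtime.h>

#define WARP_SIZE 32

// A is m x n, row-major; each warp cooperatively reduces one row.
__global__ void gemv_warp_per_row(int m, int n,
                                  const float* __restrict__ A,
                                  const float* __restrict__ x,
                                  float* __restrict__ y)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    int lane    = threadIdx.x % WARP_SIZE;
    if (warp_id >= m) return;  // warp-uniform exit, safe for shuffles below

    // Each lane strides over the row's columns and accumulates a partial sum.
    float sum = 0.0f;
    for (int j = lane; j < n; j += WARP_SIZE)
        sum += A[(size_t)warp_id * n + j] * x[j];

    // Warp-level tree reduction via register shuffles.
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[warp_id] = sum;
}

With a block size that is a multiple of 32 (say 128 threads, i.e. 4 warps per block), a launch of ceil(m / 4.0) blocks covers all m rows; the transposed case A^T x would analogously be handled by a thread-based kernel, which GEMV-T-Adaptive refines with its adaptive per-row thread allocation.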
Keywords
CUDA, dense matrix-vector multiplication, GPU