Efficient dense matrix-vector multiplication on GPU.

Concurrency and Computation: Practice and Experience (2018)

Abstract
Given that dense matrix-vector multiplication (Ax or A^T x) is of great importance in scientific computations, this paper investigates how to accelerate it on the graphics processing unit (GPU). We present a warp-based implementation of Ax on the GPU, called GEMV-Adaptive, and a thread-based implementation of A^T x on the GPU, called GEMV-T-Adaptive. The proposed GEMV-Adaptive and GEMV-T-Adaptive offer the following novelties: (1) an adaptive warp allocation strategy for GEMV-Adaptive that assigns the optimal number of warps to each matrix row, (2) an adaptive thread allocation strategy for GEMV-T-Adaptive that assigns the optimal number of threads to each matrix row, and (3) several optimization schemes. Experimental results show that the proposed GEMV-Adaptive and GEMV-T-Adaptive mitigate the performance fluctuations of the cuBLAS implementations, consistently achieve high performance, and outperform the most recently proposed GEMV and GEMV-T kernels of Gao et al., respectively, for all test matrices.
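To make the warp-based approach concrete, below is a minimal CUDA sketch of the general warp-per-row GEMV pattern (y = Ax) that GEMV-Adaptive builds on. This is an illustrative sketch, not the authors' kernel: it fixes one warp per row, whereas GEMV-Adaptive adaptively chooses the warp count per row; the kernel name, launch configuration, and row-major storage here are assumptions.

// Illustrative warp-per-row GEMV sketch (y = A*x), not the paper's
// GEMV-Adaptive kernel: one warp per row is hard-coded here, while the
// paper assigns an adaptive number of warps to each row.
#include <cuda_runtime.h>

#define WARP_SIZE 32

// A is m x n, row-major; each warp cooperatively reduces one row.
__global__ void gemv_warp_per_row(int m, int n,
                                  const float* __restrict__ A,
                                  const float* __restrict__ x,
                                  float* __restrict__ y)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    int lane    = threadIdx.x % WARP_SIZE;
    if (warp_id >= m) return;  // warp-uniform exit, safe for shuffles below

    // Each lane strides over the row's columns and accumulates a partial sum.
    float sum = 0.0f;
    for (int j = lane; j < n; j += WARP_SIZE)
        sum += A[(size_t)warp_id * n + j] * x[j];

    // Warp-level tree reduction via register shuffles.
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[warp_id] = sum;
}

With a block size that is a multiple of 32 (say 128 threads, i.e. 4 warps per block), a launch of ceil(m / 4.0) blocks covers all m rows; the transposed case A^T x would analogously be handled by a thread-based kernel, which GEMV-T-Adaptive refines with its adaptive per-row thread allocation.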
Keywords
CUDA, dense matrix-vector multiplication, GPU