High-Performance Tensor Learning Primitives Using GPU Tensor Cores

IEEE Transactions on Computers (2023)

Abstract
Tensor learning is a powerful tool for big data analytics and machine learning, e.g., gene analysis and deep learning. However, tensor learning algorithms are compute-intensive: their time and space complexities grow exponentially with the order of the tensors, which hinders their application. In this paper, we exploit the parallelism of tensor learning primitives on GPU tensor cores and develop high-performance tensor learning algorithms. First, we propose novel hardware-oriented optimization strategies for tensor learning primitives on GPU tensor cores. Second, for big data analytics, we employ the optimized primitives to accelerate CP tensor decomposition and apply it to gene analysis. Third, we optimize Tucker tensor decomposition and propose a novel Tucker tensor layer to compress deep neural networks. We train the networks with natural gradients, which involve only a forward pass without backpropagation and are thus well suited to GPU computation. Compared with the TensorLab and TensorLy libraries on an A100 GPU, our third-order CP tensor decomposition achieves speedups of up to $16.32\times$ and $32.25\times$, respectively, and our third-order Tucker tensor decomposition achieves speedups of up to $6.09\times$ and $6.72\times$. The proposed fourth-order CP and Tucker tensor decompositions achieve speedups of up to $30.65\times$ and $5.41\times$ over TensorLab. Our CP tensor decomposition for gene analysis achieves up to a $5.88\times$ speedup over TensorLy. Compared with a conventional fully connected neural network, our Tucker tensor layer network achieves $97.9\%$ accuracy, a $4.47\times$ speedup, and a compression ratio of $2.92$, at the cost of a $0.4\%$ drop in accuracy.
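For readers unfamiliar with the two decompositions the abstract refers to, the following is a minimal NumPy sketch of the CP and Tucker tensor models for a third-order tensor. It is purely illustrative, not the paper's GPU tensor-core implementation; the rank $R=4$ and mode sizes of 8 are arbitrary choices for the demo.

```python
# Illustrative sketch (not the paper's GPU kernels): the CP model expresses a
# third-order tensor as a sum of R rank-one terms, X = sum_r a_r o b_r o c_r;
# Tucker generalizes this with a dense core G contracted along each mode.
import numpy as np

rng = np.random.default_rng(0)
R = 4                                                       # rank (arbitrary)
A, B, C = (rng.standard_normal((8, R)) for _ in range(3))   # factor matrices

# Build a full tensor from its CP factors.
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Tucker model: X_tucker = G x1 A x2 B x3 C, with a dense R x R x R core.
G = rng.standard_normal((R, R, R))
X_tucker = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

# CP is the special case of Tucker whose core is superdiagonal (identity).
I = np.zeros((R, R, R))
I[np.arange(R), np.arange(R), np.arange(R)] = 1.0
assert np.allclose(X, np.einsum('pqr,ip,jq,kr->ijk', I, A, B, C))

print(X.shape, X_tucker.shape)  # (8, 8, 8) (8, 8, 8)
```

Because CP stores only the three $8 \times R$ factor matrices (and Tucker additionally a small core), both models replace the $8^3$ entries of the full tensor with far fewer parameters, which is the basis of the compression ratio reported above.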
Keywords
Tensor learning, tensor computing, GPU tensor cores, tensor layer, neural network