cuFastTuckerPlus: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores
arXiv (2024)
Abstract
Sparse tensors are prevalent in real-world applications, often characterized
by their large-scale, high-order, and high-dimensional nature. Directly
handling raw tensors is impractical due to the significant memory and
computational overhead involved. The current mainstream approach involves
compressing or decomposing the original tensor. One popular tensor
decomposition algorithm is the Tucker decomposition. However, existing
state-of-the-art algorithms for large-scale Tucker decomposition typically
relax the original optimization problem into multiple convex optimization
problems to ensure polynomial convergence. Unfortunately, these algorithms tend
to converge slowly. In contrast, tensor decomposition exhibits a simple
optimization landscape, which allows local search algorithms to converge to a
global (approximate) optimum much faster. In this paper, we propose the
FastTuckerPlus algorithm, which decomposes the original optimization problem
into two non-convex optimization problems and solves them alternately using the
Stochastic Gradient Descent method. Furthermore, we introduce cuFastTuckerPlus,
a fine-grained parallel algorithm designed for GPU platforms, leveraging the
performance of tensor cores. The algorithm minimizes memory-access overhead
and computational cost, surpassing state-of-the-art algorithms. Our
experimental results demonstrate that our method achieves a speedup of 3X to
5X compared to state-of-the-art algorithms.
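The alternating scheme described above, SGD on the factor matrices with the core fixed, then on the core with the factors fixed, can be sketched for a third-order sparse tensor as follows. This is a minimal plain-NumPy illustration, not the paper's implementation: the function name, ranks, learning rate, and regularization are illustrative assumptions, and the GPU tensor-core parallelization is omitted entirely.

```python
import numpy as np

def sgd_sparse_tucker(entries, shape, ranks, lr=0.02, lam=0.01,
                      epochs=20, seed=0):
    """Alternating SGD for sparse Tucker decomposition (illustrative sketch).

    entries: list of ((i, j, k), value) nonzeros of a 3rd-order sparse tensor.
    shape:   (I, J, K) tensor dimensions; ranks: (R1, R2, R3) Tucker ranks.
    Hyperparameters here are placeholders, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, r)) * 0.1 for n, r in zip(shape, ranks)]
    core = rng.standard_normal(ranks) * 0.1

    for _ in range(epochs):
        # Subproblem 1: update the factor matrices, core tensor held fixed.
        for (i, j, k), v in entries:
            rows = [factors[0][i], factors[1][j], factors[2][k]]
            pred = np.einsum('abc,a,b,c->', core, *rows)
            err = pred - v
            # Partial derivatives of pred w.r.t. each sampled factor row.
            g0 = np.einsum('abc,b,c->a', core, rows[1], rows[2])
            g1 = np.einsum('abc,a,c->b', core, rows[0], rows[2])
            g2 = np.einsum('abc,a,b->c', core, rows[0], rows[1])
            factors[0][i] -= lr * (err * g0 + lam * rows[0])
            factors[1][j] -= lr * (err * g1 + lam * rows[1])
            factors[2][k] -= lr * (err * g2 + lam * rows[2])
        # Subproblem 2: update the core tensor, factor matrices held fixed.
        for (i, j, k), v in entries:
            rows = [factors[0][i], factors[1][j], factors[2][k]]
            err = np.einsum('abc,a,b,c->', core, *rows) - v
            grad = np.einsum('a,b,c->abc', *rows)  # outer product of rows
            core -= lr * (err * grad + lam * core)
    return core, factors
```

Each SGD step touches only the factor rows and core entries associated with one sampled nonzero, which is what makes the scheme fine-grained enough to parallelize across a sparse tensor's nonzeros on a GPU.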