Accelerating Sparse Cholesky Factorization on GPUs

Parallel Computing (2016)

Abstract
Sparse direct factorization is a fundamental tool in scientific computing. As the major component of a sparse direct solver, it represents the dominant computational cost for many analyses. While the substantial computational capability provided by GPUs (Graphics Processing Units) can help alleviate this cost, many aspects of sparse factorization and GPU computing, most particularly the prevalence of small/irregular dense math and slow PCIe communication, make it challenging to fully utilize this resource. In this paper we describe a supernodal Cholesky factorization algorithm which permits improved utilization of the GPU when factoring sparse matrices. The central idea is to stream branches of the elimination tree (subtrees which terminate in leaves) through the GPU and perform the factorization of each branch entirely on the GPU. This avoids the majority of the PCIe communication without the need for a complex task scheduler. Importantly, within these branches, many independent, small, dense operations are batched to minimize kernel launch overhead and several of these batched kernels are executed concurrently to maximize device utilization. Supernodes towards the root of the elimination tree (where a branch involving that supernode would exceed device memory) typically involve sufficient dense math such that PCIe communication can be effectively hidden, GPU utilization is high and hybrid computing can be easily leveraged. Performance results for commonly studied matrices are presented along with suggested actions for further optimizations.
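The central scheduling idea described above — identifying "branches" of the elimination tree (subtrees ending in leaves) small enough to be streamed through and factored entirely on the GPU, while supernodes nearer the root are handled with hybrid CPU/GPU computing — can be illustrated with a small sketch. The function below is a hypothetical, simplified illustration, not the paper's implementation: it takes an elimination tree as a parent array and per-supernode working-set sizes, and greedily partitions the tree into GPU-resident branches versus root-side supernodes whose subtrees would exceed device memory.

```python
def select_branches(parent, mem, budget):
    """Partition an elimination tree into 'branches' (subtrees whose
    total working memory fits within the GPU memory budget) and
    root-side supernodes handled by hybrid CPU/GPU computing.

    parent[i] -- parent of supernode i in the elimination tree, -1 for a root
    mem[i]    -- working-set size of supernode i (arbitrary units)
    budget    -- available device memory (same units)
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    roots = []
    for i, p in enumerate(parent):
        (children[p] if p >= 0 else roots).append(i)

    # Total working-set size of each subtree, computed bottom-up.
    subtree_mem = [0] * n

    def visit(v):
        subtree_mem[v] = mem[v] + sum(visit(c) for c in children[v])
        return subtree_mem[v]

    for r in roots:
        visit(r)

    # Walk down from the roots: the first node whose entire subtree
    # fits in the budget becomes a branch streamed to the GPU; nodes
    # above that point stay in the hybrid CPU/GPU phase.
    branches, hybrid = [], []
    stack = list(roots)
    while stack:
        v = stack.pop()
        if subtree_mem[v] <= budget:
            branches.append(v)
        else:
            hybrid.append(v)
            stack.extend(children[v])
    return branches, hybrid
```

For example, a five-supernode tree with unit costs and a budget of 3 yields two GPU branches rooted at the mid-level supernodes and leaves only the tree root to the hybrid phase. In a real implementation, the dense factorization kernels inside each branch would additionally be batched and launched concurrently, as the abstract describes.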
Keywords
parallel programming, multicore