Characterize and Optimize Dense Linear Solver on Multi-core CPUs.

International Conference on Parallel and Distributed Systems (2023)

Abstract
The dense linear solver is an essential subroutine in high-performance computing. Typical parallel implementations adopt either the fork-join or the task-parallel programming model. Blocked algorithms built upon the fork-join paradigm focus on optimizing cache locality but incur significant synchronization overhead. Following the data-driven execution model, tile-based algorithms built on the task-parallel paradigm effectively relieve this overhead and exhibit superior load balancing. Nevertheless, they introduce redundant memory accesses that hamper CPU execution. In this paper, we first characterize and quantify the impact of these performance bottlenecks in depth, and then propose a series of optimizations. Specifically, we reduce the idle time of threads by merging LU factorization with the subsequent lower triangular solve to improve parallelism. Moreover, we eliminate the tile-based matrix format transformation and reduce duplicated data-packing operations to lower memory access overhead. Performance evaluation is conducted on two modern multi-core systems, Intel Xeon Gold 6252N and HiSilicon Kunpeng 920. The results demonstrate the superiority of our proposed solver over state-of-the-art open-source implementations, achieving performance gains of up to 11.5% and 12.2% on the respective platforms.
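For context, the pipeline the paper optimizes is the standard dense-solve sequence: LU factorization with partial pivoting followed by forward and backward triangular solves. The paper's contribution is a hand-tuned tile-based CPU implementation; the sketch below merely illustrates that pipeline using SciPy's LAPACK bindings, not the authors' code.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Illustrative only: solve Ax = b via LU factorization plus
# triangular solves, the same pipeline the paper accelerates.
rng = np.random.default_rng(0)
n = 512
A = rng.standard_normal((n, n))   # dense coefficient matrix
b = rng.standard_normal(n)        # right-hand side

lu, piv = lu_factor(A)            # PA = LU with partial pivoting (LAPACK getrf)
x = lu_solve((lu, piv), b)        # forward + backward substitution (LAPACK getrs)

print(np.allclose(A @ x, b))      # verify the residual is negligible
```

In the paper's tile-based setting, the factorization and the subsequent lower triangular solve are separate task graphs; merging them (as the abstract describes) lets triangular-solve tasks start as soon as their input tiles are factored, shortening thread idle time.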
Keywords
Parallelization, Intel Xeon, Matrix Formation, Lower Triangular, Idle Time, Open-source Implementation, Performance Bottleneck, LU Factorization, Coefficient Matrix, Input Matrix, Linear Algebra, Open-source Library, Specific Platform, Upper Triangular, Parallel Algorithm, Critical Path, Data Reuse, Supermatrix, Permutation Matrix, Remote Memory, Cache Hit, L2 Cache, Tile Size, Index Surgery, Data Cache, Performance Of Solver, Linearizable, Updated Matrix, Matrix Multiplication, Linear System