Revisiting Linpack Algorithm on Large-Scale CPU-GPU Heterogeneous Systems

Chaoyang Shui, Xianzhi Yu, Yujin Yan, Yinshan Wang, Ke Meng, Guangming Tan

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (2020)

Abstract

As the gap widens between GPU computing capability and that of other components (CPU, PCIe bus, and communication network), it is increasingly challenging to design high-performance parallel algorithms for large CPU-GPU heterogeneous systems. There are two main reasons. First, simply offloading the kernel library to the GPU incurs large-volume data transfers over the low-speed PCIe bus. Second, communication overheads across the network severely affect scalability. To address these issues, we advocate a paradigm shift to CPU-centric, fine-grained pipelining algorithm design. Taking the Linpack benchmark as a case study, we show the effectiveness of the new algorithm design paradigm. Our optimized Linpack program achieves 63.79 PFlops on 16384 GPUs. Its floating-point efficiency exceeds that of the NVIDIA proprietary counterparts by 5% on average.

Keywords

Linpack algorithm, software pipeline, heterogeneous system
