Revisiting Linpack Algorithm on Large-Scale CPU-GPU Heterogeneous Systems
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (2020)
Abstract
As the gap widens between GPU computing capability and that of other components (CPU, PCIe bus, and communication network), it is increasingly challenging to design high-performance parallel algorithms for large CPU-GPU heterogeneous systems. There are two main reasons. First, simply offloading the kernel library to the GPU incurs large-volume data transfers over the low-speed PCIe bus. Second, communication overhead over the network severely affects scalability. To address these issues, we advocate a paradigm shift to CPU-centric, fine-grained pipelining algorithm design. Taking the Linpack benchmark as a case study, we show the effectiveness of the new design paradigm. Our optimized Linpack program achieves 63.79 PFlops on 16384 GPUs. Its floating-point efficiency outperforms the NVIDIA proprietary counterparts by 5% on average.
Keywords
Linpack algorithm, software pipeline, heterogeneous system