Revisiting Linpack Algorithm on Large-Scale CPU-GPU Heterogeneous Systems

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (2020)

Abstract
As the gap widens between GPU computing capability and that of other components (CPU, PCIe bus, and communication network), it is increasingly challenging to design high-performance parallel algorithms for large CPU-GPU heterogeneous systems. There are two main reasons. First, simply offloading the kernel library to the GPU incurs large-volume data transfers over the low-speed PCIe bus. Second, communication overhead over the network severely affects scalability. To address these issues, we advocate a paradigm shift to CPU-centric, fine-grained pipelining algorithm design. Taking the Linpack benchmark as a case study, we show the effectiveness of the new algorithm design paradigm. Our optimized Linpack program achieves 63.79 PFlops on 16384 GPUs. Its floating-point efficiency outperforms the NVIDIA proprietary counterparts by 5% on average.
Keywords
Linpack algorithm, software pipeline, heterogeneous system