Optimized HPL for AMD GPU and multi-core CPU usage

Matthias Bach, Matthias Kretz, Volker Lindenstruth, David Rohr

Computer Science - Research and Development (2011)

Abstract

The installation of the LOEWE-CSC (http://csc.uni-frankfurt.de/csc/?51) supercomputer at the Goethe University in Frankfurt led to the development of a Linpack implementation that can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for combined GPU and CPU usage was created. The DGEMM library is tuned to hide all DMA transfer times and thus maximize the GPU load. A work-stealing scheduler was implemented to add the remaining CPU resources to the DGEMM. On the GPU alone, the DGEMM achieves 497 GFlop/s (90.9% of the theoretical peak). Combined with the 24-core Magny-Cours CPUs, 623 GFlop/s (83.6% of the peak) are achieved. The HPL (http://www.netlib.org/benchmark/hpl/algorithm.html) benchmark was modified to perform well with one MPI process per node. The modifications include multi-threading, vectorization, use of the GPU DGEMM, cache optimizations, and a new lookahead algorithm. A Linpack performance of 70% of the theoretical peak is achieved, and this performance scales linearly to hundreds of nodes.
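
The efficiency figures above can be turned back into the theoretical peaks they imply. The following is a minimal sketch of that arithmetic, assuming only that efficiency is defined as achieved rate divided by theoretical peak. The function name `implied_peak` is hypothetical, and the printed values are derived from the rounded numbers in the abstract, not taken from hardware specifications.

```python
def implied_peak(achieved_gflops: float, efficiency_percent: float) -> float:
    """Return the theoretical peak (GFlop/s) implied by an achieved rate
    and the fraction of peak it represents (given in percent)."""
    if efficiency_percent <= 0:
        raise ValueError("efficiency must be positive")
    return achieved_gflops / (efficiency_percent / 100.0)


# Figures quoted in the abstract (rounded by the authors).
gpu_peak = implied_peak(497.0, 90.9)       # GPU-only DGEMM
combined_peak = implied_peak(623.0, 83.6)  # GPU plus CPU DGEMM

# Assumption: the combined peak is the sum of the GPU and CPU peaks,
# so the difference estimates the CPU share of the peak.
cpu_peak_estimate = combined_peak - gpu_peak

print(f"implied GPU peak:      {gpu_peak:.1f} GFlop/s")
print(f"implied combined peak: {combined_peak:.1f} GFlop/s")
print(f"estimated CPU share:   {cpu_peak_estimate:.1f} GFlop/s")
```

Because the inputs are rounded, the results are only accurate to about one GFlop/s and should be read as a consistency check on the reported percentages.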

Keywords

theoretical peak, performance scale, combined gpu, gpu load, multi-core cpu usage, amd gpu, linpack performance, gpu dgemm, cpu usage, dgemm library, fast dgemm, new lookahead algorithm, optimized hpl
