Benchmarking GPUs to tune dense linear algebra
SC '08, pp. 1–11 (2008)
We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80–90% of the peak GEMM rate. Our parallel LU running on t…
- The authors show LU, QR and Cholesky factorizations that achieve computational rates over 300 Gflop/s on a GPU
- These are three of the most widely used factorizations in dense linear algebra, and they pave the way for an implementation of the entire LAPACK library [Anderson et al. 1990] on GPUs. The authors' results include performance on the 8-series of NVIDIA GPUs that was not previously attained in the 1.5 years since these GPUs became available.
- In their approach the authors treat the GPU as a multithreaded vector unit, and the best algorithms were found to closely resemble earlier solutions for vector processors
- We have presented the fastest implementations of dense LU, QR and Cholesky factorizations running on a single NVIDIA GPU or on a pair of GPUs
- We presented detailed benchmarks of the GPU memory system, kernel start-up costs, and arithmetic throughput, which are important to understanding the limits of performance of many algorithms including our own
- Consider evaluating the product C := C + AB, where A, B and C are m×k, k×n and m×n matrices respectively
- Partition these matrices into M×K, K×N and M×N grids of bm×bk, bk×bn and bm×bn blocks.
- There are MN blocks in C; each is updated K times, and each update fetches one bm×bk block of A and one bk×bn block of B, so in total these fetches consume MNK·bm·bk + MNK·bk·bn = mnk(1/bn + 1/bm) words of bandwidth.
- This is 2/(1/bn + 1/bm) times less than if no blocking is used, i.e. if bm = bn = bk = 1.
- Blocks in A and B don't have to be square for this technique to work
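The bandwidth count above can be checked with a back-of-the-envelope simulation. The sketch below (plain Python, not the paper's GPU code) counts the words a blocked GEMM streams from memory, assuming each C block stays on chip while blocks of A and B are fetched; the dimensions and block sizes are arbitrary illustrative choices.

```python
# Count memory traffic for blocked C := C + A*B, assuming each bm x bn
# block of C is held on chip for the whole sweep over the shared dimension.

def gemm_traffic(m, n, k, bm, bn, bk):
    """Words fetched from memory by a blocked matrix multiply."""
    M, N, K = m // bm, n // bn, k // bk
    words = 0
    for i in range(M):          # block row of C
        for j in range(N):      # block column of C
            for p in range(K):  # sweep over the shared dimension
                words += bm * bk  # fetch one block of A
                words += bk * bn  # fetch one block of B
    return words

m = n = k = 512
bm, bn, bk = 32, 32, 8
traffic = gemm_traffic(m, n, k, bm, bn, bk)

# Matches the closed form mnk*(1/bn + 1/bm) ...
assert traffic == m * n * k * (1 / bn + 1 / bm)
# ... and is 2/(1/bn + 1/bm) times less than the unblocked 2*m*n*k words.
print(2 * m * n * k / traffic)  # -> 32.0
```

Note that bk cancels out of the total: the block size along the shared dimension does not affect bandwidth, only bm and bn do, which is why the blocks need not be square.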
- Performance Results and Analysis
Table 2 shows the fractions of peak, and Figures 5 and 6 show the absolute rates achieved in the implementations of matrix-matrix multiply.
- This is over 2× slower than what is achieved in our code
- They follow traditional guidelines such as those outlined in the CUDA programming guide, e.g. use longer vectors, optimize for fewer registers per scalar thread, aim for higher occupancy, and use shared memory as the primary local storage space.
- CUBLAS's code uses 2× fewer registers per scalar thread and runs at a higher occupancy of 2× more warps per core
- It is 1.6× slower.
- The authors have presented the fastest implementations of dense LU, QR and Cholesky factorizations running on a single NVIDIA GPU or on a pair of GPUs. Based on performance benchmarking and modeling, they attain 80–90% of the peak speeds possible for large matrices.
- This speed was achieved by carefully choosing optimizations to match the capabilities of the hardware, including using the CPU in parallel with the GPU to perform panel factorizations, which are dominated by BLAS1 and BLAS2 operations done faster on the CPU.
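The CPU/GPU work split described above can be illustrated with a minimal NumPy sketch of a right-looking blocked LU (no pivoting, and no actual GPU: both parts run on the host here). The comments mark which step the paper assigns to the CPU (the BLAS-1/2-heavy panel) and which to the GPU (the GEMM-heavy trailing update); the real code additionally overlaps the two with look-ahead, which this sequential sketch omits. All sizes are illustrative.

```python
import numpy as np

def blocked_lu(A, b=2):
    """Right-looking blocked LU without pivoting (illustration only)."""
    A = A.copy()
    n = A.shape[0]
    for j in range(0, n, b):
        jb = min(b, n - j)
        # Panel factorization of the tall panel A[j:, j:j+jb]:
        # BLAS-1/2 heavy -> done on the CPU in the paper's scheme.
        for k in range(j, j + jb):
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:j+jb] -= np.outer(A[k+1:, k], A[k, k+1:j+jb])
        if j + jb < n:
            # Triangular solve for U12, then the GEMM-heavy trailing
            # update -> done on the GPU in the paper's scheme.
            L11 = np.tril(A[j:j+jb, j:j+jb], -1) + np.eye(jb)
            A[j:j+jb, j+jb:] = np.linalg.solve(L11, A[j:j+jb, j+jb:])
            A[j+jb:, j+jb:] -= A[j+jb:, j:j+jb] @ A[j:j+jb, j+jb:]
    return A

rng = np.random.default_rng(1)
A = rng.random((8, 8)) + 8 * np.eye(8)  # diagonally dominant: safe w/o pivoting
F = blocked_lu(A, b=2)
L, U = np.tril(F, -1) + np.eye(8), np.triu(F)
assert np.allclose(L @ U, A)  # factors reproduce A
```

As the problem grows, the trailing update dominates the flop count, which is why offloading only the update already captures most of the GPU's GEMM speed.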
- The authors highlighted some optimization guidelines, such as using shorter vectors at the program level and using the register file as the primary on-chip storage space
- Table 1: The list of GPUs used in this study. SP is single precision and DP is double precision. Smem is shared memory. Peak flop rates are shown for multiply-and-add operations. Flops:word is the ratio of the peak Gflop/s rate to pin-memory bandwidth in words
- Table 2: The estimated and the best observed rates in matrix-matrix multiply routines, shown as fractions of the peak
- Table 3: Details of our code and the code in CUBLAS 1.1. Instruction counts are for the inner loop only and were obtained using decuda. A's 64×1 blocks are given as defined in the C-level program. This block size is increased when compiling by unrolling the loop and assigning the blocks fetched in different iterations to different registers
- Table 4: Comparison of the best Gflop/s rates in the CPU and GPU versions and the best speedup vs. the CPU-alone versions
- Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor’s implementation and approaches the peak of hardware capabilities
- We were able to achieve 98% of the arithmetic peak in register-to-register multiply-and-add instructions
- The breakdown shows that up to 90% of the runtime is consumed by computing on the GPU and about 10% of this time overlaps with computing on the CPU
- A surprisingly large speedup (up to 30%) was obtained by performing triangular solve via multiplying by the inverse matrix
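The substitution behind that speedup can be shown with a small NumPy stand-in (not the paper's GPU kernels): a triangular solve parallelizes poorly, but once the inverse of the triangular block is precomputed, every subsequent solve becomes a matrix multiply and runs at GEMM speed. The sizes and the well-conditioned test matrix below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
# A well-conditioned lower-triangular matrix (dominant diagonal).
T = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)
B = rng.standard_normal((n, n))

X_solve = np.linalg.solve(T, B)  # conventional triangular solve
Tinv = np.linalg.inv(T)          # invert once, up front
X_mult = Tinv @ B                # ...then every solve is a GEMM

# For well-conditioned triangles the two agree to rounding error.
assert np.allclose(X_solve, X_mult)
```

The trade-off is numerical: multiplying by an explicit inverse is less robust than a solve for ill-conditioned triangles, so the trick pays off when the triangular blocks are small and well behaved and the solve is repeated against many right-hand sides.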
- The effect of all optimizations decreases at larger problem sizes, where time is dominated by matrix-matrix multiplies. Rates in these multiplies are affected by using two-level blocking schemes in LU and Cholesky and by using autotuning to choose the block size in QR. These techniques gave up to 47% speedup and factored in only for n > 4096
- ABTS, D., BATAINEH, A., SCOTT, S., FAANES, G., SCHWARZMEIER, J., LUNDBERG, E., JOHNSON, T., BYE, M., AND SCHWOERER, G. 2007. The Cray BlackWidow: A Highly Scalable Vector Multiprocessor, SC’07.
- AGARWAL, R. C., AND GUSTAVSON, F. G. 1989. Vector and parallel algorithms for Cholesky factorization on IBM 3090, Supercomputing'89, 225–233.
- ALVERSON, R., CALLAHAN, D., CUMMINGS, D., KOBLENZ, B., PORTERFIELD, A., AND SMITH, B. 1990. The Tera Computer System, ICS’90, 1–6.
- AMD. 2006. ATI CTM Guide, version 1.01.
- ANDERSON, E., BAI, Z., DONGARRA, J., GREENBAUM, A., MCKENNEY, A., DU CROZ, J., HAMMARLING, S., DEMMEL, J., BISCHOF, C., AND SORENSEN, D. 1990. LAPACK: a portable linear algebra library for high-performance computers, Supercomputing'90, 2–11.
- ANDERSON, E., BRANDT, M., AND YANG, C. 2004. LINPACK Benchmark Optimizations on a Virtual Processor Grid, in Cray User Group 2004 Proceedings.
- BABOULIN, M., DONGARRA, J., AND TOMOV, S. 2008. Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures, Technical Report UT-CS-08-200, University of Tennessee, May 6, 2008 (also LAPACK Working Note 200).
- BARRACHINA, S., CASTILLO, M., IGUAL, F. D., MAYO, R., AND QUINTANA-ORTI, E. S. 2008. Solving Dense Linear Systems on Graphics Processors, Technical Report ICC 02-02-2008, Universidad Jaime I, February 2008.
- BASKARAN, M., BONDHUGULA, U., KRISHNAMOORTHY, S., RAMANUJAM, J., ROUNTEV, A., AND SADAYAPPAN, P. 2008. A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs, ICS'08.
- BISCHOF, C. H., AND LACROUTE, P. G. 1990. An adaptive blocking strategy for matrix factorization, in Proceedings of the Joint International Conference on Vector and Parallel Processing, 210–221.
- QUINTANA-ORTI, E. S., QUINTANA-ORTI, G., VAN DE GEIJN, R., AND VAN ZEE, F. G. 2008. Making Programming Synonymous with Programming for Linear Algebra Libraries, FLAME Working Note #31, The University of Texas at Austin, Department of Computer Sciences, Technical Report TR-08-20, April 17, 2008.
- CHOI, J., DONGARRA, J. J., OSTROUCHOV, L. S., PETITET, A. P., WALKER, D. W., AND WHALEY, R. C. 1996. The Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, Scientific Programming 5, 3, 173–184 (also LAPACK Working Note 80).
- DONGARRA, J., DUFF, I. S., SORENSEN, D. C., AND VAN DER VORST, H. A. 1998. Numerical Linear Algebra for High-Performance Computers, SIAM.
- DONGARRA, J. J., DU CROZ, J., HAMMARLING, S., AND DUFF, I. 1990. A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software 16, 1, 1–17.
- DONGARRA, J., AND OSTROUCHOV, S. 1990. LAPACK Block Factorization Algorithms on the Intel iPSC/860, Technical Report CS-90-115, University of Tennessee (also LAPACK Working Note 24).
- GALOPPO, N., GOVINDARAJU, N. K., HENSON, M., AND MANOCHA, D. 2005. LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware, SC'05.
- GOVINDARAJU, N. K., LARSEN, S., GRAY, J., AND MANOCHA, D. 2006. A Memory Model for Scientific Algorithms on Graphics Processors, SC'06.
- FATAHALIAN, K., SUGERMAN, J., AND HANRAHAN, P. 2004. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication, in Graphics Hardware 2004, 133–137.
- Symposium on Principles and Practice of Parallel Programming, ACM Press, 2008, 73–82.