
Benchmarking GPUs to tune dense linear algebra

SC, pp. 1–11, 2008

Cited by 923 | Viewed 109
Indexed in EI and WOS

Abstract

We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80–90% of the peak GEMM rate. Our parallel LU running on two GPUs …

Introduction
  • The authors show LU, QR and Cholesky factorizations that achieve computational rates over 300 Gflop/s on a GPU
  • These are three of the most widely used factorizations in dense linear algebra and pave the way for implementing the entire LAPACK library [Anderson et al. 1990] on GPUs. The authors' results include performance on the 8-series of NVIDIA GPUs that was not previously attained in the 1.5 years since these GPUs became available.
  • In their approach the authors think of the GPU as a multithreaded vector unit, and their best algorithms were found to closely resemble earlier solutions for vector processors
Highlights
  • We show LU, QR and Cholesky factorizations that achieve computational rates over 300 Gflop/s on a GPU
  • Our results include performance on the 8-series of NVIDIA GPUs that was not previously attained in the 1.5 years since these GPUs became available
  • In our approach we think of the GPU as a multithreaded vector unit, and our best algorithms were found to closely resemble earlier solutions for vector processors
  • We have presented the fastest implementations of dense LU, QR and Cholesky factorizations running on one or two NVIDIA GPUs
  • We presented detailed benchmarks of the GPU memory system, kernel start-up costs, and arithmetic throughput, which are important to understanding the limits of performance of many algorithms including our own
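The kernel start-up cost mentioned in the last bullet can be measured with a very small microbenchmark. The sketch below is not the authors' benchmark code; it is a minimal CUDA program (the kernel name and launch count are arbitrary) that times many back-to-back launches of an empty kernel with CUDA events and reports the average asynchronous launch cost.

    #include <cstdio>
    #include <cuda_runtime.h>

    // An empty kernel: any time it appears to take is pure launch overhead.
    __global__ void empty_kernel() {}

    int main() {
        const int launches = 10000;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        empty_kernel<<<1, 1>>>();          // warm-up: the first launch includes one-time setup
        cudaDeviceSynchronize();

        cudaEventRecord(start);
        for (int i = 0; i < launches; ++i)
            empty_kernel<<<1, 1>>>();      // asynchronous back-to-back launches
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("average asynchronous launch cost: %.2f us\n", 1e3f * ms / launches);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }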
Methods
  • Consider evaluating the product C := C + AB, where A, B and C are m×k, k×n and m×n matrices respectively
  • Partition these matrices into M×K, K×N and M×N grids of bm×bk, bk×bn and bm×bn blocks.
  • There are M·N blocks in C, so in total these fetches consume M·N·K·bm·bk + M·N·K·bk·bn = m·n·k·(1/bn + 1/bm) words of bandwidth.
  • This is 2/(1/bn + 1/bm) times less than if no blocking is used, i.e. if bm = bn = bk = 1 (a numeric check of this reduction follows the list).
  • Blocks in A and B don't have to be square for this technique to work
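As a concrete check of the bandwidth formula above, the small host-side sketch below evaluates the blocked and unblocked traffic for one illustrative choice of sizes; the matrix dimensions and the 64×16 block shape are examples, not values taken from the paper.

    #include <cstdio>

    // Words fetched from memory when C is partitioned into bm x bn blocks:
    // m*n*k*(1/bn + 1/bm). The bk term cancels, so only bm and bn matter.
    double blocked_traffic(double m, double n, double k, double bm, double bn) {
        return m * n * k * (1.0 / bn + 1.0 / bm);
    }

    int main() {
        double m = 4096, n = 4096, k = 4096;
        double unblocked = 2.0 * m * n * k;                    // bm = bn = bk = 1
        double blocked   = blocked_traffic(m, n, k, 64, 16);   // e.g. 64x16 blocks of C
        printf("unblocked: %.3g words, blocked: %.3g words, reduction: %.1fx\n",
               unblocked, blocked, unblocked / blocked);
        // The reduction equals 2/(1/bn + 1/bm), i.e. 25.6x for 64x16 blocks.
        return 0;
    }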
Results
  • Performance Results and Analysis: Table 2 shows the fractions of peak and Figures 5 and 6 show the absolute rates achieved in the implementations of matrix-matrix multiplies.
  • This is over 2× slower than achieved in the authors' code
  • They follow traditional guidelines such as those outlined in the CUDA programming guide, e.g. using longer vectors, optimizing for fewer registers per scalar thread, targeting higher occupancy, and using shared memory as the primary local storage space.
  • CUBLAS's code uses 2× fewer registers per scalar thread and runs at a higher occupancy of 2× more warps per core
  • It is nevertheless 1.6× slower; a register-blocked kernel sketch in this spirit follows this list
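To make the contrast with the shared-memory-centric style concrete, here is a minimal register-blocked SGEMM kernel in the spirit of the approach described above. It is a sketch, not the authors' kernel: it assumes column-major storage, a 64×16 tile of C per thread block with each of the 64 threads accumulating one row strip of C in registers, a 16×16 tile of B staged through shared memory, and matrix dimensions that are multiples of the block sizes.

    // Launch as: sgemm_block<<<dim3(m / 64, n / 16), 64>>>(A, B, C, m, n, k);
    __global__ void sgemm_block(const float *A, const float *B, float *C,
                                int m, int n, int k)
    {
        __shared__ float Bs[16][17];               // 16x16 tile of B (+1 column to avoid bank conflicts)

        int row  = blockIdx.x * 64 + threadIdx.x;  // one row of C per thread
        int col0 = blockIdx.y * 16;                // first of the 16 columns handled by this block

        float c[16] = {0.0f};                      // 1x16 strip of C kept in registers

        for (int kb = 0; kb < k; kb += 16) {
            // Cooperatively load the 16x16 tile of B into shared memory.
            for (int i = threadIdx.x; i < 16 * 16; i += 64)
                Bs[i % 16][i / 16] = B[(kb + i % 16) + (col0 + i / 16) * k];
            __syncthreads();

            // Rank-16 update: stream A from global memory, accumulate C in registers.
            for (int kk = 0; kk < 16; ++kk) {
                float a = A[row + (kb + kk) * m];
                #pragma unroll
                for (int j = 0; j < 16; ++j)
                    c[j] += a * Bs[kk][j];
            }
            __syncthreads();
        }

        for (int j = 0; j < 16; ++j)
            C[row + (col0 + j) * m] += c[j];
    }

Keeping the partial sums of C in registers raises the register count per thread and lowers occupancy, but it reduces shared-memory traffic in the inner loop, which is the trade-off discussed in the bullets above.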
Conclusion
  • The authors have presented the fastest implementations of dense LU, QR and Cholesky factorizations running on one or two NVIDIA GPUs
  • Based on the performance benchmarking and modeling, these implementations attain 80%–90% of the peak speeds possible for large matrices
  • This speed was achieved by carefully choosing optimizations to match the capabilities of the hardware, including using the CPU in parallel with the GPU to perform panel factorizations, which are dominated by BLAS1 and BLAS2 operations that run faster on the CPU; a sketch of this overlap pattern follows this list
  • The authors highlighted some of the optimization guidelines, such as using shorter vectors at the program level and using the register file as the primary on-chip storage space
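The CPU/GPU overlap mentioned above follows a simple pattern: enqueue the GPU trailing-matrix update asynchronously, do the BLAS1/BLAS2-heavy panel factorization on the CPU while the GPU works, and synchronize only when the panel result is needed. The sketch below is not the authors' code; it assumes cuBLAS for the GPU GEMM, uses an unblocked, unpivoted toy panel factorization as a stand-in for the real CPU routine, and picks illustrative sizes.

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    // Placeholder for the CPU-side panel factorization (unblocked LU, no pivoting).
    static void cpu_panel_factorize(std::vector<float>& panel, int m, int nb) {
        for (int j = 0; j < nb; ++j)
            for (int i = j + 1; i < m; ++i) {
                panel[i + j * m] /= panel[j + j * m];
                for (int c = j + 1; c < nb; ++c)
                    panel[i + c * m] -= panel[i + j * m] * panel[j + c * m];
            }
    }

    int main() {
        const int n = 2048, nb = 64;               // trailing-matrix and panel widths (illustrative)
        std::vector<float> panel(n * nb, 1.0f);
        for (int i = 0; i < nb; ++i)
            panel[i + i * n] = nb + 1.0f;          // keep the toy panel diagonally dominant

        // Device buffers for a toy trailing-matrix update C := C - A * B.
        float *dA, *dB, *dC;
        cudaMalloc(&dA, n * nb * sizeof(float));
        cudaMalloc(&dB, nb * n * sizeof(float));
        cudaMalloc(&dC, n * n  * sizeof(float));
        cudaMemset(dA, 0, n * nb * sizeof(float));
        cudaMemset(dB, 0, nb * n * sizeof(float));
        cudaMemset(dC, 0, n * n  * sizeof(float));

        cublasHandle_t handle;
        cublasCreate(&handle);
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cublasSetStream(handle, stream);

        const float alpha = -1.0f, beta = 1.0f;
        // 1. Enqueue the trailing-matrix update on the GPU; the call returns immediately.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, nb,
                    &alpha, dA, n, dB, nb, &beta, dC, n);

        // 2. Factorize the next panel on the CPU while the GPU GEMM is in flight.
        cpu_panel_factorize(panel, n, nb);

        // 3. Wait for the GPU only when the panel result is needed on the device.
        cudaStreamSynchronize(stream);
        printf("panel(1,0) after factorization: %f\n", panel[1]);

        cublasDestroy(handle);
        cudaStreamDestroy(stream);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }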
Tables
  • Table 1: The list of the GPUs used in this study. SP is single precision and DP is double precision. Smem is shared memory. Peak flop rates are shown for multiply-and-add operations. Flops:word is the ratio of the peak Gflop/s rate to the pin-memory bandwidth in words
  • Table 2: The estimated and the best observed rates in matrix-matrix multiply routines, shown as a fraction of the peak
  • Table 3: Details of our code and the code in CUBLAS 1.1. Instruction counts are for the inner loop only and were obtained using decuda. A's 64×1 blocks are given as defined in the C-level program. This block size is increased when compiling by unrolling the loop and assigning the blocks fetched in different iterations to different registers
  • Table 4: Comparison of the best Gflop/s rates in the CPU and GPU versions and the best speedup vs. the CPU-alone versions. SGEMM …
Funding
  • Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor’s implementation and approaches the peak of hardware capabilities
  • We were able to achieve 98% of the arithmetic peak in registerto-register multiply-and-add instructions
  • The breakdown shows that up to 90% of the runtime is consumed by computing on the GPU and about 10% of this time overlaps with computing on the CPU
  • A surprisingly large speedup (up to 30%) was obtained by performing the triangular solve via multiplying by the inverse matrix; a host-side sketch of this trick follows this list
  • The effect of all optimizations decreases at larger problem sizes, where the time is dominated by matrix-matrix multiplies. The rates in these multiplies are affected by using 2-level schemes in LU and Cholesky and by using autotuning to choose the block size in QR. These techniques gave up to a 47% speedup and factored in only for n > 4096
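The triangular-solve trick mentioned above replaces a solve against a small lower-triangular diagonal block with a matrix multiply by its explicit inverse, which the GPU runs at near-GEMM speed. The host-side sketch below only illustrates the algebra; the function names, the tiny sizes, and the column-major convention are assumptions, and a real GPU code would do the final multiply with a GEMM-style kernel.

    #include <cstdio>
    #include <vector>

    // Invert a lower-triangular nb x nb block L (column-major) by forward-substituting
    // against the identity. Done once per diagonal block.
    static void invert_lower(const std::vector<float>& L, std::vector<float>& Linv, int nb) {
        for (int j = 0; j < nb; ++j)                // solve L * x = e_j, column by column
            for (int i = 0; i < nb; ++i) {
                float s = (i == j) ? 1.0f : 0.0f;
                for (int k = 0; k < i; ++k)
                    s -= L[i + k * nb] * Linv[k + j * nb];
                Linv[i + j * nb] = s / L[i + i * nb];
            }
    }

    // Triangular solve via multiplication: B := inv(L) * B, an nb x nb by nb x n product.
    static void trsm_via_inverse(const std::vector<float>& Linv, std::vector<float>& B,
                                 int nb, int n) {
        std::vector<float> X(nb * n, 0.0f);
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < nb; ++k)
                for (int i = 0; i < nb; ++i)
                    X[i + j * nb] += Linv[i + k * nb] * B[k + j * nb];
        B = X;
    }

    int main() {
        const int nb = 4, n = 3;                    // tiny sizes, for illustration only
        std::vector<float> L(nb * nb, 0.0f), Linv(nb * nb, 0.0f), B(nb * n, 1.0f);
        for (int i = 0; i < nb; ++i)
            for (int j = 0; j <= i; ++j)
                L[i + j * nb] = (i == j) ? 2.0f : 1.0f;   // a simple lower-triangular block
        invert_lower(L, Linv, nb);
        trsm_via_inverse(Linv, B, nb, n);           // B now holds inv(L) * B
        printf("B(0,0) = %f\n", B[0]);
        return 0;
    }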
References
  • ABTS, D., BATAINEH, A., SCOTT, S., FAANES, G., SCHWARZMEIER, J., LUNDBERG, E., JOHNSON, T., BYE, M., AND SCHWOERER, G. 2007. The Cray BlackWidow: A Highly Scalable Vector Multiprocessor, SC'07.
  • AGARWAL, R. C., AND GUSTAVSON, F. G. 1989. Vector and parallel algorithms for Cholesky factorization on IBM 3090, Supercomputing'89, 225–233.
  • ALVERSON, R., CALLAHAN, D., CUMMINGS, D., KOBLENZ, B., PORTERFIELD, A., AND SMITH, B. 1990. The Tera Computer System, ICS'90, 1–6.
  • AMD. 2006. ATI CTM Guide, version 1.01.
  • ANDERSON, E., BAI, Z., DONGARRA, J., GREENBAUM, A., MCKENNEY, A., DU CROZ, J., HAMMERLING, S., DEMMEL, J., BISCHOF, C., AND SORENSEN, D. 1990. LAPACK: a portable linear algebra library for high-performance computers, Supercomputing'90, 2–11.
  • ANDERSON, E., BRANDT, M., AND YANG, C. 2004. LINPACK Benchmark Optimizations on a Virtual Processor Grid, in Cray User Group 2004 Proceedings.
  • BABOULIN, M., DONGARRA, J., AND TOMOV, S. 2008. Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures, Technical Report UT-CS-08-200, University of Tennessee, May 6, 2008 (also LAPACK Working Note 200).
  • BARRACHINA, S., CASTILLO, M., IGUAL, F. D., MAYO, R., AND QUINTANA-ORTI, E. S. 2008. Solving Dense Linear Systems on Graphics Processors, Technical Report ICC 02-02-2008, Universidad Jaime I, February 2008.
  • BASKARAN, M., BONDHUGULA, U., KRISHNAMOORTHY, S., RAMANUJAM, J., ROUNTEV, A., AND SADAYAPPAN, P. 2008. A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs, ISC'08.
  • BISCHOF, C. H., AND LACROUTE, P. G. 1990. An adaptive blocking strategy for matrix factorization, in Proceedings of the Joint International Conference on Vector and Parallel Processing, 210–221.
  • […] QUINTANA-ORTI, E. S., QUINTANA-ORTI, G., VAN DE GEIJN, R., AND VAN ZEE, F. G. 2008. Making Programming Synonymous with Programming for Linear Algebra Libraries, FLAME Working Note #31, The University of Texas at Austin, Department of Computer Sciences, Technical Report TR-08-20, April 17, 2008.
  • CHOI, J., DONGARRA, J. J., OSTROUCHOV, L. S., PETITET, A. P., WALKER, D. W., AND WHALEY, R. C. 1996. The Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, Scientific Programming 5, 3, 173–184 (also LAPACK Working Note 80).
  • DONGARRA, J., DUFF, I. S., SORENSEN, D. C., AND VAN DER VORST, H. A. 1998. Numerical Linear Algebra for High-Performance Computers, SIAM.
  • DONGARRA, J. J., DU CROZ, J., HAMMARLING, S., AND DUFF, I. 1990. A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software 16, 1, 1–17.
  • DONGARRA, J., AND OSTROUCHOV, S. 1990. LAPACK Block Factorization Algorithms on the Intel iPSC/860, Technical Report CS-90-115, University of Tennessee (also LAPACK Working Note 24).
  • GALOPPO, N., GOVINDARAJU, N. K., HENSON, M., AND MANOCHA, D. 2005. LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware, SC'05.
  • GOVINDARAJU, N. K., LARSEN, S., GRAY, J., AND MANOCHA, D. 2006. A Memory Model for Scientific Algorithms on Graphics Processors, SC'06.
  • FATAHALIAN, K., SUGERMAN, J., AND HANRAHAN, P. 2004. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication, in Graphics Hardware 2004, 133–137.
  • […] Symposium on Principles and Practice of Parallel Programming, ACM Press, 2008, 73–82.