Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods

ACM Trans. Math. Softw. (2017)

Abstract

In this article, we explore the implementation of complex matrix multiplication. We begin by briefly identifying various challenges associated with the conventional approach, which calls for a carefully written kernel that implements complex arithmetic at the lowest possible level (i.e., assembly language). We then set out to develop a method of complex matrix multiplication that avoids the need for complex kernels altogether. This constraint promotes code reuse and portability within libraries such as Basic Linear Algebra Subprograms and BLAS-Like Library Instantiation Software (BLIS) and allows kernel developers to focus their efforts on fewer and simpler kernels. We develop two alternative approaches—one based on the 3m method and one that reflects the classic 4m formulation—each with multiple variants, all of which rely only on real matrix multiplication kernels. We discuss the performance characteristics of these “induced” methods and observe that the assembly-level method actually resides along the 4m spectrum of algorithmic variants. Implementations are developed within the BLIS framework, and testing on modern hardware confirms that while the less numerically stable 3m method yields the fastest runtimes, the more stable (and thus widely applicable) 4m method’s performance is somewhat limited due to implementation challenges that appear inherent in nature.
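The two formulations named in the abstract can be illustrated with a minimal sketch. It is not the BLIS implementation: the function names (`real_matmul`, `cgemm_4m`, `cgemm_3m`) are hypothetical, matrices are plain Python lists split into real and imaginary parts, and `real_matmul` merely stands in for an optimized real matrix multiplication kernel. Writing A = Ar + i·Ai and B = Br + i·Bi, the 4m formulation uses four real products, while the 3m formulation trades one product for extra additions, which is the usual source of its weaker numerical stability.

```python
def real_matmul(a, b):
    """Plain real matrix product; a stand-in for an optimized real kernel."""
    inner = len(b)
    return [[sum(a[i][p] * b[p][j] for p in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_sub(a, b):
    """Elementwise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def cgemm_4m(ar, ai, br, bi):
    """Complex product via four real products (4m formulation).

    Cr = Ar*Br - Ai*Bi
    Ci = Ar*Bi + Ai*Br
    """
    cr = mat_sub(real_matmul(ar, br), real_matmul(ai, bi))
    ci = mat_add(real_matmul(ar, bi), real_matmul(ai, br))
    return cr, ci


def cgemm_3m(ar, ai, br, bi):
    """Complex product via three real products (3m formulation).

    T1 = Ar*Br, T2 = Ai*Bi, T3 = (Ar + Ai)*(Br + Bi)
    Cr = T1 - T2
    Ci = T3 - T1 - T2
    """
    t1 = real_matmul(ar, br)
    t2 = real_matmul(ai, bi)
    t3 = real_matmul(mat_add(ar, ai), mat_add(br, bi))
    return mat_sub(t1, t2), mat_sub(mat_sub(t3, t1), t2)


# Example inputs: A = [[1+2i, 3-1i], [1i, 2]], B = [[1, 1i], [2+1i, 1-1i]]
ar, ai = [[1, 3], [0, 2]], [[2, -1], [1, 0]]
br, bi = [[1, 0], [2, 1]], [[0, 1], [1, -1]]
cr4, ci4 = cgemm_4m(ar, ai, br, bi)
cr3, ci3 = cgemm_3m(ar, ai, br, bi)
```

With exact integer inputs both functions return the same real and imaginary parts. In floating point the 3m result can differ slightly, because the imaginary part is recovered by subtracting two products from a third.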

Keywords

Linear algebra, DLA, high-performance, complex, matrix, multiplication, micro-kernel, kernel, BLAS, BLIS, 3m, 4m, induced
