Level-3 BLAS on the TI C6678 Multi-core DSP

Computer Architecture and High Performance Computing (2012)

Abstract
Digital signal processors (DSPs) are commonly employed in embedded systems. Growing processing demands in cellular base stations, radio controllers, and industrial and medical imaging systems have led to the development of multi-core DSPs and to the inclusion of floating-point operations, all while maintaining low power dissipation. The eight-core DSP from Texas Instruments, codenamed TMS320C6678, provides a peak performance of 128 GFLOPS (single precision) and an effective 32 GFLOPS (double precision) for only 10 watts. In this paper, we present the first complete implementation of the Level-3 Basic Linear Algebra Subprograms (BLAS) routines for this DSP and report their performance. These routines are first optimized for a single core and then parallelized across the cores using OpenMP constructs. The results show that we can achieve about 8 single-precision GFLOPS/watt and 2.2 double-precision GFLOPS/watt for general matrix-matrix multiplication (GEMM). The performance of the remaining Level-3 BLAS routines is within 90% of the corresponding GEMM routines.
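To make the "optimize for one core, then parallelize with OpenMP" strategy concrete, the following is a minimal sketch of a cache-blocked single-precision GEMM whose row panels are distributed across threads. It is not the paper's implementation: the function name `sgemm_blocked`, the helper `min_sz`, the row-major layout, and the block size `BLOCK` are all assumptions chosen for illustration, and a real C6678 kernel would rely on hand-tuned inner kernels and explicit on-chip memory management that are not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block size; a real kernel would tune it to on-chip memory. */
#define BLOCK 32

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Computes C = alpha*A*B + beta*C for row-major matrices:
 * A is m x k, B is k x n, C is m x n. */
void sgemm_blocked(size_t m, size_t n, size_t k, float alpha,
                   const float *A, const float *B, float beta, float *C)
{
    assert(A && B && C);

    /* Each thread owns disjoint row panels of C, so no locking is needed.
     * Without OpenMP enabled the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for schedule(static)
    for (long ii = 0; ii < (long)m; ii += BLOCK) {
        size_t i_begin = (size_t)ii;
        size_t i_end = min_sz(i_begin + BLOCK, m);

        /* Scale the panel of C by beta; beta == 0 overwrites C without reading it. */
        for (size_t i = i_begin; i < i_end; ++i)
            for (size_t j = 0; j < n; ++j)
                C[i * n + j] = (beta == 0.0f) ? 0.0f : beta * C[i * n + j];

        /* Walk blocks of the k and n dimensions so the working set stays small. */
        for (size_t pp = 0; pp < k; pp += BLOCK) {
            size_t p_end = min_sz(pp + BLOCK, k);
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t j_end = min_sz(jj + BLOCK, n);
                for (size_t i = i_begin; i < i_end; ++i) {
                    for (size_t p = pp; p < p_end; ++p) {
                        float a = alpha * A[i * k + p];
                        /* Unit-stride inner loop over a row of B and C. */
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[p * n + j];
                    }
                }
            }
        }
    }
}
```

Partitioning over rows of C is only one possible choice; it keeps each thread's output disjoint, which is why the sketch needs no synchronization inside the parallel loop.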
Keywords
digital signal processing chips, embedded systems, linear algebra, matrix multiplication, message passing, power aware computing, OpenMP construct, TI C6678 multicore DSP, TMS320C6678, Texas Instruments, cellular base-station, digital signal processor, embedded system, floating point operation, general matrix-matrix multiplication, industrial imaging system, level-3 basic linear algebra subprograms routine, low power dissipation, medical imaging system, power 10 W, radio controller, BLAS, DSPs