Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors

Jongsoo Park, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Alexander Heinecke, Dhiraj D. Kalamkar, Md. Mostofa Ali Patwary, Vadim O. Pirogov, Pradeep Dubey, Xing Liu, Carlos Rosales, Cyril Mazauric, Christopher Daley

Periodicals (2016)

Abstract

This paper presents optimizations in a high-performance conjugate gradient benchmark (HPCG) for multi-core Intel® Xeon® processors and many-core Xeon Phi coprocessors. Without careful optimization, the HPCG benchmark under-utilizes the compute resources available in modern processors because of its low arithmetic intensity and the difficulty of parallelizing the Gauss-Seidel smoother (GS). Our optimized implementation fuses GS with sparse matrix-vector multiplication (SpMV) to address the low arithmetic intensity, overcoming a performance limit otherwise set by memory bandwidth. This fusion optimization is progressively more effective in newer generations of Xeon processors, demonstrating the usefulness of their larger caches for sparse matrix operations: Sandy Bridge, Ivy Bridge, and Haswell processors achieve 93%, 99%, and 103%, respectively, of the ideal performance under the constraint that matrices are streamed from memory. Our implementation also parallelizes GS using fine-grain level scheduling, a method that has been believed not to scale to many cores. Our GS implementation scales to 60 cores in Xeon Phi coprocessors for the finest level of the multi-grid pre-conditioner. At the coarser levels, we address the limited parallelism using block multi-color re-ordering, achieving 21 GFLOPS with one Xeon Phi coprocessor. These optimizations distinguish our HPCG implementation from others that stream most of the data from main memory and rely on multi-color re-ordering for parallelism. Our optimized implementation has been evaluated on clusters with various configurations, and we find that low-diameter, high-radix network topologies such as Dragonfly achieve high parallelization efficiency because of fast all-reduce collectives. In addition, we demonstrate that our optimizations benefit not only the HPCG dataset, which is based on a structured 3D grid, but also a wide range of unstructured matrices.
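
The level-scheduling idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `compute_levels` and `gs_forward_sweep`, the CSR arrays, and the 2x2-grid example matrix are all assumptions made for illustration, and the sketch assumes a structurally symmetric matrix with a nonzero diagonal. Rows are grouped into levels so that every row in a level depends only on rows in earlier levels. The rows within one level can therefore be updated in parallel, while the result stays identical to a sequential forward sweep.

```python
def compute_levels(row_ptr, col_idx):
    """Group rows into levels. A row's level is one more than the highest
    level among the earlier rows (column index j < i) it depends on."""
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        levels[lv].append(i)
    return levels


def gs_forward_sweep(row_ptr, col_idx, vals, b, x, levels):
    """One forward Gauss-Seidel sweep that visits rows level by level.
    Assumes a structurally symmetric matrix with a nonzero diagonal."""
    for rows in levels:
        # Rows in one level are mutually independent; a real implementation
        # would run this inner loop in parallel. Here it is serial.
        for i in rows:
            diag = 0.0
            acc = b[i]
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    acc -= vals[k] * x[j]
            x[i] = acc / diag
    return x


# Hypothetical example: 5-point stencil on a 2x2 grid, stored in CSR form.
row_ptr = [0, 3, 6, 9, 12]
col_idx = [0, 1, 2, 0, 1, 3, 0, 2, 3, 1, 2, 3]
vals = [4.0, -1.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0, -1.0, -1.0, -1.0, 4.0]
b = [1.0, 2.0, 3.0, 4.0]

levels = compute_levels(row_ptr, col_idx)
x_level = gs_forward_sweep(row_ptr, col_idx, vals, b, [0.0] * 4, levels)
# Reference: the same sweep in plain sequential row order.
x_seq = gs_forward_sweep(row_ptr, col_idx, vals, b, [0.0] * 4, [[0, 1, 2, 3]])
```

In this small example, rows 1 and 2 share a level and could be updated concurrently. On a larger grid, the number of rows per level is what bounds the available parallelism.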

Keywords

High-performance conjugate gradient, HPCG, conjugate gradient, Xeon Phi, Gauss-Seidel, multi-grid, loop fusion, directed acyclic graph, task scheduling
