Optimizing Non-Contiguous Memory Access On Intel Xeon Phi Coprocessors

Mingfei Ma, Jinlong Hou, Jason Ye, Meena Arunachalam, Rafael Gutierrez

2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems (2015)

Abstract

As an innovative design for high performance computing, the Intel Xeon Phi coprocessor, based on the Intel Many Integrated Core (Intel MIC) architecture, relies heavily on its SIMD (single instruction, multiple data) unit. However, because of gather/scatter overhead, non-contiguous memory access has become a memory wall that prevents efficient utilization of the SIMD unit on Intel Xeon Phi coprocessors. Existing vectorization techniques for reducing gather/scatter overhead have focused on extracting data parallelism at the inter-loop and intra-loop levels in a decoupled manner. In this paper, we propose a novel inter-intra-hybrid vectorization technique that further improves SIMD efficiency by generating optimized SIMD code for loops that access non-contiguous memory. We also present additional strategies that improve SIMD unit parallelism through data padding and redundant computation. To evaluate our technique, we experiment with the two major functions of Sandia's miniMD benchmark, LJ (Lennard-Jones) force calculation and neighbor list build. The results show that our proposed method achieves a performance gain of 25%-40% over code auto-vectorized by the Intel compiler and outperforms existing methods. Our optimization method can also be applied to other highly parallel workloads with frequent non-contiguous memory access, which is very common in real-world scientific applications.
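To make the access pattern concrete, the following C sketch shows a neighbor-list force loop of the kind the abstract refers to. It is an illustration only and not the paper's implementation: the function names, the structure-of-arrays layout, the reduced-unit LJ formula, and the choice of `SIMD_WIDTH` as 16 (single-precision lanes in a 512-bit vector) are all assumptions. The indirect read `x[neigh[k]]` is what a vectorizing compiler has to turn into a gather. The sketch also shows one common form of data padding with redundant computation: the neighbor list is padded to a multiple of the vector width with a far-away dummy atom, whose contribution is computed and then discarded by the cutoff test.

```c
#include <stddef.h>

/* Assumed vector width: 16 single-precision lanes in a 512-bit register. */
#define SIMD_WIDTH 16

/* Round a neighbor count up to the next multiple of SIMD_WIDTH. */
static size_t padded_count(size_t n)
{
    return (n + SIMD_WIDTH - 1) / SIMD_WIDTH * SIMD_WIDTH;
}

/* Fill the tail of a neighbor list with the index of a dummy atom.
 * The caller must place the dummy atom outside every cutoff radius
 * and allocate neigh with at least padded_count(n) entries. */
static void pad_neighbors(int *neigh, size_t n, int dummy_index)
{
    for (size_t k = n; k < padded_count(n); ++k)
        neigh[k] = dummy_index;
}

/* Accumulate the Lennard-Jones force on atom i (reduced units,
 * sigma = epsilon = 1) over a padded neighbor list. */
static void lj_force_on_atom(const float *x, const float *y, const float *z,
                             int i, const int *neigh, size_t n_padded,
                             float cutoff_sq, float f[3])
{
    float fx = 0.0f, fy = 0.0f, fz = 0.0f;

    for (size_t k = 0; k < n_padded; ++k) {
        int j = neigh[k];            /* indirect index: x[j] is a gather */
        float dx = x[i] - x[j];
        float dy = y[i] - y[j];
        float dz = z[i] - z[j];
        float rsq = dx * dx + dy * dy + dz * dz;

        /* Redundant work for padded or out-of-range entries is
         * computed and then masked to zero by the cutoff test. */
        float sr2 = 1.0f / rsq;
        float sr6 = sr2 * sr2 * sr2;
        float force = (rsq < cutoff_sq)
                          ? 48.0f * sr6 * (sr6 - 0.5f) * sr2
                          : 0.0f;

        fx += dx * force;
        fy += dy * force;
        fz += dz * force;
    }

    f[0] = fx;
    f[1] = fy;
    f[2] = fz;
}
```

Because the trip count is always a multiple of `SIMD_WIDTH`, a vectorized version of this loop needs no scalar remainder loop; the cost is a few wasted lanes per atom.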

Keywords

high performance computing, gather/scatter, vectorization technique, performance optimization
