Optimizing Overlapped Memory Accesses In User-Directed Vectorization

ICS (2015)

Abstract
Current processors incorporate wide and powerful vector units whose optimal exploitation is crucial to reach peak performance. However, present autovectorizing compilers fall short of that goal. Exploiting some vector instructions requires aggressive approaches that are not affordable in production compilers. Thus, advanced programmers pursuing the best performance from their applications are compelled to manually vectorize them using low-level SIMD intrinsics.

We propose a user-directed code optimization that targets overlapped vector loads, i.e., vector loads that read scalar elements redundantly from memory. Instead, our optimization loads these elements once and combines them using advanced register-to-register vector instructions. This code is potentially more efficient and it uses advanced vector instructions that compilers do not widely exploit automatically. We also extend the OpenMP* SIMD directives with a new clause called overlap that allows users to easily enable and tune this optimization on demand. We implement our proposal for the Intel® Xeon Phi™ coprocessor.

Our evaluation shows up to 29% speed-up over five highly-optimized stencil kernels and workloads from real-world applications. Results also demonstrate how important user hints are to maximize performance.
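To make the notion of overlapped vector loads concrete, the following is a minimal sketch in portable C, not the paper's implementation. It emulates vector registers with a small struct, and every name in it (`VL`, `vec_t`, `vload`, `valign`, `stencil_naive`, `stencil_overlap`, `load_count`) is a hypothetical introduced for illustration. The helper `valign` stands in for whatever register-to-register combine instruction the target provides. The 3-point stencil and the assumption that the input is padded with `VL` extra elements are also ours.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VL 8                       /* hypothetical vector length, in elements */

typedef struct { float lane[VL]; } vec_t;   /* stand-in for a SIMD register */

size_t load_count = 0;             /* counts emulated vector loads from memory */

/* Emulated vector load: reads VL consecutive floats from memory. */
static vec_t vload(const float *p) {
    vec_t v;
    memcpy(v.lane, p, sizeof v.lane);
    load_count++;
    return v;
}

/* Emulated vector store: writes VL consecutive floats to memory. */
static void vstore(float *p, vec_t v) {
    memcpy(p, v.lane, sizeof v.lane);
}

/* Emulated register-to-register combine: returns lanes
   [shift, shift + VL) of the concatenation lo:hi. No memory access. */
static vec_t valign(vec_t lo, vec_t hi, int shift) {
    vec_t r;
    for (int i = 0; i < VL; i++) {
        int j = i + shift;
        r.lane[i] = (j < VL) ? lo.lane[j] : hi.lane[j - VL];
    }
    return r;
}

/* Lane-wise (a + b) + c. */
static vec_t vadd3(vec_t a, vec_t b, vec_t c) {
    vec_t r;
    for (int i = 0; i < VL; i++)
        r.lane[i] = (a.lane[i] + b.lane[i]) + c.lane[i];
    return r;
}

/* out[i] = in[i] + in[i+1] + in[i+2], with three overlapped loads per
   iteration: most elements are read from memory three times.
   Assumes n is a multiple of VL and in holds at least n + VL elements. */
void stencil_naive(const float *in, float *out, size_t n) {
    assert(n % VL == 0);
    for (size_t i = 0; i < n; i += VL) {
        vec_t a = vload(in + i);
        vec_t b = vload(in + i + 1);   /* overlaps a in VL - 1 elements */
        vec_t c = vload(in + i + 2);   /* overlaps a in VL - 2 elements */
        vstore(out + i, vadd3(a, b, c));
    }
}

/* Same stencil, but each element is loaded once: the shifted operands
   are rebuilt from two adjacent blocks already held in registers. */
void stencil_overlap(const float *in, float *out, size_t n) {
    assert(n % VL == 0);
    if (n == 0) return;
    vec_t cur = vload(in);                  /* first block, loaded once */
    for (size_t i = 0; i < n; i += VL) {
        vec_t next = vload(in + i + VL);    /* the only load per iteration */
        vec_t b = valign(cur, next, 1);     /* elements i+1 .. i+VL   */
        vec_t c = valign(cur, next, 2);     /* elements i+2 .. i+VL+1 */
        vstore(out + i, vadd3(cur, b, c));
        cur = next;                         /* reuse block in next iteration */
    }
}
```

Under these assumptions, both functions produce identical results, while the second issues roughly one vector load per iteration instead of three. Whether that trade of loads for register-to-register work pays off on real hardware depends on the target, which is consistent with the abstract's point that user hints matter.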
Keywords
SIMD, Vectorization, Compiler Optimization, OpenMP, Stencil, Intel Many Integrated Core Architecture