Optimizing large scale CUDA applications using input data specific optimizations

CVMP (2014)

Abstract
CUDA applications and general-purpose GPU (GPGPU) programs are widely used today to solve computationally intensive tasks. Substantial effort, in the form of tools, papers, books, and features, targets GPGPU APIs such as CUDA and OpenCL. The GPU architecture, being substantially different from traditional CPU architectures (x86, PowerPC, ARM), requires a different approach and introduces a different set of challenges. Apart from the traditional and well-examined GPGPU problems, such as memory access patterns, parallel designs, and occupancy, there is another important but not well studied setback: from a certain point onward, the bigger a CUDA application gets (in terms of lines of code), the slower it becomes, mostly due to register spilling. Register spilling is a problem on most architectures available today, but it can easily become a massive bottleneck on the GPU because of the GPU's nature. We examine in detail why this happens and what the common ways to solve it are, and we propose one simple, presently undocumented approach that may alleviate the issue in some situations. For the purposes of this paper we focus on the NVIDIA Fermi architecture.
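For context, register pressure in CUDA is typically managed with two documented compiler mechanisms: a per-kernel `__launch_bounds__` annotation and the global `-maxrregcount` compiler flag. The sketch below is illustrative and not code from the paper; the kernel and its parameters are hypothetical.

```cuda
// Illustrative sketch of register-pressure control in CUDA.
// __launch_bounds__(256, 4) tells the compiler this kernel will be launched
// with at most 256 threads per block and that at least 4 resident blocks per
// SM are desired; the compiler then caps registers per thread, spilling any
// excess to local memory (device memory, cached in L1/L2 on Fermi).
__global__ void __launch_bounds__(256, 4)
saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```

Alternatively, a global per-thread register cap can be set at compile time with `nvcc -maxrregcount=32 kernel.cu`, and the actual register and spill usage can be inspected with `nvcc --ptxas-options=-v kernel.cu`.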
Keywords
computer vision