Memory architecture and data locality

Programming Massively Parallel Processors (2023)

Abstract
This chapter introduces the on-chip memory architecture of GPUs, the concept of memory-bound applications, and techniques for improving the performance of memory-bound applications. The chapter uses matrix multiplication to illustrate opportunities for reducing the number of global memory accesses. It then introduces the tiling technique, in which barrier synchronization is used to coordinate the timing of executing threads for improved locality and reduced global memory accesses. However, the tiling technique introduces additional complexity in the form of boundary checks. The chapter uses matrix multiplication to illustrate the additional boundary checks that are needed for a tiled kernel to be applicable to arbitrary matrix sizes. The chapter concludes with an overview of how the usage of shared memory and registers affects the number of thread blocks that can be accommodated in each streaming multiprocessor.
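As an illustration of the tiling idea summarized above (a minimal sketch, not code reproduced from the chapter), the following CUDA kernel stages tiles of the input matrices in shared memory, uses __syncthreads() barriers to coordinate loading and use of each tile, and includes the boundary checks needed for matrix sizes that are not multiples of the tile width. The kernel name, the tile width of 16, and the row-major layout are assumptions made for this example.

```cuda
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size; the chapter's choice may differ

// Tiled matrix multiplication: C = A * B, where A is MxK, B is KxN, C is MxN.
// Each thread block computes one TILE_WIDTH x TILE_WIDTH tile of C, staging
// tiles of A and B in shared memory to reduce global memory accesses.
__global__ void tiledMatMul(const float* A, const float* B, float* C,
                            int M, int K, int N) {
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float acc = 0.0f;

    // Loop over all tiles of A and B needed for this element of C.
    for (int t = 0; t < (K + TILE_WIDTH - 1) / TILE_WIDTH; ++t) {
        // Boundary checks make the kernel applicable to arbitrary matrix
        // sizes; out-of-range elements are loaded as 0 so they do not
        // contribute to the accumulated dot product.
        int aCol = t * TILE_WIDTH + threadIdx.x;
        int bRow = t * TILE_WIDTH + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;

        // Barrier: all threads must finish loading before any thread reads the tile.
        __syncthreads();

        for (int i = 0; i < TILE_WIDTH; ++i)
            acc += As[threadIdx.y][i] * Bs[i][threadIdx.x];

        // Barrier: all threads must finish using the tile before it is overwritten.
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

With these assumptions, each block uses 2 * TILE_WIDTH * TILE_WIDTH * sizeof(float) = 2 KB of shared memory; as the abstract notes, this per-block shared-memory (and register) usage is what limits how many thread blocks a streaming multiprocessor can accommodate.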
Keywords
data locality, memory, architecture