TAFE: Thread Address Footprint Estimation for Capturing Data/Thread Locality in GPU Systems

PACT '20: International Conference on Parallel Architectures and Compilation Techniques, Virtual Event, GA, USA, October 2020

Abstract
In multi-GPU and multi-chiplet GPU systems exhibiting NUMA behavior, information about the addresses accessed by threads is crucial for various optimizations such as data/thread co-location and cache/scratchpad memory management. To make optimal decisions and avoid runtime overhead, knowledge about dynamic, potentially data-dependent access patterns should be available before kernel execution. Existing approaches require rewriting of applications or can only capture static, data-independent patterns. In this paper, we propose TAFE, a framework for accurate dynamic thread address footprint estimation of GPU applications. TAFE combines minimal static address pattern annotations with dynamic data dependency tracking to compute threadblock-specific address footprints prior to kernel launch. We propose a low-overhead software mechanism to track dynamic data dependencies and provide an optional lightweight hardware extension to support transparent tracking. We evaluate TAFE on different NUMA GPU system configurations. TAFE achieves 91% estimation accuracy across a wide range of access patterns while incurring less than 3% tracking and estimation overhead. We further demonstrate the benefits of using TAFE for efficient data/compute co-location. A TAFE-optimized thread/page mapping can reduce off-chip traffic by 23% (up to 62%) while requiring only minimal, architecture-oblivious annotations from the programmer. Furthermore, across multiple system configurations, a TAFE-optimized system achieves on average 45% and 32% (up to 2x) higher performance compared to an unoptimized baseline, and 10% and 22% higher than existing static, data-independent schemes.
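To illustrate the pre-launch footprint idea in the abstract, the following is a minimal, hypothetical sketch (not the paper's actual framework or annotation syntax): given an affine access-pattern annotation of the form `addr = base + stride * tid * elem_size`, the set of memory pages a thread block will touch can be computed before the kernel launches. The function name, parameters, and page size below are illustrative assumptions.

```python
# Hypothetical sketch of pre-launch, per-threadblock footprint estimation.
# Assumes a static affine annotation: addr = base + stride * tid * elem_size.
PAGE_SIZE = 4096  # bytes; an assumed page granularity

def block_footprint(base, stride, block_dim, block_idx, elem_size=4):
    """Return the set of page numbers a single thread block would touch,
    assuming each global thread tid accesses base + stride*tid*elem_size."""
    pages = set()
    for t in range(block_dim):
        tid = block_idx * block_dim + t        # global thread index
        addr = base + stride * tid * elem_size # annotated access address
        pages.add(addr // PAGE_SIZE)           # page touched by this thread
    return pages

# Block 0 of 256 threads with unit stride touches only page 0
# (addresses 0..1020); block 4 touches page 1 (addresses 4096..5116).
```

Estimates like this, computed before launch for every block, could drive thread/page co-location decisions of the kind the abstract describes; the paper's contribution is handling dynamic, data-dependent patterns that such a purely static sketch cannot capture.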
Keywords
Data/thread locality, NUMA GPU, data and compute partitioning