BlockMaestro: Enabling Programmer-Transparent Task-based Execution in GPU Systems

AmirAli Abdolrashidi, Hodjat Asghari Esfeden, Ali Jahanshahi, Kaustubh Singh, Nael B. Abu-Ghazaleh, Daniel Wong

2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) (2021)


Abstract

As modern GPU workloads grow in size and complexity, there is an ever-increasing demand for GPU computational power. Emerging workloads contain hundreds or thousands of GPU kernel launches, which incur high overheads, and exhibit data-dependent behavior between kernels, which requires synchronization, leading to GPU under-utilization. Task-based execution models have been proposed to solve these issues, but they require significant programmer effort to port applications to proprietary task-based programming models in order to specify tasks and task dependencies. To address this need, we propose BlockMaestro, a software-hardware solution that combines command queue reordering, kernel-launch-time static analysis, and runtime hardware support to dynamically identify and resolve thread-block level data dependencies between kernels. Through static analysis of memory access patterns at kernel-launch-time, BlockMaestro can extract inter-kernel thread block-level data dependencies. BlockMaestro also introduces kernel pre-launching to reduce the kernel launch overheads experienced by multiple dependent kernels. Correctness is enforced by dynamically resolving thread block-level data dependency at runtime through hardware support. BlockMaestro achieves an average speedup of 51.76% (up to 2.92x) on data-dependent benchmarks, and requires minimal hardware overhead.
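The idea of extracting thread block-level dependencies between two kernels can be illustrated with a small sketch. This is not BlockMaestro's implementation; it is a minimal model that assumes each thread block's reads and writes can be summarized as a single half-open byte range, which is a simplification made only for this example. The names `block_dependencies` and `ready_consumers`, and the range sizes used, are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumption: each thread block touches one contiguous half-open range [start, end).
Range = Tuple[int, int]


def overlaps(a: Range, b: Range) -> bool:
    """Return True if two half-open ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


def block_dependencies(
    producer_writes: List[Range], consumer_reads: List[Range]
) -> Dict[int, List[int]]:
    """Map each consumer thread block to the producer blocks whose writes it reads."""
    return {
        c: [p for p, w in enumerate(producer_writes) if overlaps(w, r)]
        for c, r in enumerate(consumer_reads)
    }


def ready_consumers(deps: Dict[int, List[int]], finished: Set[int]) -> List[int]:
    """List consumer blocks whose producer blocks have all finished."""
    return [c for c, parents in deps.items() if all(p in finished for p in parents)]


# Hypothetical workload: 4 producer blocks each write 256 bytes,
# 2 consumer blocks each read 512 bytes of that output.
producer_writes = [(i * 256, (i + 1) * 256) for i in range(4)]
consumer_reads = [(j * 512, (j + 1) * 512) for j in range(2)]

deps = block_dependencies(producer_writes, consumer_reads)
print(deps)                                  # {0: [0, 1], 1: [2, 3]}
print(ready_consumers(deps, finished={0, 1}))  # [0]
```

In this model, consumer block 0 can start as soon as producer blocks 0 and 1 finish, without waiting for the whole producer kernel to complete. That is the kind of fine-grained overlap a kernel-level barrier would prevent.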


Keywords

GPGPU, SIMD, Data Dependency, Thread Block Scheduling, Just-in-time
