Low-Overhead Trace Collection and Profiling on GPU Compute Kernels

Sébastien Darche, Michel R. Dagenais

ACM Transactions on Parallel Computing (2023)

While GPUs can bring substantial speedups to compute-intensive tasks, programming them is notoriously hard. From the programming model to microarchitectural particularities, the programmer may encounter many pitfalls that hinder performance in obscure ways. Numerous performance analysis tools provide helpful data on the efficiency of compute kernels, but few allow the programmer to efficiently gather runtime information directly on the device and pinpoint the sections to optimize. In this paper, we propose an instrumentation method that collects traces while the compute kernel executes, with reduced overhead compared to other approaches, by exploiting the inherently parallel behavior of GPUs and compartmentalizing tracing phases. The reference implementation is freely available and induces an average overhead of 1.6× on a popular scientific computing benchmark and 1.5× over the kernel execution time. This represents an order-of-magnitude improvement over similar work and proves useful for timing-guided optimizations. The tool generates insightful execution traces and timestamps that can be analyzed to better understand performance issues in the kernel.
Key words
GPU Programming, Software Tracing, Performance Analysis