CUDA-DTM: Distributed Transactional Memory for GPU Clusters
NETYS 2019
Abstract
We present CUDA-DTM, the first Distributed Transactional Memory framework written in CUDA for large-scale GPU clusters. Transactional Memory has become an attractive auto-coherence scheme for GPU applications with irregular memory access patterns due to its ability to avoid serializing threads while still maintaining programmability. We extend GPU Software Transactional Memory to allow threads across many GPUs to access a coherent distributed shared memory space and propose a scheme for GPU-to-GPU communication using CUDA-Aware MPI. The performance of CUDA-DTM is evaluated using a suite of seven irregular memory access benchmarks with varying degrees of compute intensity, contention, and node-to-node communication frequency. Using a cluster of 256 devices, our experiments show that GPU clusters using CUDA-DTM can be up to 115x faster than CPU clusters.
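The abstract's communication scheme relies on CUDA-Aware MPI, in which device pointers are handed directly to MPI calls so GPU-to-GPU transfers avoid explicit staging through host memory. The sketch below illustrates that pattern only; it is not the paper's actual implementation, and the buffer name and sizes are illustrative assumptions.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Minimal sketch of CUDA-Aware MPI, assuming an MPI build with CUDA
// support: the device pointer d_buf is passed directly to MPI_Send /
// MPI_Recv, with no explicit cudaMemcpy to a host staging buffer.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;                    // illustrative buffer size
    int *d_buf;                            // buffer in GPU memory
    cudaMalloc(&d_buf, n * sizeof(int));

    if (rank == 0) {
        // ... a kernel would fill d_buf here ...
        MPI_Send(d_buf, n, MPI_INT, 1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_INT, 0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

With a non-CUDA-aware MPI build, the same exchange would require copying d_buf to host memory before the send and back to the device after the receive, which is the overhead the paper's communication scheme avoids.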
Keywords
Distributed Transactional Memory, GPU cluster, CUDA