gHA: An Efficient and Iterative Checkpointing Mechanism for Virtualized GPUs

ZiZhuo Zhang, Xinhao Xu, Mochi Xue, Jiajun Wang, Zhengwei Qi, Yaozu Dong

APSys (2016)

Abstract

The Graphics Processing Unit (GPU) is the core of graphics computation. Because of its parallel nature, it has the potential to accelerate a wide range of computations. One example is the adoption of GPU virtualization in cloud environments for high-performance computing. However, the GPU has its limitations. For various reasons a GPU may hang while running applications, and full GPU virtualization does not support live migration as is available for CPU virtualization in Xen. The state-of-the-art workaround is to reset the GPU hardware via a specific mechanism provided by the GPU vendor. The disadvantage of such a reset is that it makes applications unavailable in order to maintain the stability of the operating system. This paper presents an efficient method to address these limitations. In particular, we start from gVirt [28] and develop a novel checkpointing mechanism within our open-source solution, which we call High Availability gVirt (gHA for short). The key to our scheme is to checkpoint and back up the whole virtual machine (VM) to a new host. When the GPU hangs, the backup VM takes over to guarantee the high availability of the virtualized environment. Not surprisingly, the checkpointing mechanism incurs overhead. Our tests show that the downtime of the VM backup is between 220 ms and 400 ms, only 100-200 ms more than for an idle VM without GPU virtualization support, which is quite acceptable. Across numerous GPU workload tests, our evaluation shows that different GPU workloads achieve 65-92% of gVirt performance, which is a good trade-off between performance and stability. Our solution occupies 80-180 Mbps of bandwidth during execution, which is fairly manageable given the total bandwidth of 1 Gbps between the two hosts.
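
The checkpoint-and-takeover idea described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the gHA implementation: the `ToyVM` class, the `checkpoint` and `failover` functions, and the page-level dirty tracking are all hypothetical names and simplifications introduced here for illustration. The abstract itself only states that the whole VM is checkpointed to a backup host and that the backup takes over when the GPU hangs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a VM: maps page number -> contents."""
    pages: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)

    def write(self, page: int, value: str) -> None:
        self.pages[page] = value
        self.dirty.add(page)  # remember what changed since the last checkpoint


def checkpoint(primary: ToyVM, backup: ToyVM) -> int:
    """Copy only pages dirtied since the previous checkpoint (an assumption
    of this sketch); return how many pages were sent."""
    sent = len(primary.dirty)
    for page in primary.dirty:
        backup.pages[page] = primary.pages[page]
    primary.dirty.clear()
    return sent


def failover(backup: ToyVM) -> ToyVM:
    """On a simulated GPU hang, resume from the last checkpointed state."""
    return ToyVM(pages=dict(backup.pages))


primary, backup = ToyVM(), ToyVM()
primary.write(0, "frame-1")
primary.write(1, "cmd-buffer-1")
first = checkpoint(primary, backup)    # both pages are sent
primary.write(1, "cmd-buffer-2")
second = checkpoint(primary, backup)   # only the re-dirtied page is sent
primary.write(0, "frame-2")            # not yet checkpointed when the hang occurs
active = failover(backup)              # backup resumes without "frame-2"
```

In this toy model, any state written after the most recent checkpoint is lost on takeover, which is why the checkpoint interval trades off overhead against how much work must be redone.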
