SchedTune: A Heterogeneity-Aware GPU Scheduler for Deep Learning

2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2022

Abstract
Modern cluster management systems, such as Kubernetes, support heterogeneous workloads and resources. However, the existing resource schedulers in these systems neither differentiate between heterogeneous GPU resources, which are becoming the norm, nor support GPU sharing, which is necessary for the emerging collocation of jobs and multi-tenant applications. As a result, these systems suffer from low GPU resource utilization, higher queuing delays, and increased application makespan, i.e., the duration between the arrival of the first job and the completion of the last job of a workflow. This is especially problematic for crucial deep learning (DL) applications. To this end, in this paper, we profile and analyze DL jobs on heterogeneous GPUs, investigate the interference caused by collocating jobs on GPUs, and use this information to predict GPU memory demand and job completion times. We propose SCHEDTUNE, a machine-learning-based, heterogeneity-aware scheduler that achieves higher GPU memory utilization and reduces out-of-memory (OOM) failures, while also improving makespan. Our evaluation shows that SCHEDTUNE's GPU memory predictors and scheduler outperform state-of-the-art predictors, achieving 81% higher GPU memory utilization, 100% detection and avoidance of OOM errors, and a 17.5% reduction in makespan compared to the default Kubernetes scheduler.
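To make the idea of memory-demand-aware placement on heterogeneous GPUs concrete, the following is a minimal, illustrative Python sketch. It is not the paper's implementation: the predictor output, the GPU inventory, the headroom factor, and all names (place_job, predicted_mem_mib, etc.) are hypothetical. It only shows the gist of placing a job on the GPU whose free memory best fits the predicted demand, and queuing the job when no GPU can hold it, which is how OOM failures are avoided in this simplified model.

from dataclasses import dataclass

@dataclass
class GPU:
    name: str            # a GPU model in a heterogeneous pool (hypothetical)
    total_mem_mib: int   # physical memory capacity
    used_mem_mib: int = 0

    def free_mem(self) -> int:
        return self.total_mem_mib - self.used_mem_mib

@dataclass
class Job:
    job_id: str
    predicted_mem_mib: int   # output of a (hypothetical) memory-demand predictor

def place_job(job: Job, gpus: list[GPU], headroom: float = 0.05) -> GPU | None:
    """Place a job on the GPU whose remaining memory best fits its predicted demand."""
    demand = int(job.predicted_mem_mib * (1 + headroom))  # safety margin on the prediction
    candidates = [g for g in gpus if g.free_mem() >= demand]
    if not candidates:
        return None  # queue the job rather than risk an out-of-memory failure
    # Best fit: pick the GPU that leaves the least free memory after placement,
    # packing jobs tightly to raise overall GPU memory utilization.
    best = min(candidates, key=lambda g: g.free_mem() - demand)
    best.used_mem_mib += demand
    return best

if __name__ == "__main__":
    pool = [GPU("gpu-a", 16384), GPU("gpu-b", 32768)]
    for job in [Job("train-1", 12000), Job("train-2", 9000), Job("train-3", 20000)]:
        target = place_job(job, pool)
        print(job.job_id, "->", target.name if target else "queued (would OOM)")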
Keywords
Deep learning, Kubernetes, GPU sharing, Resource heterogeneity, Resource scheduling