
Hydra: Deadline-Aware and Efficiency-Oriented Scheduling for Deep Learning Jobs on Heterogeneous GPUs

IEEE Transactions on Computers (2023)

Cited by: 3
Abstract
With the rapid proliferation of deep learning (DL) jobs running on heterogeneous GPUs, scheduling DL jobs to meet various requirements, such as meeting deadlines and reducing job completion time (JCT), is critical. Unfortunately, existing efficiency-oriented and deadline-aware approaches remain rudimentary: they cannot schedule jobs to meet deadline requirements while also reducing total JCT, especially when job execution times vary across heterogeneous GPUs. Therefore, we present Hydra, a novel quantitative cost comparison approach, to address this scheduling issue. Here, the cost represents the total JCT plus a dynamic penalty calculated from the total tardiness (i.e., the delay beyond the deadline) of all jobs. Hydra adopts a sampling approach that exploits the inherent iterative periodicity of DL jobs to accurately estimate job execution times on heterogeneous GPUs. Hydra then searches over combinations of job sequences and GPU assignments to minimize this cost using an efficient branch-and-bound algorithm. Finally, evaluation experiments on Alibaba traces show that Hydra reduces total tardiness by 85.8% while reducing total JCT as much as possible, compared with state-of-the-art efforts.
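The cost model described in the abstract (total JCT plus a penalty derived from total tardiness) can be sketched as follows. This is a minimal illustration, not Hydra's actual implementation: the penalty weight `alpha` and the job data are illustrative assumptions, and the abstract's "dynamic penalty" is simplified here to a fixed linear weight.

```python
# Hypothetical sketch of the cost objective from the abstract:
# cost = total JCT + penalty * total tardiness.
# `alpha` (the penalty weight) is an assumed parameter for illustration.

def total_cost(jobs, alpha=2.0):
    """jobs: list of (completion_time, deadline) pairs."""
    total_jct = sum(c for c, _ in jobs)
    # Tardiness is the delay beyond the deadline, zero if the job finishes on time.
    total_tardiness = sum(max(0.0, c - d) for c, d in jobs)
    return total_jct + alpha * total_tardiness

# Example: two jobs; the second finishes 3 time units past its deadline.
jobs = [(10.0, 12.0), (8.0, 5.0)]
print(total_cost(jobs))  # 18 (JCT) + 2.0 * 3 (tardiness) = 24.0
```

A scheduler comparing candidate job/GPU orderings would evaluate this cost for each candidate and keep the minimum, which is what Hydra's branch-and-bound search does at scale by pruning orderings whose partial cost already exceeds the best found so far.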
Keywords
Deadline-aware scheduler, deep learning, GPU cluster