ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning.

Diandian Gu,Yihao Zhao,Yinmin Zhong,Yifan Xiong,Zhenhua Han,Peng Cheng,Fan Yang,Gang Huang,Xin Jin,Xuanzhe Liu

ASPLOS (2)（2023）

Cited 10|Views127

No score

Abstract

This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep learning. ElasticFlow provides a serverless interface with two distinct features: ( i ) users specify only the deep neural network (DNN) model and hyperparameters for a job, but not the number of GPUs; ( ii ) users specify the deadline for a job, but not the amount of time to occupy GPUs. In contrast to existing server-centric platforms, ElasticFlow provides performance guarantees in terms of meeting deadlines while alleviating tedious, low-level, and manual resource management for deep learning developers. The characteristics of distributed training introduce two challenges. First, the training throughput scales non-linearly with the number of GPUs. Second, the scaling efficiency is affected by worker placement. To address these challenges, we propose Minimum Satisfactory Share to capture the resource usage of training jobs to meet deadlines, and ElasticFlow performs admission control based on it. We develop a greedy algorithm that dynamically allocates resources to admitted jobs based on diminishing returns. We apply buddy allocation to worker placement to eliminate the effect of topology. Evaluation results on a cluster of 128 GPUs show that ElasticFlow increases the number of jobs that can meet their deadlines by 1.46–7.65× compared to existing solutions.

Translated text

AI Read Science

Must-Reading Tree

Example

Generate MRT to find the research sequence of this paper

Chat Paper

Summary is being generated by the instructions you defined