Alleviating Load Imbalance in Data Processing for Large-Scale Deep Learning

20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), 2020

Abstract
Scalable deep learning remains an onerous challenge, constrained by many factors, including load imbalance. In many deep-learning software systems, multiple data-processing components (neural network training, graph scheduling, the input pipeline, and gradient synchronization) execute simultaneously and asynchronously. These components can contend with one another for hardware resources, leading to severe load imbalance and, in turn, degraded scalability. In this paper, we present an in-depth analysis of two state-of-the-art deep-learning software systems, TensorFlow and Horovod, to understand their scalability limitations. Based on this analysis, we propose four novel solutions that minimize resource contention and improve deep-learning performance by up to 35% when training various neural networks on 24,576 GPUs of the Summit supercomputer at Oak Ridge National Laboratory.
Keywords
scalable deep learning, load imbalance, resource contention, CPU, GPU, MPI, TensorFlow, Horovod