Whale: Scaling Deep Learning Model Training to the Trillions

arXiv (Cornell University), 2020

Abstract
Scaling up deep neural networks has been proven effective in improving model quality, while it also brings ever-growing training challenges. This paper presents Whale, an automatic and hardware-aware distributed training framework for giant models. Whale generalizes the expression of parallelism with four primitives, which can define various parallel strategies as well as flexible hybrid strategies, including combination and nesting patterns. It allows users to build models at an arbitrary scale by adding a few annotations and automatically transforms the local model into a distributed implementation. Moreover, Whale is hardware-aware and highly efficient even when training on GPUs of mixed types, which meets the growing demand for heterogeneous training in industrial clusters. Whale sets a milestone for training the largest multimodal pretrained model, M6. The success of M6 is achieved by Whale's design to decouple algorithm modeling from system implementations, i.e., algorithm developers can focus on model innovation, since it takes only three lines of code to scale the M6 model to trillions of parameters on a cluster of 480 GPUs.
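The abstract's claim that a local model can be scaled "by adding a few annotations" can be pictured with a short sketch. The scope names below (`replicate`, `split`) are hypothetical stand-ins for the parallelism primitives the abstract describes, not Whale's confirmed API; the placeholder context managers exist only so the sketch runs on its own.

```python
# Illustrative sketch only: the scopes below are hypothetical stand-ins for
# annotation primitives of the kind the abstract describes; they are NOT
# Whale's real API. The point is that the local model code stays unchanged
# and a few surrounding annotations declare how each part is parallelized.
from contextlib import contextmanager


@contextmanager
def replicate(devices):
    """Hypothetical: mark the enclosed layers for data-parallel replication."""
    print(f"replicate enclosed layers across {devices} devices")
    yield


@contextmanager
def split(axis):
    """Hypothetical: mark the enclosed layers for sharded (tensor) parallelism."""
    print(f"split enclosed layers along axis {axis}")
    yield


def build_local_model(name):
    """Stand-in for an unchanged, single-device model definition."""
    return f"{name} graph"


# "A few annotations": the model-building code itself is untouched; only the
# scopes wrapped around it express the intended hybrid parallel strategy.
with replicate(devices=8):
    encoder = build_local_model("encoder")
with split(axis=0):
    mlp = build_local_model("mlp")
```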
Keywords
deep learning model training, deep learning, trillions