Whale: Scaling Deep Learning Model Training to the Trillions

arXiv (Cornell University), 2020

Abstract
Scaling up deep neural networks has been proven effective in improving model quality, while it also brings ever-growing training challenges. This paper presents Whale, an automatic and hardware-aware distributed training framework for giant models. Whale generalizes the expression of parallelism with four primitives, which can define various parallel strategies as well as flexible hybrid strategies, including combination and nesting patterns. It allows users to build models at an arbitrary scale by adding a few annotations and automatically transforms the local model into a distributed implementation. Moreover, Whale is hardware-aware and highly efficient even when training on GPUs of mixed types, which meets the growing demand for heterogeneous training in industrial clusters. Whale sets a milestone for training the largest multimodal pretrained model, M6. The success of M6 is achieved by Whale's design to decouple algorithm modeling from system implementations, i.e., algorithm developers can focus on model innovation, since it takes only three lines of code to scale the M6 model to trillions of parameters on a cluster of 480 GPUs.
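The abstract's claim that a local model can be scaled "by adding a few annotations" can be pictured with a short sketch. The scope names below (`replicate`, `split`) are hypothetical stand-ins for the parallelism primitives the abstract describes, not Whale's confirmed API; the placeholder context managers exist only so the sketch runs on its own.

```python
# Illustrative sketch only: the scopes below are hypothetical stand-ins for
# annotation primitives of the kind the abstract describes; they are NOT
# Whale's real API. The point is that the local model code stays unchanged
# and a few surrounding annotations declare how each part is parallelized.
from contextlib import contextmanager


@contextmanager
def replicate(devices):
    """Hypothetical: mark the enclosed layers for data-parallel replication."""
    print(f"replicate enclosed layers across {devices} devices")
    yield


@contextmanager
def split(axis):
    """Hypothetical: mark the enclosed layers for sharded (tensor) parallelism."""
    print(f"split enclosed layers along axis {axis}")
    yield


def build_local_model(name):
    """Stand-in for an unchanged, single-device model definition."""
    return f"{name} graph"


# "A few annotations": the model-building code itself is untouched; only the
# scopes wrapped around it express the intended hybrid parallel strategy.
with replicate(devices=8):
    encoder = build_local_model("encoder")
with split(axis=0):
    mlp = build_local_model("mlp")
```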
Keywords
deep learning model training, deep learning, trillions