Reducing Underflow in Mixed Precision Training by Gradient Scaling

IJCAI 2020 (2020)

Cited 5 | Viewed 112
Abstract
By leveraging the half-precision floating-point format (FP16), which is well supported by recent GPUs, mixed precision training (MPT) enables us to train larger models under the same or even a smaller budget. However, due to the limited representation range of FP16, gradients often suffer severe underflow, which hinders backpropagation and degrades model accuracy. MPT adopts loss scaling, which scales up the loss value just before backpropagation starts, to mitigate underflow by enlarging the magnitude of gradients. Unfortunately, scaling once is insufficient: gradients from distinct layers can have different data distributions and require non-uniform scaling, so heuristics and hyperparameter tuning are needed to mitigate these side effects of loss scaling. We propose gradient scaling, a novel method that analytically calculates an appropriate scale for each gradient on the fly. It addresses underflow effectively without causing numerical problems such as overflow and without tedious hyperparameter tuning. Experiments on a variety of networks and tasks show that gradient scaling can improve accuracy and reduce overall training effort compared with state-of-the-art MPT.
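To make the underflow problem and the loss-scaling remedy concrete, the short sketch below (not from the paper; a minimal NumPy illustration with a hypothetical gradient value and a typical scale factor of 1024) shows a gradient that rounds to zero in FP16 but becomes representable once it is scaled up before being cast:

```python
import numpy as np

# Hypothetical gradient magnitude, smaller than FP16's smallest subnormal (~6e-8).
tiny_grad = 1e-8

print(np.float16(tiny_grad))         # 0.0 -> the gradient underflows in FP16
print(np.float16(tiny_grad * 1024))  # ~1.0e-05 -> representable after scaling by 2**10

# Loss scaling applies one such factor to the loss before backpropagation, so the
# chain rule enlarges every gradient by the same amount; the gradient scaling
# method proposed here instead derives a suitable scale per gradient analytically.
```

For reference, frameworks such as PyTorch expose a dynamic variant of the loss-scaling baseline through torch.cuda.amp.GradScaler, which adjusts a single global scale based on overflow checks; the abstract's point is that one global factor cannot fit layers whose gradient distributions differ.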