On the Variance of the Adaptive Learning Rate and Beyond

ICLR (2020)

Cited by 434 | Views 316 | EI

TL;DR: We explore the underlying principle of the effectiveness of the warmup heuristic used for adaptive optimization algorithms.

Abstract

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate -- its variance is problematically large...
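To make the variance claim concrete, here is a small Monte-Carlo sketch (our own construction, not code from the paper) that estimates the variance of Adam's bias-corrected adaptive term sqrt((1 - β2^t) / v_t) under i.i.d. standard-normal gradients; the function name and all parameter choices are illustrative assumptions.

```python
import numpy as np

def adaptive_term_variance(t, beta2=0.999, trials=10_000, seed=0):
    """Monte-Carlo estimate of Var[ sqrt((1 - beta2**t) / v_t) ], Adam's
    bias-corrected adaptive term, assuming i.i.d. N(0, 1) gradients."""
    rng = np.random.default_rng(seed)
    v = np.zeros(trials)
    for _ in range(t):
        g = rng.standard_normal(trials)
        v = beta2 * v + (1 - beta2) * g ** 2  # second-moment EMA, one value per trial
    psi = np.sqrt((1 - beta2 ** t) / v)
    return psi.var()

# The estimate is huge (formally divergent) in the first few steps and
# shrinks as more gradients are accumulated.
for t in (1, 2, 5, 10, 100, 1000):
    print(t, adaptive_term_variance(t))
```

At t = 1 the true variance is divergent, so the printed estimate is enormous and unstable; it drops rapidly once more gradients have been accumulated, which is the behaviour the abstract refers to.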
Introduction
  • [Figure residue: training-loss curves; legend entries: Overlapped, CAdam, Adam-warmup, RAdam.]
  • It has been observed that these optimization methods may converge to bad/suspicious local optima, and have to resort to a warmup heuristic, i.e., using a small learning rate in the first few epochs of training, to mitigate this problem (Vaswani et al., 2017; Popel & Bojar, 2018).
  • Similar phenomena are observed in other scenarios such as BERT pre-training (Devlin et al., 2019).
Highlights
  • [Figure legend: Adam-eps, Adam-2k, Adam-vanilla.]

    Fast and stable optimization algorithms are what generations of researchers have been pursuing (Gauss, 1823; Cauchy, 1847).
  • We show that its root cause is that the adaptive learning rate has undesirably large variance in the early stage of model training, due to the limited amount of training samples being used.
  • Inspired by our analysis, we propose a new variant of Adam, called Rectified Adam (RAdam), which explicitly rectifies the variance of the adaptive learning rate based on our derivations (a sketch of the update is given after this list).
  • We show that the convergence issue is due to the undesirably large variance of the adaptive learning rate in the early stage of model training.
  • We explore the underlying principle of the effectiveness of the warmup heuristic used for adaptive optimization algorithms.
  • We identify that, due to the limited amount of samples in the early stage of model training, the adaptive learning rate has an undesirably large variance and can cause the model to converge to suspicious/bad local optima. We provide both empirical and theoretical evidence to support our hypothesis, and further propose a new variant of Adam (RAdam).
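As a concrete illustration of the rectification, the following NumPy sketch implements one rectified update in the spirit of the paper's algorithm; the function name and array arguments (param, grad, exp_avg, exp_avg_sq) are our own choices, and details may differ from the official implementation.

```python
import numpy as np

def radam_step(param, grad, exp_avg, exp_avg_sq, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One rectified update in the spirit of RAdam (sketch, not official code).

    exp_avg / exp_avg_sq are exponential moving averages of the gradient and
    the squared gradient; `step` starts at 1. Arrays are updated in place.
    """
    exp_avg[:] = beta1 * exp_avg + (1 - beta1) * grad
    exp_avg_sq[:] = beta2 * exp_avg_sq + (1 - beta2) * grad ** 2

    m_hat = exp_avg / (1 - beta1 ** step)      # bias-corrected momentum

    rho_inf = 2.0 / (1 - beta2) - 1.0          # max length of the approximated SMA
    rho_t = rho_inf - 2.0 * step * beta2 ** step / (1 - beta2 ** step)

    if rho_t > 4.0:
        # Variance of the adaptive learning rate is tractable: rectify it.
        v_hat = np.sqrt(exp_avg_sq / (1 - beta2 ** step))
        r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf) /
                      ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        param -= lr * r_t * m_hat / (v_hat + eps)
    else:
        # Early steps: the variance is divergent, so fall back to an
        # un-adapted (momentum-only) update.
        param -= lr * m_hat
    return param
```

Calling this with step = 1, 2, ... keeps the update un-adapted while ρ_t ≤ 4 and switches to the variance-rectified adaptive step afterwards, which is the behaviour described in the bullets above.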
Methods
  • With a consistent adaptive learning rate variance, our proposed method achieves similar performance to that of previous state-of-the-art warmup heuristics.
  • We found that RAdam requires less hyperparameter tuning
  • We visualize their learning curves in Figure 7.
  • When the learning rate is set to 0.1, Adam with 100 steps of warmup fails to reach satisfying performance and only achieves an accuracy of 90.13, whereas RAdam reaches an accuracy of 91.06 with the original moving-average settings (i.e., β1 = 0.9, β2 = 0.999); see the setup sketch after this list.
  • We conjecture that this is because RAdam, which is based on a rigorous variance analysis, explicitly avoids the extreme situation where the variance is divergent and rectifies the variance to be consistent otherwise.
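Purely for illustration, here is a hedged PyTorch sketch of the two setups compared above: Adam with a 100-step linear warmup versus RAdam, both at lr = 0.1 and β1 = 0.9, β2 = 0.999. The tiny linear model and the random data are stand-ins for the CIFAR10 setup, and the sketch assumes a PyTorch release that ships torch.optim.RAdam.

```python
import torch
from torch import nn, optim

model = nn.Linear(32, 10)  # stand-in for the CIFAR10 network used in the paper

# Option 1: Adam with a 100-step linear warmup (the heuristic baseline).
opt = optim.Adam(model.parameters(), lr=0.1, betas=(0.9, 0.999))
sched = optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda s: min(1.0, (s + 1) / 100))

# Option 2: RAdam with the same settings and no warmup schedule.
# opt = optim.RAdam(model.parameters(), lr=0.1, betas=(0.9, 0.999))
# sched = None

for step in range(300):
    x = torch.randn(64, 32)            # dummy batch
    y = torch.randint(0, 10, (64,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if sched is not None:
        sched.step()  # scales the lr linearly up to 0.1 over the first 100 steps
```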
Conclusion
  • We explore the underlying principle of the effectiveness of the warmup heuristic used for adaptive optimization algorithms.
  • We identify that, due to the limited amount of samples in the early stage of model training, the adaptive learning rate has an undesirably large variance and can cause the model to converge to suspicious/bad local optima.
  • We provide both empirical and theoretical evidence to support our hypothesis, and further propose a new variant of Adam (RAdam).
Tables
  • Table 1: Image classification accuracy (SGD 91.51, Adam 90.54, RAdam 91.38)
  • Table 2: BLEU score on neural machine translation
  • Table 3: Performance on CIFAR10 (lr = 0.1)
Funding
  • Research was sponsored in part by DARPA grants No. W911NF-17-C-0099 and FA8750-19-2-1004, National Science Foundation grants IIS 16-18481, IIS 17-04532, and IIS-17-41317, and DTRA grant HDTRA11810026.
Study Subjects and Analysis

Samples: 2000
For comparison with other methods, its iterations are indexed from -1999 instead of 1. In Figure 1, we observe that, after using these additional two thousand samples to estimate the adaptive learning rate, Adam-2k avoids the convergence problem of vanilla Adam. Also, comparing Figure 2 and Figure 3, using enough samples prevents the gradient distribution from being distorted (a sketch of this variant is given below).
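As we read it, Adam-2k spends its first 2,000 iterations only refining the second-moment estimate (i.e., the adaptive learning rate) while leaving the parameters and the momentum untouched. The NumPy sketch below illustrates that idea; the function name, arguments, and the bias-correction choice are our own assumptions and may differ from the authors' implementation.

```python
import numpy as np

def adam2k_step(param, grad, m, v, step, warmup_samples=2000,
                lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-2k style update (sketch): the first `warmup_samples` iterations
    only refine the second-moment estimate v; parameters and momentum stay
    fixed, so the extra samples serve purely to stabilize the adaptive
    learning rate."""
    v[:] = beta2 * v + (1 - beta2) * grad ** 2
    if step <= warmup_samples:
        return param                      # consume a sample, update nothing else
    m[:] = beta1 * m + (1 - beta1) * grad
    # One reasonable bias-correction choice: count momentum steps from the
    # end of the warm-up, and second-moment steps from the very beginning.
    m_hat = m / (1 - beta1 ** (step - warmup_samples))
    v_hat = v / (1 - beta2 ** step)
    param -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return param
```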

Datasets: 3
The performance on language modeling (i.e., One Billion Word (Chelba et al., 2013)) and image classification (i.e., CIFAR10 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009)) is presented in Figures 4 and 5. The results show that RAdam outperforms Adam on all three datasets. As shown in Figure 4, although the rectification term makes RAdam slower than vanilla Adam in the first few epochs, it allows RAdam to converge faster afterwards.

Datasets: 3
To examine the effectiveness of RAdam, we first conduct comparisons on neural machine translation, on which the state of the art employs Adam with linear warmup. Specifically, we conduct experiments on three datasets: IWSLT’14 De-En, IWSLT’14 En-De, and WMT’16 En-De. Due ...

References
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In ICML, 2017.
  • Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. Advances in optimizing recurrent networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8624–8628. IEEE, 2013.
  • Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. In ICML, 2018.
  • Augustin Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.
  • Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, 2014.
  • Ciprian Chelba, Tomas Mikolov, Michael Schuster, Qi Ge, Thorsten Brants, Philipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH, 2013.
  • Jinghui Chen and Quanquan Gu. Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763, 2018.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
  • Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
  • John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT, 2010.
  • Carl-Friedrich Gauss. Theoria combinationis observationum erroribus minimis obnoxiae. Commentationes Societatis Regiae Scientiarum Gottingensis Recentiores, 1823.
  • Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. In ICLR, 2019.
  • Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent. 2012.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
  • Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Liyuan Liu, Xiang Ren, Jingbo Shang, Jian Peng, and Jiawei Han. Efficient contextualized representation: Language model pruning for sequence labeling. In EMNLP, 2018.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. In ICLR, 2018.
  • Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. In ICLR, 2019.
  • Robert Nau. Forecasting with moving averages. 2014.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL, 2019.
  • Martin Popel and Ondřej Bojar. Training tips for the transformer model. The Prague Bulletin of Mathematical Linguistics, 110(1):43–70, 2018.
  • Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In ICLR, 2018.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • Kirk M. Wolter. Taylor series methods. In Introduction to Variance Estimation, 2007.
  • Lin Xiao, Adams Wei Yu, Qihang Lin, and Weizhu Chen. DSCOVR: Randomized primal-dual block coordinate algorithms for asynchronous distributed optimization. J. Mach. Learn. Res., 2017.
  • Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  • Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. In ICLR, 2019.
Implementation Details

  • Our implementation is based on previous work (Liu et al., 2018). Specifically, we use two-layer LSTMs with 2048 hidden states and an adaptive softmax to conduct experiments on the One Billion Word dataset. A randomly initialized word embedding of 300 dimensions is used as the input, and the adaptive softmax uses the default setting (cut-offs set to [4000, 40000, 200000]). Additionally, as pre-processing, we replace all tokens occurring three times or fewer with UNK, which shrinks the dictionary from 7.9M to 6.4M. Dropout is applied to each layer with a ratio of 0.1, and gradients are clipped at 5.0. We use the default hyper-parameters to update the moving averages, i.e., β1 = 0.9 and β2 = 0.999. The learning rate starts at 0.001 and is decayed at the start of the 10th epoch. LSTMs are unrolled for 20 steps without resetting the LSTM states, and the batch size is set to 128. All models are trained on one NVIDIA Tesla V100 GPU. A hedged configuration sketch follows.
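Purely as an illustration of the configuration above (not the authors' code), here is a minimal PyTorch sketch of a two-layer LSTM language model with an adaptive softmax using the quoted cut-offs and optimizer settings; the stand-in vocabulary size, class name, and dummy batch are our own assumptions.

```python
import torch
from torch import nn, optim

VOCAB = 250_000   # stand-in; the text above reports ~6.4M types after UNK replacement
EMB, HIDDEN, BPTT, BATCH = 300, 2048, 20, 128

class LstmLM(nn.Module):
    """Two-layer LSTM language model with an adaptive softmax (sketch)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)          # randomly initialized, 300-dim
        self.lstm = nn.LSTM(EMB, HIDDEN, num_layers=2,
                            dropout=0.1, batch_first=True)
        self.softmax = nn.AdaptiveLogSoftmaxWithLoss(
            HIDDEN, VOCAB, cutoffs=[4000, 40000, 200000])

    def forward(self, tokens, targets, state=None):
        hidden, state = self.lstm(self.embed(tokens), state)
        out = self.softmax(hidden.reshape(-1, HIDDEN), targets.reshape(-1))
        return out.loss, state

model = LstmLM()
opt = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

tokens = torch.randint(0, VOCAB, (BATCH, BPTT))    # dummy batch, unrolled 20 steps
targets = torch.randint(0, VOCAB, (BATCH, BPTT))
loss, state = model(tokens, targets)
opt.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), 5.0)  # clip gradients at 5.0
opt.step()
```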
  • We use the default ResNet architectures (He et al., 2016) in a public PyTorch re-implementation.