# Towards Theoretically Understanding Why SGD Generalizes Better Than Adam in Deep Learning

NeurIPS 2020.

Abstract:

It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide understandings on this generalization gap by analyzing their local convergence behaviors. Specifically, we observe the heavy tails of gradient noise in these…

Introduction

- Stochastic gradient descent (SGD) [3, 4] has become one of the most popular algorithms for training deep neural networks [5,6,7,8,9,10,11].
- In spite of its simplicity and effectiveness, SGD uses a single learning rate for all gradient coordinates and can suffer from unsatisfactory convergence, especially on ill-conditioned problems [12]
- To avoid this issue, a variety of adaptive gradient algorithms have been developed that adjust the learning rate for each gradient coordinate according to the current geometric curvature of the objective function [13,14,15,16].
- Recent evidence [2, 22] shows that for deep neural networks, the minima at asymmetric valleys, where both steep and flat directions exist, can also generalize well, and that SGD often converges to such minima

Highlights

- These algorithms, especially ADAM, achieve much faster convergence than vanilla SGD in practice. Despite this faster convergence, adaptive gradient algorithms usually suffer from worse generalization performance than SGD [12, 17, 18]
- One empirical explanation [1, 19,20,21] for this generalization gap is that adaptive gradient algorithms tend to converge to sharp minima whose local basin has large curvature and usually generalize poorly, while SGD prefers to find flat minima and generalizes better
- By looking into the local convergence behaviors of the Lévy-driven stochastic differential equations (SDEs) of these algorithms through analyzing their escaping time, we prove that for the same basin, SGD has smaller escaping time than ADAM and tends to converge to flatter minima whose local basins have larger Radon measure, explaining its better generalization performance
- This work theoretically analyzes a fundamental problem in deep learning field, namely the generalization gap between adaptive gradient algorithms and SGD, and reveals the essential reasons for the generalization degeneration of adaptive algorithms
- The established theoretical understanding of these algorithms may inspire new algorithms with both fast convergence and good generalization, reducing the demand for computational resources while achieving state-of-the-art results
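The escaping-time claim above can be illustrated with a toy experiment. The sketch below is a 1-D Euler discretization of gradient descent on a quadratic basin driven by SαS noise, not the paper's actual Lévy-driven SDE, and the basin half-width is only a crude stand-in for the basin's Radon measure; still, it shows the qualitative effect that the mean first-exit time grows with basin size, so heavy-tailed dynamics linger longest in, and hence tend to settle into, large basins:

```python
import numpy as np

def sas(alpha, rng):
    # One symmetric alpha-stable draw via the Chambers-Mallows-Stuck
    # method (valid for alpha != 1).
    v = rng.uniform(-np.pi / 2, np.pi / 2)
    w = rng.exponential(1.0)
    return (np.sin(alpha * v) / np.cos(v) ** (1 / alpha)
            * (np.cos(v - alpha * v) / w) ** ((1 - alpha) / alpha))

def escape_time(width, alpha=1.5, eps=1e-3, max_steps=50_000, rng=None):
    """First exit time of theta_{k+1} = theta_k - eps*theta_k + eps^{1/alpha}*xi_k
    from the basin (-width, width) of the quadratic loss f(theta) = theta^2 / 2."""
    theta = 0.0
    for k in range(max_steps):
        theta = theta - eps * theta + eps ** (1 / alpha) * sas(alpha, rng)
        if abs(theta) >= width:
            return k + 1
    return max_steps

rng = np.random.default_rng(0)
trials = 100
narrow = np.mean([escape_time(1.0, rng=rng) for _ in range(trials)])
wide = np.mean([escape_time(3.0, rng=rng) for _ in range(trials)])
print(f"mean exit time: narrow basin = {narrow:.0f}, wide basin = {wide:.0f}")
```

With heavy-tailed noise, exits are dominated by single large jumps rather than gradual diffusion, so the exit time is governed by how rare a jump big enough to clear the basin is, which grows with the basin's size.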

Methods

- The authors first investigate the gradient noise in ADAM and SGD, and show their iteration-based convergence behaviors to verify the implications of the escaping theory.
- Fig. 1 in Sec. 1 and Fig. 4 in Appendix B show that the gradient noise in both SGD and ADAM usually exhibits heavy tails and can be well characterized by an SαS distribution.
- This supports the heavy-tailed assumption on the gradient noise made in the theory
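The SαS (symmetric α-stable) characterization can be made concrete with a small simulation. The sketch below assumes nothing beyond NumPy; the Chambers-Mallows-Stuck sampler is a standard way to draw symmetric α-stable variates. It contrasts the tail mass of SαS noise with α = 1.5 against Gaussian noise (the α = 2 case), showing why "heavy tails" matter: far more probability mass sits many deviations from zero.

```python
import numpy as np

def sas_samples(alpha, size, rng):
    """Draw symmetric alpha-stable (SaS) samples via the
    Chambers-Mallows-Stuck method (valid for alpha != 1)."""
    v = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * v) / np.cos(v) ** (1 / alpha)
            * (np.cos(v - alpha * v) / w) ** ((1 - alpha) / alpha))

rng = np.random.default_rng(0)
n = 200_000
heavy = sas_samples(1.5, n, rng)   # alpha < 2: power-law (heavy) tails
gauss = rng.standard_normal(n)     # alpha = 2 recovers the Gaussian

# Heavy-tailed noise puts far more mass beyond a few "sigmas".
thr = 5.0
p_heavy = np.mean(np.abs(heavy) > thr)
p_gauss = np.mean(np.abs(gauss) > thr)
print(f"P(|X| > {thr}): SaS(1.5) = {p_heavy:.4f}, Gaussian = {p_gauss:.6f}")
```

For α < 2 the tail decays like a power law, P(|X| > x) ~ x^(−α), whereas the Gaussian tail decays like exp(−x²/2), which is the gap the escape-time theory exploits.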

Conclusion

- The authors analyzed the generalization performance degeneration of ADAM-alike adaptive algorithms over SGD.
- By looking into the local convergence behaviors of the Lévy-driven SDEs of these algorithms through analyzing their escaping time, the authors prove that for the same basin, SGD has smaller escaping time than ADAM and tends to converge to flatter minima whose local basins have larger Radon measure, explaining its better generalization performance.
- More effort is still needed to turn these insights into the design of practical algorithms


Related work

- Adaptive gradient algorithms have become the default optimization tools in deep learning because of their fast convergence, but they often suffer from worse generalization performance than SGD [12, 17, 30, 31]. Most works [12, 17, 18, 30, 31] analyze this issue empirically via the flat/sharp-minima argument of [19], which holds that flat minima, defined through local curvature, generalize better than sharp ones; they observed that SGD often converges to flatter minima than adaptive gradient algorithms such as ADAM.
- However, Sagun et al. [22] and He et al. [2] observed that minima of modern deep networks at asymmetric valleys, where both steep and flat directions exist, also generalize well, and that SGD often converges to these minima. The conventional flat/sharp argument cannot explain these new results. This work theoretically shows that SGD tends to converge to minima whose local basins have larger Radon measure, which explains the above observations, since minima with larger Radon measure are often located in flat and asymmetric basins/valleys. Moreover, based on these results, exploring a Radon measure invariant to parameter scaling in networks could resolve the issue in [32] that flat minima can become sharp under parameter scaling. See more details in Appendix C.
- Note that ADAM can outperform SGD when gradient clipping is required [33], e.g. for attention models with gradient-explosion issues, since the adaptation in ADAM provides a clipping effect. This work considers the general non-gradient-exploding setting, which is more practical across many important tasks, e.g. classification.
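The "larger Radon measure" criterion can be illustrated with a toy 1-D computation (the two loss functions below are hypothetical, not from the paper). The Radon measure of a basin is here simply the Lebesgue measure of the sublevel set {θ : f(θ) ≤ h} around the minimum; an asymmetric valley with one flat direction has far larger measure than a sharp quadratic basin, even though its steepest curvature is identical:

```python
import numpy as np

# Two toy 1-D basins with minima at 0: a sharp quadratic, and an
# asymmetric valley that is steep on one side and flat on the other.
sharp = lambda t: 25.0 * t ** 2
asym = lambda t: np.where(t < 0, 25.0 * t ** 2, 0.25 * t ** 2)

def basin_measure(f, h=1.0, grid=np.linspace(-3, 3, 60001)):
    """Lebesgue (Radon) measure of the sublevel set {theta : f(theta) <= h}
    around the minimum, approximated on a uniform grid."""
    inside = f(grid) <= h
    return inside.sum() * (grid[1] - grid[0])

m_sharp = basin_measure(sharp)  # analytically 2*sqrt(1/25) = 0.4
m_asym = basin_measure(asym)    # analytically sqrt(1/25) + sqrt(1/0.25) = 2.2
print(f"basin measure: sharp = {m_sharp:.2f}, asymmetric = {m_asym:.2f}")
```

This is why the measure-based criterion can rank an asymmetric valley as "large" where a curvature-only notion of flatness would call it sharp along its steep direction.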

Reference

- N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. Int’l Conf. Learning Representations, 2017.
- H. He, G. Huang, and Y. Yuan. Asymmetric valleys: Beyond sharp and flat local minima. In Proc. Conf. Neural Information Processing Systems, 2019.
- H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
- L. Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nımes, 91(8):12, 1991.
- Y. Bengio. Learning deep architectures for AI. Foundations and trends® in Machine Learning, 2(1):1–127, 2009.
- G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.
- Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- P. Zhou, Y. Hou, and J. Feng. Deep adversarial subspace clustering. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 770–778, 2016.
- P. Zhou, X. Yuan, H. Xu, S. Yan, and J. Feng. Efficient meta learning via minibatch proximal update. In Proc. Conf. Neural Information Processing Systems, 2019.
- P. Zhou, C. Xiong, R. Socher, and S. Hoi. Theory-inspired path-regularized differential network architecture search. In Proc. Conf. Neural Information Processing Systems, 2019.
- N. Keskar and R. Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
- J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. of Machine Learning Research, 12(Jul):2121–2159, 2011.
- T. Tieleman and G. Hinton. Lecture 6.5—rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Int’l Conf. Learning Representations, 2014.
- S. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
- A. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Proc. Conf. Neural Information Processing Systems, pages 4148–4158, 2017.
- L. Luo, Y. Xiong, Y. Liu, and X. Sun. Adaptive gradient methods with dynamic bound of learning rate. In Int’l Conf. Learning Representations, 2019.
- S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
- P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
- H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. In Proc. Conf. Neural Information Processing Systems, pages 6389–6399, 2018.
- L. Sagun, U. Evci, V. Guney, Y. Dauphin, and L. Bottou. Empirical analysis of the hessian of overparametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
- U. Simsekli, L. Sagun, and M. Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In Proc. Int’l Conf. Machine Learning, 2019.
- P. Lévy. Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris, 1937.
- S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. In Proc. Int’l Conf. Machine Learning, pages 354–363, 2016.
- S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
- P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop, pages 1–10. IEEE, 2018.
- I. Pavlyukevich. Cooling down Lévy flights. Journal of Physics A: Mathematical and Theoretical, 40(41), 2007.
- I. Pavlyukevich. First exit times of solutions of stochastic differential equations driven by multiplicative lévy noise with heavy tails. Stochastics and Dynamics, 11(02n03):495–519, 2011.
- S. Merity, N. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
- I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. In Proc. Int’l Conf. Machine Learning, pages 1019–1028, 2017.
- J. Zhang, S. Karimireddy, A. Veit, S. Kim, S. Reddi, S. Kumar, and S. Sra. Why adam beats sgd for attention models. arXiv preprint arXiv:1912.03194, 2019.
- Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. In Proc. Int’l Conf. Machine Learning, 2019.
- S. Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
- P. Imkeller, I. Pavlyukevich, and T. Wetzel. The hierarchy of exit times of lévy-driven langevin equations. The European Physical Journal Special Topics, 191(1):211–222, 2010.
- Q. Li, C. Tai, and W. E. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proc. Int’l Conf. Machine Learning, pages 2101–2110, 2017.
- L. Simon. Lectures on geometric measure theory. Centre for Mathematical Analysis, The Australian National University, 1983.
- S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- Pan Zhou, Xiaotong Yuan, and Jiashi Feng. Efficient stochastic gradient hard thresholding. In Proc. Conf. Neural Information Processing Systems, 2018.
- R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Conf. Neural Information Processing Systems, pages 315–323, 2013.
- P. Zhou, X. Yuan, and J. Feng. New insight into hybrid stochastic gradient descent: Beyond withreplacement sampling and convexity. In Proc. Conf. Neural Information Processing Systems, 2018.
- P. Zhou, X. Yuan, and J. Feng. Faster first-order methods for stochastic non-convex optimization on riemannian manifolds. In Int’l Conf. Artificial Intelligence and Statistics, 2019.
- P. Zhou and X. Tong. Hybrid stochastic-deterministic minibatch proximal gradient: Less-than-single-pass optimization with nearly optimal generalization. In Proc. Int’l Conf. Machine Learning, 2020.
- S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In Proc. Int’l Conf. Machine Learning, 2019.
- Y. Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In Proc. Int’l Conf. Machine Learning, pages 3404–3413, 2017.
- P. Zhou and J. Feng. Understanding generalization and optimization performance of deep cnns. In Proc. Int’l Conf. Machine Learning, 2018.
- P. Zhou and J. Feng. Empirical risk landscape analysis for understanding deep neural networks. In Int’l Conf. Learning Representations, 2018.
- L. Wu, C. Ma, and W. E. How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective. In Proc. Conf. Neural Information Processing Systems, pages 8279–8288, 2018.
- A. Anandkumar and R. Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Conf. on Learning Theory, pages 81–102, 2016.
- Y. Wu and K. He. Group normalization. In Proc. European Conf. Computer Vision, pages 3–19, 2018.
- A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Conf. Neural Information Processing Systems, pages 1097–1105, 2012.
- M. Mohammadi, A. Mohammadpour, and H. Ogata. On estimating the tail index and the spectral measure of multivariate α-stable distributions. Metrika, 78(5):549–561, 2015.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient based learning applied to document recognition. Proceedings of the IEEE, page 2278–2324, 1998.
- A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
- A. Bishop and P. Del Moral. Stability properties of systems of linear stochastic differential equations with random coefficients. SIAM Journal on Control and Optimization, 57(2):1023–1042, 2019.
- A. Kohatsu-Higa, J. León, and D. Nualart. Stochastic differential equations with random coefficients. Bernoulli, 3(2):233–245, 1997.
- Y. Fang and K. Loparo. Stabilization of continuous-time jump linear systems. IEEE Transactions on Automatic Control, 47(10):1590–1603, 2002.
- Andrew EB Lim and Xun Yu Zhou. Mean-variance portfolio selection with random parameters in a complete market. Mathematics of Operations Research, 27(1):101–120, 2002.
- Stephen J Turnovsky. Optimal stabilization policies for deterministic and stochastic linear economic systems. The Review of Economic Studies, 40(1):79–95, 1973.
- Jawahar Lal Tiwari and John E Hobbie. Random differential equations as models of ecosystems: Monte carlo simulation approach. Mathematical Biosciences, 28(1-2):25–44, 1976.
- Chris P Tsokos and William J Padgett. Random integral equations with applications to life sciences and engineering. Academic Press, 1974.
- Brad A Finney, David S Bowles, and Michael P Windham. Random differential equations in river water quality modeling. Water resources research, 18(1):122–134, 1982.
- T. Gronwall. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Annals of Mathematics, pages 292–296, 1919.
- A. Papapantoleon. An introduction to lévy processes with applications in finance. arXiv preprint arXiv:0804.0482, 2008.
- O. Kallenberg. Foundations of modern probability. Springer Science & Business Media, 2006.
