Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

NeurIPS 2020.

We provide a new viewpoint for understanding the generalization performance gap

Abstract:

It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide an understanding of this generalization gap by analyzing their local convergence behaviors. Specifically, we observe the heavy tails of gradient noise in these algorithms ...

Introduction
  • Stochastic gradient descent (SGD) [3, 4] has become one of the most popular algorithms for training deep neural networks [5,6,7,8,9,10,11].
  • In spite of its simplicity and effectiveness, SGD uses one learning rate for all gradient coordinates and could suffer from unsatisfactory convergence performance, especially for ill-conditioned problems [12]
  • To avoid this issue, a variety of adaptive gradient algorithms have been developed that adjust the learning rate of each gradient coordinate according to the local geometry (curvature) of the objective function [13,14,15,16] (a minimal update-rule sketch follows this list).
  • Recent evidence [2, 22] shows that, for deep neural networks, (1) the minima located at asymmetric valleys, where both steep and flat directions exist, can also generalize well, and (2) SGD often converges to these minima.
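To make the per-coordinate adaptation concrete, here is a minimal NumPy sketch contrasting one SGD step (a single shared learning rate) with one ADAM step (an effective step size that differs per coordinate). The hyper-parameters, the toy ill-conditioned quadratic, and the function names are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.015):
    """Vanilla SGD: one shared learning rate for every coordinate."""
    return theta - lr * grad

def adam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """ADAM: the effective step size lr / (sqrt(v_hat) + eps) adapts per coordinate."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # 1st-moment (mean) estimate
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # 2nd-moment (scale) estimate
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias corrections
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy ill-conditioned quadratic f(theta) = 0.5 * sum(h_i * theta_i^2), curvatures 100 and 1.
h = np.array([100.0, 1.0])
theta_sgd, theta_adam = np.array([1.0, 1.0]), np.array([1.0, 1.0])
state = {"m": np.zeros(2), "v": np.zeros(2), "t": 0}
for _ in range(100):
    theta_sgd = sgd_step(theta_sgd, h * theta_sgd)      # shared lr is capped by the stiff coordinate
    theta_adam = adam_step(theta_adam, h * theta_adam, state)
print("SGD: ", theta_sgd)    # the low-curvature coordinate decays slowly
print("ADAM:", theta_adam)   # per-coordinate adaptation drives both coordinates toward 0
```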
Highlights
  • Stochastic gradient descent (SGD) [3, 4] has become one of the most popular algorithms for training deep neural networks [5,6,7,8,9,10,11]
  • These algorithms, especially ADAM, achieve much faster convergence than vanilla SGD in practice. Despite this faster convergence, adaptive gradient algorithms usually suffer from worse generalization performance than SGD [12, 17, 18]
  • One empirical explanation [1, 19,20,21] for this generalization gap is that adaptive gradient algorithms tend to converge to sharp minima, whose local basins have large curvature and which usually generalize poorly, while SGD prefers flat minima and generalizes better
  • By looking into the local convergence behaviors of the Lévy-driven stochastic differential equations (SDEs) of these algorithms through analyzing their escaping time, we prove that for the same basin SGD has a smaller escaping time than ADAM; SGD thus escapes sharp basins (those with small Radon measure) more easily and tends to converge to flatter minima whose local basins have larger Radon measure, explaining its better generalization performance (a toy escaping-time simulation follows this list)
  • This work theoretically analyzes a fundamental problem in deep learning, namely the generalization gap between adaptive gradient algorithms and SGD, and reveals the essential reasons for the generalization degradation of adaptive algorithms
  • The established theoretical understanding of these algorithms may inspire new algorithms with both fast convergence and good generalization, which would reduce the demand for computational resources while achieving state-of-the-art results
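As a numerical illustration of the escaping-time mechanism (not the paper's actual ADAM/SGD SDEs), the toy sketch below simulates a one-dimensional process in a quadratic basin driven either by Gaussian noise (α = 2) or by heavy-tailed SαS noise (α < 2), and records the first exit time. The SDE, basin width, noise level, and step size are assumptions chosen only to show the qualitative effect: heavier-tailed driving noise escapes the same basin far sooner.

```python
import numpy as np
from scipy.stats import levy_stable

def first_exit_time(alpha, width=0.9, eps=0.3, dt=1e-2, max_steps=200_000, seed=0):
    """First exit time of the toy SDE d(theta) = -theta dt + eps dL_t from [-width, width].
    alpha = 2 gives Brownian driving noise; alpha < 2 gives heavy-tailed SaS noise."""
    rng = np.random.default_rng(seed)
    # Pre-sample SaS increments; alpha-stable self-similarity gives the dt**(1/alpha) scaling.
    jumps = levy_stable.rvs(alpha, 0.0, scale=dt ** (1.0 / alpha),
                            size=max_steps, random_state=rng)
    theta = 0.0
    for k in range(max_steps):
        theta += -theta * dt + eps * jumps[k]
        if abs(theta) > width:        # escaped the basin
            return (k + 1) * dt
    return np.inf                     # did not escape within the simulated horizon

for alpha in (2.0, 1.2):              # Gaussian tails vs heavy SaS tails
    times = [first_exit_time(alpha, seed=s) for s in range(10)]
    print(f"alpha = {alpha}: mean first exit time ~ {np.mean(times):.1f}")
```

With these illustrative settings the Gaussian-driven process typically takes an order of magnitude longer to leave the basin than the α = 1.2 process; re-running with a smaller `width` shortens the exit times of both, in line with the positive dependence of escaping time on basin size.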
Methods
  • The authors first investigate the gradient noise in ADAM and SGD, and show their iteration-based convergence behaviors to verify the implications of the escaping theory.
  • Fig. 1 in Sec. 1 and Fig. 4 in Appendix B show that the gradient noise in both SGD and ADAM usually exhibits heavy tails and can be well characterized by a symmetric α-stable (SαS) distribution.
  • This supports the heavy-tail assumption on the gradient noise made in the theory (a tail-index estimation sketch follows this list)
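For concreteness, below is a simplified sketch of how such a tail index α can be estimated from gradient-noise samples, using the block-sum log-moment estimator of Mohammadi et al. [53] that [23] popularized for gradient noise. Whether the paper's figures use exactly this estimator is an assumption, and the synthetic arrays merely stand in for flattened minibatch-gradient-noise samples.

```python
import numpy as np
from scipy.stats import levy_stable

def estimate_alpha(x, k1=100):
    """Block-sum tail-index estimator for SaS data (Mohammadi et al. [53]):
    sums of k1 SaS terms scale like k1**(1/alpha), so
    1/alpha ~= (mean(log|block sums|) - mean(log|x|)) / log(k1)."""
    x = np.asarray(x).ravel()
    k2 = len(x) // k1
    x = x[: k1 * k2]
    blocks = x.reshape(k2, k1).sum(axis=1)
    inv_alpha = (np.mean(np.log(np.abs(blocks))) -
                 np.mean(np.log(np.abs(x)))) / np.log(k1)
    return 1.0 / inv_alpha

# Sanity check on synthetic noise (hypothetical stand-ins for per-coordinate gradient noise,
# i.e. stochastic gradient minus full gradient, collected over many minibatches):
rng = np.random.default_rng(0)
light = rng.standard_normal(100_000)                                # Gaussian: alpha = 2
heavy = levy_stable.rvs(1.4, 0.0, size=100_000, random_state=rng)   # SaS with alpha = 1.4
print(estimate_alpha(light))   # close to 2.0
print(estimate_alpha(heavy))   # close to 1.4
```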
Conclusion
  • The authors analyzed the generalization degradation of ADAM-alike adaptive algorithms relative to SGD.
  • By looking into the local convergence behaviors of the Lévy-driven SDEs of these algorithms through analyzing their escaping time, the authors prove that for the same basin SGD has a smaller escaping time than ADAM; SGD thus escapes sharp basins (those with small Radon measure) more easily and tends to converge to flatter minima whose local basins have larger Radon measure, explaining its better generalization performance.
  • More effort is still needed to turn these insights into the design of practical algorithms
Summary
  • Introduction:

    Stochastic gradient descent (SGD) [3, 4] has become one of the most popular algorithms for training deep neural networks [5,6,7,8,9,10,11].
  • In spite of its simplicity and effectiveness, SGD uses one learning rate for all gradient coordinates and could suffer from unsatisfactory convergence performance, especially for ill-conditioned problems [12]
  • To avoid this issue, a variety of adaptive gradient algorithms have been developed that adjust the learning rate of each gradient coordinate according to the local geometry (curvature) of the objective function [13,14,15,16].
  • Recent evidence [2, 22] shows that, for deep neural networks, (1) the minima located at asymmetric valleys, where both steep and flat directions exist, can also generalize well, and (2) SGD often converges to these minima.
  • Objectives:

    This work aims to provide an understanding of this generalization gap by analyzing the algorithms' local convergence behaviors.
  • Methods:

    The authors first investigate the gradient noise in ADAM and SGD, and show their iteration-based convergence behaviors to verify the implications of the escaping theory.
  • Fig. 1 in Sec. 1 and Fig. 4 in Appendix B show that the gradient noise in both SGD and ADAM usually exhibits heavy tails and can be well characterized by a symmetric α-stable (SαS) distribution.
  • This supports the heavy-tail assumption on the gradient noise made in the theory
  • Conclusion:

    The authors analyzed the generalization degradation of ADAM-alike adaptive algorithms relative to SGD.
  • By looking into the local convergence behaviors of the Lévy-driven SDEs of these algorithms through analyzing their escaping time, the authors prove that for the same basin SGD has a smaller escaping time than ADAM; SGD thus escapes sharp basins (those with small Radon measure) more easily and tends to converge to flatter minima whose local basins have larger Radon measure, explaining its better generalization performance.
  • More effort is still needed to turn these insights into the design of practical algorithms
Related work
  • Adaptive gradient algorithms have become the default optimization tools in deep learning because of their fast convergence speed, but they often suffer from worse generalization performance than SGD [12, 17, 30, 31]. Most works [12, 17, 18, 30, 31] empirically analyze this issue via the flat/sharp-minima argument of [19], defined on local curvature, namely that flat minima often generalize better than sharp ones, as they observed that SGD often converges to flatter minima than adaptive gradient algorithms such as ADAM. However, Sagun et al. [22] and He et al. [2] observed that the minima of modern deep networks located at asymmetric valleys, where both steep and flat directions exist, also generalize well, and SGD often converges to these minima; the conventional flat/sharp argument cannot explain these new results. This work theoretically shows that SGD tends to converge to minima whose local basins have larger Radon measure, which explains the above observations, since minima with larger Radon measure are often located in flat or asymmetric basins/valleys. Moreover, based on these results, exploring a Radon measure that is invariant to parameter scaling in networks could resolve the issue raised in [32] that flat minima can become sharp via parameter scaling (see the toy sketch below). See more details in Appendix C. Note that ADAM can achieve better performance than SGD when gradient clipping is required [33], e.g. for attention models with exploding gradients, since the adaptation in ADAM provides a clipping effect. This work considers the general non-gradient-exploding setting, as it is more practical across many important tasks, e.g. classification.
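To illustrate the parameter-scaling issue of [32] that motivates a scaling-robust quantity such as the basin's Radon measure, the toy NumPy sketch below rescales a two-layer ReLU network as (c·W1, W2/c): the network outputs (and hence its generalization) are unchanged, while a Hessian-based sharpness measure scales with c². The architecture, loss, and sharpness proxy are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))                  # toy inputs
W1 = rng.standard_normal((10, 20))
W2 = rng.standard_normal((20, 1))

def forward(W1, W2, X):
    """Two-layer ReLU network f(X) = relu(X @ W1) @ W2."""
    return np.maximum(X @ W1, 0.0) @ W2

def sharpness_wrt_W2(W1, X):
    """Largest eigenvalue of the Hessian of the loss 0.5 * mean_i f(x_i)^2 w.r.t. W2.
    For this quadratic-in-W2 loss the Hessian block is exactly h.T @ h / n."""
    h = np.maximum(X @ W1, 0.0)
    return float(np.linalg.eigvalsh(h.T @ h / len(X)).max())

for c in (1.0, 10.0, 0.1):
    # ReLU positive homogeneity: (c*W1, W2/c) computes exactly the same function ...
    same = np.allclose(forward(c * W1, W2 / c, X), forward(W1, W2, X))
    # ... but curvature-based sharpness rescales by c**2, so "flat" can be made "sharp" [32].
    print(f"c = {c:>4}: outputs unchanged: {same}, "
          f"top Hessian eigenvalue w.r.t. W2: {sharpness_wrt_W2(c * W1, X):.3g}")
```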
References
  • N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. Int'l Conf. Learning Representations, 2017.
  • H. He, G. Huang, and Y. Yuan. Asymmetric valleys: Beyond sharp and flat local minima. In Proc. Conf. Neural Information Processing Systems, 2019.
  • H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • L. Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8):12, 1991.
  • Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
  • G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 2012.
  • Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • P. Zhou, Y. Hou, and J. Feng. Deep adversarial subspace clustering. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • P. Zhou, X. Yuan, H. Xu, S. Yan, and J. Feng. Efficient meta learning via minibatch proximal update. In Proc. Conf. Neural Information Processing Systems, 2019.
  • P. Zhou, C. Xiong, R. Socher, and S. Hoi. Theory-inspired path-regularized differential network architecture search. In Proc. Conf. Neural Information Processing Systems, 2019.
  • N. Keskar and R. Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
  • J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • T. Tieleman and G. Hinton. Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Int'l Conf. Learning Representations, 2014.
  • S. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
  • A. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Proc. Conf. Neural Information Processing Systems, pages 4148–4158, 2017.
  • L. Luo, Y. Xiong, Y. Liu, and X. Sun. Adaptive gradient methods with dynamic bound of learning rate. In Int'l Conf. Learning Representations, 2019.
  • S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
  • P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
  • H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. In Proc. Conf. Neural Information Processing Systems, pages 6389–6399, 2018.
  • L. Sagun, U. Evci, V. Guney, Y. Dauphin, and L. Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
  • U. Simsekli, L. Sagun, and M. Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In Proc. Int'l Conf. Machine Learning, 2019.
  • P. Lévy. Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris, 1937.
  • S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. In Proc. Int'l Conf. Machine Learning, pages 354–363, 2016.
  • S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
  • P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop, pages 1–10. IEEE, 2018.
  • I. Pavlyukevich. Cooling down Lévy flights. Journal of Physics A: Mathematical and Theoretical, 40(41), 2007.
  • I. Pavlyukevich. First exit times of solutions of stochastic differential equations driven by multiplicative Lévy noise with heavy tails. Stochastics and Dynamics, 11(02n03):495–519, 2011.
  • S. Merity, N. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
  • I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. In Proc. Int'l Conf. Machine Learning, pages 1019–1028, 2017.
  • J. Zhang, S. Karimireddy, A. Veit, S. Kim, S. Reddi, S. Kumar, and S. Sra. Why Adam beats SGD for attention models. arXiv preprint arXiv:1912.03194, 2019.
  • Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. In Proc. Int'l Conf. Machine Learning, 2019.
  • S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
  • P. Imkeller, I. Pavlyukevich, and T. Wetzel. The hierarchy of exit times of Lévy-driven Langevin equations. The European Physical Journal Special Topics, 191(1):211–222, 2010.
  • Q. Li, C. Tai, and W. E. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proc. Int'l Conf. Machine Learning, pages 2101–2110, 2017.
  • L. Simon. Lectures on geometric measure theory. Centre for Mathematical Analysis, The Australian National University, 1983.
  • S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • P. Zhou, X. Yuan, and J. Feng. Efficient stochastic gradient hard thresholding. In Proc. Conf. Neural Information Processing Systems, 2018.
  • R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Conf. Neural Information Processing Systems, pages 315–323, 2013.
  • P. Zhou, X. Yuan, and J. Feng. New insight into hybrid stochastic gradient descent: Beyond with-replacement sampling and convexity. In Proc. Conf. Neural Information Processing Systems, 2018.
  • P. Zhou, X. Yuan, and J. Feng. Faster first-order methods for stochastic non-convex optimization on Riemannian manifolds. In Int'l Conf. Artificial Intelligence and Statistics, 2019.
  • P. Zhou and X. Tong. Hybrid stochastic-deterministic minibatch proximal gradient: Less-than-single-pass optimization with nearly optimal generalization. In Proc. Int'l Conf. Machine Learning, 2020.
  • S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In Proc. Int'l Conf. Machine Learning, 2019.
  • Y. Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In Proc. Int'l Conf. Machine Learning, pages 3404–3413, 2017.
  • P. Zhou and J. Feng. Understanding generalization and optimization performance of deep CNNs. In Proc. Int'l Conf. Machine Learning, 2018.
  • P. Zhou and J. Feng. Empirical risk landscape analysis for understanding deep neural networks. In Int'l Conf. Learning Representations, 2018.
  • L. Wu, C. Ma, and W. E. How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. In Proc. Conf. Neural Information Processing Systems, pages 8279–8288, 2018.
  • A. Anandkumar and R. Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Conf. on Learning Theory, pages 81–102, 2016.
  • Y. Wu and K. He. Group normalization. In Proc. European Conf. Computer Vision, pages 3–19, 2018.
  • A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. Conf. Neural Information Processing Systems, pages 1097–1105, 2012.
  • M. Mohammadi, A. Mohammadpour, and H. Ogata. On estimating the tail index and the spectral measure of multivariate α-stable distributions. Metrika, 78(5):549–561, 2015.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278–2324, 1998.
  • A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
  • A. Bishop and P. Del Moral. Stability properties of systems of linear stochastic differential equations with random coefficients. SIAM Journal on Control and Optimization, 57(2):1023–1042, 2019.
  • A. Kohatsu-Higa, J. León, and D. Nualart. Stochastic differential equations with random coefficients. Bernoulli, 3(2):233–245, 1997.
  • Y. Fang and K. Loparo. Stabilization of continuous-time jump linear systems. IEEE Transactions on Automatic Control, 47(10):1590–1603, 2002.
  • A. Lim and X. Zhou. Mean-variance portfolio selection with random parameters in a complete market. Mathematics of Operations Research, 27(1):101–120, 2002.
  • S. Turnovsky. Optimal stabilization policies for deterministic and stochastic linear economic systems. The Review of Economic Studies, 40(1):79–95, 1973.
  • J. Tiwari and J. Hobbie. Random differential equations as models of ecosystems: Monte Carlo simulation approach. Mathematical Biosciences, 28(1-2):25–44, 1976.
  • C. Tsokos and W. Padgett. Random integral equations with applications to life sciences and engineering. Academic Press, 1974.
  • B. Finney, D. Bowles, and M. Windham. Random differential equations in river water quality modeling. Water Resources Research, 18(1):122–134, 1982.
  • T. Gronwall. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Annals of Mathematics, pages 292–296, 1919.
  • A. Papapantoleon. An introduction to Lévy processes with applications in finance. arXiv preprint arXiv:0804.0482, 2008.
  • O. Kallenberg. Foundations of modern probability. Springer Science & Business Media, 2006.