# Robustness Analysis of Non-Convex Stochastic Gradient Descent using Biased Expectations

NIPS 2020

Abstract

This work proposes a novel analysis of stochastic gradient descent (SGD) for non-convex and smooth optimization. Our analysis sheds light on the impact of the probability distribution of the gradient noise on the convergence rate of the norm of the gradient. In the case of sub-Gaussian and centered noise, we prove that, with probability 1...

Introduction

- Stochastic Gradient Descent (SGD) and its variants (Adam [1], RMSProp [2], or Nesterov’s accelerated gradient descent [3]) are used in a wide variety of tasks to train Machine Learning models.
- Several authors explored these frameworks by adapting the tools developed in convex analysis to the non-convex setting in order to explain these phenomena [15, 18, 19, 20, 21, 22, 23, 24, 25].
- None of these works proposed a unified framework able to handle both bounded and heavy-tailed noise.

Highlights

- While stochastic gradient descent (SGD) is known to be robust in practice and its convergence behavior is well understood in the convex setting [11, 12, 3, 13], many of its properties are not yet fully understood, particularly in settings related to Deep Learning practice, where gradients can be extremely noisy and the target function presents many local optima
- We focus on stochastic gradient descent (SGD), a simple yet efficient optimization algorithm widely used in the Machine Learning community to minimize the training loss, notably for the training of neural networks
- The choice of the step-size $\eta_t \propto t^{-1/b}$ does lead to the convergence rate of order $t^{(b-1)/b}$ exhibited in Theorem 17. The standard step-sizes $\eta_t \propto t^{-1/2}$ and $\eta_t \propto 1$ lead to suboptimal convergence rates, indicating that the choice $\eta_t \propto t^{-1/b}$ may be valuable for practitioners when the noise distribution is fat-tailed
- This paper proposed a novel unifying analysis of stochastic gradient descent in the noisy and nonconvex setting
- We showed that SGD is robust in the non-convex setting over a large panel of noise assumptions, including infinite variance heavy-tailed noises

Methods

- The authors illustrate the practical implications of the results obtained in the paper.
- Protocol: to illustrate the theoretical results (e.g. Theorem 17), the authors considered the noisy gradient approximation $G_t = \nabla f + X_t$, where $X_t$ follows a heavy-tailed (Student's t) noise distribution of tail index $b = 1.5$.
- The authors computed the empirical quantiles, expectations and biased expectations of the series of random variables $\frac{1}{t}\sum_{i=1}^{t} \|\nabla f(x_i)\|^2$.
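The protocol above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' experimental code: the toy objective `f(x) = x^2 / (1 + x^2)`, the starting point `x0`, the base step-size `eta0`, the tail index `b = 1.5` and all function names are hypothetical choices, and only the standard library is used. The step-size schedule is passed in as a function so that different choices can be compared.

```python
import math
import random
import statistics


def grad_f(x):
    """Gradient of the toy objective f(x) = x^2 / (1 + x^2) (smooth, non-convex)."""
    d = 1.0 + x * x
    return 2.0 * x / (d * d)


def student_t(rng, b):
    """Draw one Student's t sample with b degrees of freedom (tail index b)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(b / 2.0, 2.0)  # chi-squared with b degrees of freedom
    return z / math.sqrt(v / b)


def heavy_tail_schedule(t, b):
    """Step-size factor proportional to t^(-1/b)."""
    return t ** (-1.0 / b)


def run_sgd(n_steps, b, rng, schedule=heavy_tail_schedule, eta0=0.1, x0=1.0):
    """Run SGD with G_t = grad f(x_t) + X_t and return the running
    averages (1/t) * sum_{i<=t} |grad f(x_i)|^2 for t = 1..n_steps."""
    x, total, averages = x0, 0.0, []
    for t in range(1, n_steps + 1):
        g = grad_f(x)
        total += g * g
        averages.append(total / t)
        noisy_grad = g + student_t(rng, b)  # heavy-tailed gradient noise
        x -= eta0 * schedule(t, b) * noisy_grad
    return averages


def summarize(n_runs=200, n_steps=1000, b=1.5, seed=0):
    """Empirical mean, median and an order-statistic estimate of the 90% quantile
    of the final running average over independent runs."""
    rng = random.Random(seed)
    finals = sorted(run_sgd(n_steps, b, rng)[-1] for _ in range(n_runs))
    return {
        "mean": statistics.fmean(finals),
        "median": statistics.median(finals),
        "q90": finals[int(0.9 * (n_runs - 1))],
    }


if __name__ == "__main__":
    print(summarize())
```

Passing `schedule=lambda t, b: t ** -0.5` or `schedule=lambda t, b: 1.0` to `run_sgd` gives the two standard step-sizes for comparison.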

Results

- Results are displayed in Figure 3.
- These results show several aspects of the experiments:
- (1) The expectation of the average $\frac{1}{t}\sum_{i=1}^{t} \|\nabla f(x_i)\|^2$ reaches extremely large values compared to the values of the quantiles.
- (2) The choice of the step-size $\eta_t \propto t^{-1/b}$ does lead to the convergence rate of order $t^{(b-1)/b}$ exhibited in Theorem 17.
- (3) The standard step-sizes $\eta_t \propto t^{-1/2}$ and $\eta_t \propto 1$ lead to suboptimal convergence rates, indicating that the choice $\eta_t \propto t^{-1/b}$ may be valuable for practitioners when the noise distribution is fat-tailed.
- (4) Biased expectations $\mu_{-s}\big(\frac{1}{t}\sum_{i=1}^{t} \|\nabla f(x_i)\|^2\big)$ ...
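The gap between expectations and quantiles reflects a general property of heavy-tailed quantities: a handful of extreme draws can dominate the empirical mean while leaving the quantiles almost unchanged. The sketch below illustrates this with Pareto samples from Python's standard library. The tail index `0.75` (below 1, so the true mean is infinite), the sample size and the function name `mean_vs_quantiles` are illustrative assumptions and are not taken from the paper's experiments.

```python
import random
import statistics


def mean_vs_quantiles(n_samples=10_000, tail_index=0.75, seed=0):
    """Compare the empirical mean with the median and 90% quantile of Pareto samples."""
    rng = random.Random(seed)
    samples = sorted(rng.paretovariate(tail_index) for _ in range(n_samples))
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "q90": samples[int(0.9 * (n_samples - 1))],
        "max": samples[-1],
    }


if __name__ == "__main__":
    # The mean is pulled far above the median by a few very large samples.
    print(mean_vs_quantiles())
```

This is one reason why bounds on quantiles can remain informative in regimes where bounds in expectation are not.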

Conclusion

- This paper proposed a novel unifying analysis of stochastic gradient descent in the noisy and nonconvex setting.
- The authors introduced a novel operator, biased expectations, that provides powerful tools for stochastic analysis.
- Using this tool, the authors showed that SGD is robust in the non-convex setting over a large panel of noise assumptions, including infinite variance heavy-tailed noises.
- Based on the theoretical nature of the work, the authors do not believe a broader-impact section is applicable to the present contribution: its first goal is to provide some insights on a classical algorithm of the machine learning community, and it does not provide novel applications per se.

Summary

## Objectives:

This paper aims at filling this gap by providing a novel unified analysis of the convergence of SGD in a non-convex and noisy setting.

- Table 1: Examples of the constants satisfying Assumption 13 (assuming $f$ is $L$-Lipschitz) for different noise assumptions. All distributions were chosen so that $\mathrm{var}(X_t \mid \mathcal{F}_t) \le \sigma^2$.

Related work

Lower and upper bounds for first-order optimization in convex settings have been well studied and are well understood in the literature (see, e.g., [11, 12, 3, 13]). Here, we focus on the results related to non-convex settings, and more specifically on the complexity of finding an $\varepsilon$-stationary point (i.e. a point $x_t$ such that $\mathbb{E}[\|\nabla f(x_t)\|^2] \le \varepsilon$). First, several universal lower bounds have been provided for the convergence of any first-order algorithm [26, 27]. For the smooth and noiseless setting, [26] established that $\Omega(\varepsilon^{-1})$ gradient evaluations are necessary for finding $\varepsilon$-stationary points, and showed that this rate is achieved by gradient descent. For smooth objectives and bounded-variance noise, [27] went on to show that $\Omega(\varepsilon^{-2})$ noisy gradient evaluations are required to reach an $\varepsilon$-stationary point, proving as a byproduct that SGD is optimal with respect to this worst-case metric. With regard to the performance of SGD, [23] established an $O(\varepsilon^{-2})$ upper bound for the smooth, bounded-variance and light-tailed noise setting. Moreover, [28] went on to show that SGD itself cannot obtain a rate better than $\Omega(\varepsilon^{-2})$ in this noise setting, even for convex functions. For the smooth and heavy-tailed noise setting, [22] reports a complexity of $O(\varepsilon^{-b/(b-1)})$ for SGD, where $b > 1$ denotes the tail index, using a slightly different Hölder-smoothness assumption.

With regard to these works, our analysis recovers the standard results for SGD (Theorems 11 and 14) while extending the convergence rates to heavy-tailed noise (Theorem 17) using a single, unified analysis. In addition, we obtain novel results for biased noise (Theorem 12) as well as more generic bounds on quantiles instead of in expectation (Theorems 14 and 12), explaining the convergence of multi-start strategies as well as the case of infinite-variance noise.
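To give a feel for the complexity orders quoted above, the sketch below evaluates $\varepsilon^{-2}$ (bounded-variance noise) and $\varepsilon^{-b/(b-1)}$ (heavy-tailed noise with tail index $b > 1$) for a target precision. Constants hidden by the $O(\cdot)$ notation are ignored, so the numbers only indicate orders of magnitude; the function names are hypothetical.

```python
import math


def bounded_variance_order(eps):
    """Order of magnitude eps^(-2) for smooth objectives with bounded-variance noise."""
    return math.ceil(eps ** -2.0)


def heavy_tail_order(eps, b):
    """Order of magnitude eps^(-b/(b-1)) for heavy-tailed noise with tail index b > 1."""
    if b <= 1.0:
        raise ValueError("the tail index b must be greater than 1")
    return math.ceil(eps ** (-b / (b - 1.0)))


if __name__ == "__main__":
    # Smaller tail indices (heavier tails) give larger iteration counts.
    for b in (2.0, 1.5, 1.2):
        print(b, heavy_tail_order(0.1, b))
```

With `b = 2` the exponent $b/(b-1)$ equals 2 and matches the bounded-variance order; as `b` approaches 1 the exponent grows without bound.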
Finally, it is also worth mentioning that a recent line of work has been devoted to the design of algorithms that improve the convergence rate of SGD for non-convex problems using additional assumptions (we refer to [18] for a review). For instance, [29]

References

- [1] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
- [2] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
- [3] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
- [4] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
- [5] David G Kleinbaum, K Dietz, M Gail, Mitchel Klein, and Mitchell Klein. Logistic regression. Springer, 2002.
- [6] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- [7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
- [8] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399, 2018.
- [9] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
- [10] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
- [11] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186.
- [12] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
- [13] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.
- [14] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, and Suvrit Sra. Why ADAM Beats SGD for Attention Models. arXiv preprint arXiv:1912.03194, 2019.
- [15] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837, 2019.
- [16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
- [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- [18] Zeyuan Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex sgd. In Advances in Neural Information Processing Systems, pages 1157–1167, 2018.
- [19] Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp analysis for nonconvex sgd escaping from saddle points. arXiv preprint arXiv:1902.00247, 2019.
- [20] Mark Schmidt, Nicolas Le Roux, and Francis R Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, pages 1458–1466, 2011.
- [21] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for overparameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1195–1204. PMLR, 2019.
- [22] Umut Simsekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël Richard, and Levent Sagun. On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv preprint arXiv:1912.00018, 2019.
- [23] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [24] Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 983–992, 2019.
- [25] Koulik Khamaru and Martin J Wainwright. Convergence guarantees for a class of non-convex and non-smooth optimization problems. Journal of Machine Learning Research, 20(154):1–52, 2019.
- [26] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. Mathematical Programming, pages 1–50, 2019.
- [27] Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
- [28] Yoel Drori and Ohad Shamir. The complexity of finding stationary points with stochastic gradient descent. arXiv preprint arXiv:1910.01845, 2019.
- [29] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016.
- [30] Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd. In Advances in Neural Information Processing Systems, pages 15236–15245, 2019.
- [31] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal nonconvex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689–699, 2018.
- [32] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690, 2018.
- [33] Pascal Massart. Concentration inequalities and model selection, volume 6.
- [34] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
