# Gradient Descent Finds Global Minima of Deep Neural Networks

International Conference on Machine Learning, 2018.

Abstract:

Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure ...

Introduction

- One of the mysteries in deep learning is that randomly initialized first-order methods like gradient descent achieve zero training loss, even if the labels are arbitrary (Zhang et al, 2016).
- For a deep fully-connected network, the authors show that if the width satisfies m = Ω(poly(n) · 2^{O(H)}), where n is the number of data points and H is the number of layers, randomly initialized gradient descent converges to zero training loss at a linear rate.
- For ResNet, the authors show that as long as m = Ω(poly(n, H)), randomly initialized gradient descent converges to zero training loss at a linear rate.
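
The claim that gradient descent drives the training loss to zero on arbitrary labels can be tried out on a toy problem. The following is a minimal sketch, not the deep architectures analyzed in the paper: it assumes a two-layer network with a softplus activation, fixed random ±1 output weights, and only the hidden layer trained. The function name `train_two_layer`, the hand-built unit-norm inputs, the width `m=128`, and the step size `eta=0.5` are all illustrative choices, not values from the paper.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)); a smooth activation.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    # Derivative of softplus, written to avoid overflow for large |z|.
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_two_layer(n=5, m=128, steps=300, eta=0.5, seed=0):
    """Full-batch gradient descent on f(x) = (1/sqrt(m)) * sum_r a_r * softplus(w_r . x).

    Returns the squared-error training loss recorded before each update.
    """
    rng = random.Random(seed)
    d = n
    # Assumption: n distinct, non-parallel, unit-norm inputs built by hand.
    X = []
    for i in range(n):
        v = [0.2] * d
        v[i] += 1.0
        norm = math.sqrt(sum(c * c for c in v))
        X.append([c / norm for c in v])
    y = [rng.choice([-1.0, 1.0]) for _ in range(n)]      # arbitrary labels
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    a = [rng.choice([-1.0, 1.0]) for _ in range(m)]      # fixed output layer
    scale = 1.0 / math.sqrt(m)

    losses = []
    for _ in range(steps + 1):
        # Pre-activations pre[i][r] = w_r . x_i, computed from the current weights.
        pre = [[sum(wj * xj for wj, xj in zip(w, x)) for w in W] for x in X]
        preds = [scale * sum(a[r] * softplus(pre[i][r]) for r in range(m))
                 for i in range(n)]
        resid = [preds[i] - y[i] for i in range(n)]
        losses.append(0.5 * sum(e * e for e in resid))
        # Gradient step: dL/dw_r = sum_i resid_i * scale * a_r * sigmoid(pre_ir) * x_i.
        for r in range(m):
            for i in range(n):
                coeff = eta * scale * a[r] * resid[i] * sigmoid(pre[i][r])
                for j in range(d):
                    W[r][j] -= coeff * X[i][j]
    return losses


if __name__ == "__main__":
    history = train_two_layer()
    print("initial loss:", history[0])
    print("final loss:  ", history[-1])
```

Plotting the logarithm of the returned losses against the step index is a simple way to check whether the decay looks linear (geometric) on this toy problem.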

Highlights

- One of the mysteries in deep learning is that randomly initialized first-order methods like gradient descent achieve zero training loss, even if the labels are arbitrary (Zhang et al, 2016)
- For a convolutional ResNet, we show that if m = poly(n, p, H), where p is the number of patches, randomly initialized gradient descent achieves zero training loss
- In this paper we focus on G^{(H)}(k), the Gram matrix induced by the weights from the H-th layer, for simplicity, at the cost of a minor degradation in convergence rate
- We show that gradient descent on deep overparametrized networks can obtain zero training loss
- Our proof builds on a careful analysis of the random initialization scheme and a perturbation analysis which shows that the Gram matrix is increasingly stable under overparametrization
- The current paper focuses on the training loss, but does not address the test loss
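
The stability of the Gram matrix under over-parameterization can be illustrated numerically. The sketch below is a simplification, not the paper's construction: it assumes a two-layer network with softplus activation and unit-norm inputs, for which the Gram matrix has entries G_ij = (x_i · x_j) · (1/m) Σ_r σ'(w_r · x_i) σ'(w_r · x_j). The weights are perturbed by a matrix of fixed Frobenius norm `radius`, and the largest entrywise change of G is reported. Because σ' (the sigmoid) is bounded by 1 and is 1/4-Lipschitz, that change is at most radius / (2·sqrt(m)) in this toy setting, so it shrinks as the width m grows. All names (`gram`, `gram_change_after_perturbation`, `radius`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Derivative of softplus, written to avoid overflow for large |z|.
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gram(W, X):
    """Gram matrix of a two-layer softplus network with respect to hidden weights W."""
    m, n = len(W), len(X)
    # s[i][r] = sigma'(w_r . x_i)
    s = [[sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for w in W] for x in X]
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(p * q for p, q in zip(X[i], X[j]))
            G[i][j] = dot * sum(s[i][r] * s[j][r] for r in range(m)) / m
    return G


def gram_change_after_perturbation(m, radius=1.0, n=5, seed=0):
    """Largest entrywise change of the Gram matrix after moving W by Frobenius norm `radius`."""
    rng = random.Random(seed)
    d = n
    # Assumption: n distinct unit-norm inputs built by hand.
    X = []
    for i in range(n):
        v = [0.2] * d
        v[i] += 1.0
        norm = math.sqrt(sum(c * c for c in v))
        X.append([c / norm for c in v])
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    # Random direction, rescaled so that the total perturbation has norm `radius`.
    D = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    fro = math.sqrt(sum(c * c for row in D for c in row))
    W_moved = [[w + radius * c / fro for w, c in zip(w_row, d_row)]
               for w_row, d_row in zip(W, D)]
    G0, G1 = gram(W, X), gram(W_moved, X)
    return max(abs(G1[i][j] - G0[i][j]) for i in range(n) for j in range(n))


if __name__ == "__main__":
    for width in (16, 64, 256, 1024):
        change = gram_change_after_perturbation(width)
        bound = 1.0 / (2.0 * math.sqrt(width))
        print(width, change, bound)
```

The fixed total radius is the relevant comparison because, in this style of analysis, the total distance the weights travel during training stays bounded while the width increases.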

Results

- The most closely related papers are Li & Liang (2018) and Du et al (2018b), who observed that when training an over-parameterized two-layer fully-connected neural network, the weights change very little; the authors use this observation to show the stability of the Gram matrix.
- Those papers used this observation to obtain the convergence rate of gradient descent on a two-layer over-parameterized neural network for the cross-entropy and least-squares losses.
- To learn the deep neural network, the authors consider the randomly initialized gradient descent algorithm to find the global minimizer of the empirical loss (1).
- Although the proof follows a high-level analysis framework similar to that of Du et al (2018b), analyzing the convergence of gradient descent for deep neural networks is significantly more involved and requires new technical tools.
- The authors develop a unified proof strategy for the fully-connected neural network, ResNet and convolutional ResNet. Their analysis in this step again sheds light on the benefit of using the ResNet architecture for training.
- As a warm-up, the authors show that gradient descent with a constant positive step size converges to the global minimum at a linear rate.
- The main convergence result for deep fully-connected neural networks is Theorem 5.1 (Convergence Rate of Gradient Descent for Deep Fully-connected Neural Networks).
- This theorem states that if the width m is large enough and the step size is set appropriately, gradient descent converges to the global minimum with zero loss at a linear rate.
- For deep fully-connected neural networks, the step size η must be exponentially small in the number of layers.
- The authors then consider the convergence of gradient descent for training a ResNet, focusing on how much over-parameterization is needed to ensure global convergence and comparing it with fully-connected neural networks.
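
The linear rate stated by the theorem can be traced to a one-step contraction of the prediction residual. The following is a sketch under assumed notation that does not appear in the summary above: u(k) is the vector of network predictions on the training set at step k, y is the label vector, G(k) is the Gram matrix at step k, and λ_0 > 0 is a lower bound tied to its least eigenvalue at initialization. Higher-order terms, which the paper has to control separately, are ignored here.

```latex
\begin{gather*}
y - u(k+1) \;\approx\; \bigl(I - \eta\, G(k)\bigr)\bigl(y - u(k)\bigr), \\[4pt]
\lambda_{\min}\bigl(G(k)\bigr) \ge \tfrac{\lambda_0}{2}
\ \text{ and }\ \eta\,\lambda_{\max}\bigl(G(k)\bigr) \le 1
\;\Longrightarrow\;
\lVert y - u(k+1) \rVert_2^2 \le \Bigl(1 - \tfrac{\eta \lambda_0}{2}\Bigr)\,\lVert y - u(k) \rVert_2^2, \\[4pt]
\text{and therefore}\quad
\lVert y - u(k) \rVert_2^2 \le \Bigl(1 - \tfrac{\eta \lambda_0}{2}\Bigr)^{k}\,\lVert y - u(0) \rVert_2^2 .
\end{gather*}
```

Read this way, the width requirement serves to keep G(k) close to its value at initialization, so that the eigenvalue condition holds at every step.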

Conclusion

- The convergence theorem for the convolutional ResNet is similar to that for ResNet. The number of neurons required per layer is only polynomial in the depth and the number of data points, and the step size is only polynomially small.
- The convergence rate can potentially be improved if the minimum eigenvalue takes into account the contribution of all Gram matrices, but this would considerably complicate the initialization and perturbation analysis.
- To further investigate generalization behavior, the authors believe some algorithm-dependent analyses may be useful (Hardt et al, 2016; Mou et al, 2018; Chen et al, 2018).

Related work

- Recently, many works have studied the optimization problem in deep learning. Since optimizing a neural network is a non-convex problem, one approach is first to develop a general theory for a class of non-convex problems that satisfy desired geometric properties and then to show that the neural network optimization problem belongs to this class. One promising candidate class is the set of functions that satisfy: a) all local minima are global and b) every saddle point has a direction of negative curvature. For this function class, researchers have shown that (perturbed) gradient descent (Jin et al, 2017; Ge et al, 2015; Lee et al, 2016; Du et al, 2017a) can find a global minimum. Many previous works thus study the optimization landscape of neural networks with different activation functions (Soudry & Hoffer, 2017; Safran & Shamir, 2018; 2016; Zhou & Liang, 2017; Freeman & Bruna, 2016; Hardt & Ma, 2016; Nguyen & Hein, 2017; Kawaguchi, 2016; Venturi et al, 2018; Soudry & Carmon, 2016; Du & Lee, 2018; Soltanolkotabi et al, 2018; Haeffele & Vidal, 2015). However, even for a three-layer linear network, there exists a saddle point that does not have negative curvature (Kawaguchi, 2016), so it is unclear whether this geometry-based approach can be used to obtain a global convergence guarantee for first-order methods.

Funding

- SSD acknowledges support from AFRL grant FA8750-17-2-0212 and DARPA D17AP00001
- JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303
- HL and LW acknowledge support from National Basic Research Program of China (973 Program) (grant no. 2015CB352502), NSFC (61573026) and BJNSF (L172037)

References

- Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018a.
- Allen-Zhu, Z., Li, Y., and Song, Z. On the convergence rate of training recurrent neural networks. arXiv preprint arXiv:1810.12065, 2018b.
- Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018c.
- Andoni, A., Panigrahy, R., Valiant, G., and Zhang, L. Learning polynomials with neural networks. In International Conference on Machine Learning, pp. 1908–1916, 2014.
- Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
- Brutzkus, A. and Globerson, A. Globally optimal gradient descent for a ConvNet with gaussian inputs. In International Conference on Machine Learning, pp. 605–614, 2017.
- Chen, Y., Jin, C., and Yu, B. Stability and Convergence Trade-off of Iterative Optimization Algorithms. arXiv e-prints, art. arXiv:1804.01619, Apr 2018.
- Chizat, L. and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport. arXiv preprint arXiv:1805.09545, 2018a.
- Chizat, L. and Bach, F. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018b.
- Du, S. S. and Lee, J. D. On the power of overparametrization in neural networks with quadratic activation. Proceedings of the 35th International Conference on Machine Learning, pp. 1329–1338, 2018.
- Du, S. S., Jin, C., Lee, J. D., Jordan, M. I., Singh, A., and Poczos, B. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pp. 1067–1077, 2017a.
- Du, S. S., Lee, J. D., and Tian, Y. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017b.
- Du, S. S., Lee, J. D., Tian, Y., Poczos, B., and Singh, A. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Proceedings of the 35th International Conference on Machine Learning, pp. 1339–1348, 2018a.
- Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018b.
- Freeman, C. D. and Bruna, J. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.
- Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points − online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pp. 797–842, 2015.
- Haeffele, B. D. and Vidal, R. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
- Hardt, M. and Ma, T. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
- Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1225–1234, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/hardt16.html.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Daniely, A. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2017.
- Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
- Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pp. 1724–1732, 2017.
- Kawaguchi, K. Deep learning without poor local minima. In Advances In Neural Information Processing Systems, pp. 586–594, 2016.
- Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
- Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In Conference on Learning Theory, pp. 1246–1257, 2016.
- Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.
- Li, Y. and Yuan, Y. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pp. 597–607, 2017.
- Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems 30, pp. 6231–6239. Curran Associates, Inc., 2017.
- Malliavin, P. Gaussian sobolev spaces and stochastic calculus of variations. 1995.
- Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.
- Mei, S., Montanari, A., and Nguyen, P.-M. A mean field view of the landscape of two-layers neural networks. Proceedings of the National Academy of Sciences, pp. E7665–E7671, 2018.
- Mou, W., Wang, L., Zhai, X., and Zheng, K. Generalization bounds of sgld for non-convex learning: Two theoretical viewpoints. In Bubeck, S., Perchet, V., and Rigollet, P. (eds.), Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 605–638. PMLR, 06–09 Jul 2018. URL http://proceedings.mlr.press/v75/mou18a.html.
- Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.
- Rotskoff, G. M. and Vanden-Eijnden, E. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
- Safran, I. and Shamir, O. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.
- Safran, I. and Shamir, O. Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning, pp. 4433–4441, 2018.
- Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.
- Sirignano, J. and Spiliopoulos, K. Mean field analysis of neural networks. arXiv preprint arXiv:1805.01053, 2018.
- Soltanolkotabi, M. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems, pp. 2007–2017, 2017.
- Soltanolkotabi, M., Javanmard, A., and Lee, J. D. Theoretical insights into the optimization landscape of overparameterized shallow neural networks. IEEE Transactions on Information Theory, 2018.
- Soudry, D. and Carmon, Y. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
- Soudry, D. and Hoffer, E. Exponentially vanishing suboptimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.
- Tian, Y. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In International Conference on Machine Learning, pp. 3404–3413, 2017.
- Venturi, L., Bandeira, A., and Bruna, J. Neural networks with finite intrinsic dimension have no spurious valleys. arXiv preprint arXiv:1802.06384, 2018.
- Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
- Nguyen, Q. and Hein, M. The loss surface of deep and wide neural networks. In International Conference on Machine Learning, pp. 2603–2612, 2017.
- Wei, C., Lee, J. D., Liu, Q., and Ma, T. On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369, 2018.
- Zagoruyko, S. and Komodakis, N. Wide residual networks. NIN, 8:35–67, 2016.
- Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
- Zhang, H., Dauphin, Y. N., and Ma, T. Residual learning without normalization via better initialization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1gsz30cKX.
- Zhang, X., Yu, Y., Wang, L., and Gu, Q. Learning one-hidden-layer relu networks via gradient descent. arXiv preprint arXiv:1806.07808, 2018.
- Zhong, K., Song, Z., and Dhillon, I. S. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017a.
- Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017b.
- Zhou, Y. and Liang, Y. Critical points of neural networks: Analytical forms and landscape properties. arXiv preprint arXiv:1710.11205, 2017.
- Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.
