# Dynamics of Deep Neural Networks and Neural Tangent Hierarchy

ICML, pp. 4542-4551, 2019.

Abstract:

The evolution of a deep neural network trained by gradient descent can be described by its neural tangent kernel (NTK), as introduced in [20], where it was proven that in the infinite-width limit the NTK converges to an explicit limiting kernel and stays constant during training. The NTK was also implicit in some other recent papers. […]

Introduction

- Deep neural networks have become popular due to their unprecedented success in a variety of machine learning tasks.
- Training a deep neural network is usually done via a gradient descent based algorithm.
- Analyzing such training dynamics is challenging.
- As highly nonlinear structures, deep neural networks usually involve a large number of parameters.
- Because training poses a highly non-convex optimization problem, there is no guarantee that a gradient-based algorithm will be able to find the optimal parameters efficiently during the training of neural networks.
- One question arises: given such complexities, is it possible to obtain a succinct description of the training dynamics?
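
One succinct object behind these dynamics is the NTK itself: the Gram matrix of parameter gradients of the network outputs. A minimal NumPy sketch, assuming a two-layer ReLU network with the usual 1/√m NTK normalization (the function names and setup here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(W, a, x):
    """Gradient of f(x) = a . relu(W x) / sqrt(m) w.r.t. all parameters."""
    m = a.shape[0]
    pre = W @ x                                   # hidden pre-activations
    dW = np.outer(a * (pre > 0), x) / np.sqrt(m)  # d f / d W
    da = np.maximum(pre, 0.0) / np.sqrt(m)        # d f / d a
    return np.concatenate([dW.ravel(), da])

def empirical_ntk(W, a, X):
    """K_ij = <grad_theta f(x_i), grad_theta f(x_j)>."""
    G = np.stack([grad_f(W, a, x) for x in X])
    return G @ G.T

m, d = 512, 3
W, a = rng.standard_normal((m, d)), rng.standard_normal(m)
X = rng.standard_normal((4, d))
K = empirical_ntk(W, a, X)
print(K.shape)  # (4, 4): a symmetric positive semi-definite Gram matrix
```

At finite width this matrix changes as the parameters move under gradient descent; describing that change is exactly what the neural tangent hierarchy below is for.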

Highlights

- Deep neural networks have become popular due to their unprecedented success in a variety of machine learning tasks
- We study the dynamics of the neural tangent kernel for finite-width deep fully-connected neural networks
- Training a deep neural network is usually done via a gradient descent based algorithm
- We show that the gradient descent dynamics are captured by an infinite hierarchy of ordinary differential equations, the neural tangent hierarchy (NTH)
- We study the continuous time gradient descent of deep fully-connected neural networks
- We show that the training dynamics are given by a data dependent infinite hierarchy of ordinary differential equations, i.e., the neural tangent hierarchy
Conclusion

- The authors study the continuous time gradient descent of deep fully-connected neural networks.
- The authors show that the dynamics of the NTK can be approximated by a finite truncation of this hierarchy up to any precision.
- This description makes it possible to directly study the change of the NTK for deep neural networks.
- The authors mainly study deep fully-connected neural networks here, but they believe the same statements can be proven for convolutional and residual neural networks.

Related work

- In this section, we survey an incomplete list of previous works on the optimization aspects of deep neural networks.

Because of the highly non-convex nature of deep neural networks, gradient based algorithms can potentially get stuck near a critical point, i.e., a saddle point or a local minimum. So one important question for deep neural networks is: what does the loss landscape look like? One promising candidate for loss landscapes is the class of functions that satisfy: (i) all local minima are global minima and (ii) every saddle point has a direction of negative curvature. A line of recent results shows that, in many optimization problems of interest [7, 15, 16, 33, 40, 41], the loss landscape lies in this class. For this function class, (perturbed) gradient descent [15, 21, 28] can find a global minimum. However, even for a three-layer linear network, there exists a saddle point that has no direction of negative curvature [22]. So it is unclear whether this geometry-based approach can be used to obtain global convergence guarantees for first-order methods. Another approach is to show that practical deep neural networks admit some additional structure or assumption that makes the non-convex optimization tractable. Under certain simplifying assumptions, it has recently been proven that deep neural networks exhibit novel loss landscape structures, which may play a role in making the optimization tractable [9, 11, 22, 24, 30].
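
To make the negative-curvature condition at saddle points concrete, here is a small illustration (not from any of the surveyed papers): on f(x, y) = x² − y², the origin is a strict saddle. Plain gradient descent initialized exactly on the stable manifold y = 0 converges to the saddle, while an arbitrarily small perturbation in y escapes along the negative-curvature direction.

```python
import numpy as np

def gd(start, eta=0.1, steps=200):
    # Gradient descent on f(x, y) = x^2 - y^2, whose only critical
    # point (0, 0) is a strict saddle: curvature +2 in x, -2 in y.
    p = np.array(start, dtype=float)
    for _ in range(steps):
        grad = np.array([2.0 * p[0], -2.0 * p[1]])
        p -= eta * grad
    return p

stuck = gd([1.0, 0.0])     # on the stable manifold: converges to the saddle
escaped = gd([1.0, 1e-8])  # tiny perturbation in y: escapes geometrically
print(abs(stuck[0]) < 1e-6, abs(escaped[1]) > 1.0)  # True True
```

The x-coordinate contracts by a factor 0.8 per step while any nonzero y-coordinate grows by a factor 1.2, which is precisely the mechanism that perturbed gradient descent [15, 21, 28] exploits on strict-saddle landscapes.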

Funding

- Y. is partially supported by NSF Grants DMS-1606305 and DMS-1855509, and a Simons Investigator award

References

- Z. Allen-Zhu and Y. Li. What can resnet learn efficiently, going beyond kernels? arXiv preprint arXiv:1905.10337, 2019.
- Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
- Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In ICML, arXiv:1811.03962, 2018.
- D. Araujo, R. I. Oliveira, and D. Yukimura. A mean-field limit for certain deep neural networks. arXiv preprint arXiv:1906.00193, 2019.
- S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.
- S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
- S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
- L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in neural information processing systems, pages 3036–3046, 2018.
- A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 192–204, 2015.
- R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of machine learning research, 12(Aug):2493–2537, 2011.
- Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. ICML, arXiv:1811.03804, 2018.
- S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR, arXiv:1810.02054, 2018.
- R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
- R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
- B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019.
- X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
- G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.
- A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
- C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1724–1732. JMLR. org, 2017.
- K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
- K. Kawaguchi and J. Huang. Gradient descent finds global minima for generalizable deep neural networks of practical sizes. arXiv preprint arXiv:1908.02419, 2019.
- K. Kawaguchi and L. P. Kaelbling. Elimination of all bad local minima in deep learning. arXiv preprint arXiv:1901.00279, 2019.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
- J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on learning theory, pages 1246–1257, 2016.
- Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.
- S. Liang, R. Sun, J. D. Lee, and R. Srikant. Adding one neuron can eliminate all bad local minima. In Advances in Neural Information Processing Systems, 2018.
- S. Mei, T. Misiakiewicz, and A. Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015, 2019.
- P.-M. Nguyen. Mean field limit of the learning dynamics of multilayer neural networks. arXiv preprint arXiv:1902.02880, 2019.
- D. Park, A. Kyrillidis, C. Caramanis, and S. Sanghavi. Non-square matrix sensing without spurious local minima via the burer-monteiro approach. arXiv preprint arXiv:1609.03240, 2016.
- T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran. Deep convolutional neural networks for lvcsr. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 8614– 8618. IEEE, 2013.
- D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- J. Sirignano and K. Spiliopoulos. Mean field analysis of deep neural networks. arXiv preprint arXiv:1903.04440, 2019.
- S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115:E7665–E7671, 2018.
- Z. Song and X. Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. arXiv preprint arXiv:1906.03593, 2019.
- J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere i: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2016.
- J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
- R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed sensing, pages 210–268. Cambridge Univ. Press, Cambridge, 2012.
- Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
- G. Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. CoRR, abs/1902.04760, 2019.
- G. Yehudai and O. Shamir. On the power and limitations of random features for understanding neural networks. arXiv preprint arXiv:1904.00687, 2019.
- D. Zou, Y. Cao, D. Zhou, and Q. Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.
