# The Loss Surfaces of Multilayer Networks

JMLR Workshop and Conference Proceedings, pp. 192-204, 2015.

Abstract:

We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity.

Introduction

- The authors study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity.
- The authors empirically verify that the mathematical model exhibits behavior similar to that of the computer simulations, despite the presence of high dependencies in real networks.
- The authors conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality, as measured by the test error.
- The authors prove that recovering the global minimum becomes harder as the network size increases, and that this is in practice irrelevant, as the global minimum often leads to overfitting.
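The conjecture above can be illustrated with a toy experiment (a sketch only, not the paper's setup; the network sizes, learning rate, and synthetic regression task below are invented for illustration): train a small one-hidden-layer ReLU network with plain SGD from many random initializations and inspect the spread of final losses.

```python
import numpy as np

def train_once(seed, n=200, d=10, h=25, steps=5000, lr=0.01):
    """Train a one-hidden-layer ReLU net with single-sample SGD on a
    fixed synthetic regression task; return the final mean squared loss."""
    rng = np.random.default_rng(seed)            # per-run initialization
    data_rng = np.random.default_rng(0)          # same task for every run
    X = data_rng.normal(size=(n, d))
    y = np.tanh(X @ data_rng.normal(size=d))     # fixed nonlinear target
    W1 = rng.normal(scale=1 / np.sqrt(d), size=(d, h))
    w2 = rng.normal(scale=1 / np.sqrt(h), size=h)
    for _ in range(steps):
        i = rng.integers(n)                      # single-sample SGD step
        a = np.maximum(X[i] @ W1, 0.0)           # ReLU activations
        err = a @ w2 - y[i]
        w2 -= lr * err * a
        W1 -= lr * err * np.outer(X[i], (X[i] @ W1 > 0) * w2)
    pred = np.maximum(X @ W1, 0.0) @ w2
    return float(np.mean((pred - y) ** 2))

# Many random restarts of SGD on the same task: compare final losses.
losses = [train_once(s) for s in range(10)]
print(min(losses), max(losses))
```

If the conjecture's picture holds in this toy setting, the final losses should fall in a narrow band rather than scattering between good and bad minima.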

Highlights

- We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity
- We found that stochastic gradient descent performed at least as well as simulated annealing, which indicates that becoming trapped in poor saddle points is not a problem in our experiments
- For the spin-glasses, we see that for small values of Λ we obtain poor local minima in many experiments, while for larger values of Λ the distribution becomes increasingly concentrated around the energy barrier, where local minima have high quality
- This paper establishes a connection between the neural network and the spin-glass model
- We show that under certain assumptions, the loss function of the fully decoupled large-size neural network of depth H has a landscape similar to that of the Hamiltonian of the H-spin spherical spin-glass model
- We empirically demonstrate that both models studied here are highly similar in real settings, despite the presence of variable dependencies in real networks
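For reference, the H-spin spherical spin-glass Hamiltonian that the loss is mapped onto can be written in the standard notation of the spin-glass literature (this is the usual definition with i.i.d. standard Gaussian couplings; the notation follows convention, not necessarily the paper's exact symbols):

```latex
% H-spin spherical spin-glass Hamiltonian on N spins
% X_{i_1,...,i_H} are i.i.d. standard Gaussian couplings
\mathcal{H}_{N,H}(\mathbf{w})
  = \frac{1}{N^{(H-1)/2}}
    \sum_{i_1,\dots,i_H=1}^{N} X_{i_1,\dots,i_H}\, w_{i_1} w_{i_2} \cdots w_{i_H},
\qquad
\text{subject to } \frac{1}{N}\sum_{i=1}^{N} w_i^2 = 1 .
```

Under the paper's assumptions, the depth H of the network plays the role of the interaction order of the spin glass, and the (redundancy-reduced) number of weights plays the role of N.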

Methods

- The theoretical part of the paper considers the problem of training the neural network, whereas the empirical results focus on its generalization properties.

Results

- The authors observe that the left tails for all Λ touch a barrier that is hard to penetrate, and that as Λ increases the values concentrate around −E∞.
- This concentration result had long been predicted but was not proved until [Auffinger et al., 2010].
- The variance decreases as the network size grows.
- This is clearly captured in Figures 8 and 9 in the Supplementary material.
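The concentration of the minima found can be reproduced in a small simulation (a sketch under the assumption of the standard 3-spin spherical model with i.i.d. Gaussian couplings; the sizes, step counts, and learning rate are illustrative): run projected gradient descent from random points on the sphere and record the per-spin energies of the points reached, for two system sizes.

```python
import numpy as np

def minimize_energy(N, seed, steps=300, lr=0.02):
    """Projected gradient descent on a 3-spin spherical spin glass:
    H(w) = N^{-1} * sum_{ijk} J_ijk w_i w_j w_k  with  ||w||^2 = N.
    Returns the per-spin energy H(w)/N at the point reached."""
    rng = np.random.default_rng(seed)
    J = rng.normal(size=(N, N, N))               # i.i.d. Gaussian couplings
    w = rng.normal(size=N)
    w *= np.sqrt(N) / np.linalg.norm(w)          # project onto the sphere
    for _ in range(steps):
        # gradient of H: the couplings enter through all three index slots
        g = (np.einsum('ijk,j,k->i', J, w, w)
             + np.einsum('ijk,i,k->j', J, w, w)
             + np.einsum('ijk,i,j->k', J, w, w)) / N
        w -= lr * g
        w *= np.sqrt(N) / np.linalg.norm(w)      # stay on the sphere
    return float(np.einsum('ijk,i,j,k', J, w, w, w) / N**2)

# Larger N (the analogue of Lambda): per-spin energies of the minima found.
energies = {N: [minimize_energy(N, s) for s in range(5)] for N in (20, 40)}
for N, es in energies.items():
    print(N, np.mean(es), np.std(es))
```

In this kind of run the energies are negative and, as the bullet above describes, the spread across restarts should shrink as the system size grows.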

Conclusion

- This paper establishes a connection between the neural network and the spin-glass model.
- The authors show that under certain assumptions, the loss function of the fully decoupled large-size neural network of depth H has a landscape similar to that of the Hamiltonian of the H-spin spherical spin-glass model.
- The authors empirically demonstrate that both models studied here are highly similar in real settings, despite the presence of variable dependencies in real networks.
- To the best of the authors' knowledge, this work is one of the first efforts in the literature to shed light on the theory of neural network optimization.


- Table 1: Pearson correlation between training and test loss for different numbers of hidden units
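The statistic behind Table 1 (the Pearson correlation between training and test losses across trained networks) can be computed with `numpy.corrcoef`; the arrays below are synthetic stand-ins, not the paper's measurements.

```python
import numpy as np

# Hypothetical (train_loss, test_loss) pairs for many trained networks;
# a correlation near 1 means the training loss predicts the test loss well.
rng = np.random.default_rng(0)
train_loss = rng.uniform(0.05, 0.30, size=50)
test_loss = train_loss + rng.normal(scale=0.01, size=50)  # nearly aligned

pearson_r = np.corrcoef(train_loss, test_loss)[0, 1]
print(f"Pearson correlation: {pearson_r:.3f}")
```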

Reference

- [Amit et al., 1985] Amit, D. J., Gutfreund, H., and Sompolinsky, H. (1985). Spin-glass models of neural networks. Phys. Rev. A, 32:1007–1018.
- [Auffinger and Ben Arous, 2013] Auffinger, A. and Ben Arous, G. (2013). Complexity of random smooth functions on the high-dimensional sphere. arXiv:1110.5872.
- [Auffinger et al., 2010] Auffinger, A., Ben Arous, G., and Cerny, J. (2010). Random matrices and complexity of spin glasses. arXiv:1003.1129.
- [Baldi and Hornik, 1989] Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53–58.
- [Bottou, 1998] Bottou, L. (1998). Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press.
- [Bray and Dean, 2007] Bray, A. J. and Dean, D. S. (2007). The statistics of critical points of gaussian fields on large-dimensional spaces. Physics Review Letter.
- [Dauphin et al., 2014] Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS.
- [De la Pena and Gine, 1999] De la Pena, V. H. and Gine, E. (1999). Decoupling: from dependence to independence: randomly stopped processes, U-statistics and processes, martingales and beyond. Probability and its applications. Springer.
- [Denil et al., 2013] Denil, M., Shakibi, B., Dinh, L., Ranzato, M., and Freitas, N. D. (2013). Predicting parameters in deep learning. In NIPS.
- [Denton et al., 2014] Denton, E., Zaremba, W., Bruna, J., LeCun, Y., and Fergus, R. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS.
- [Dotsenko, 1995] Dotsenko, V. (1995). An Introduction to the Theory of Spin Glasses and Neural Networks. World Scientific Lecture Notes in Physics.
- [Fyodorov and Williams, 2007] Fyodorov, Y. V. and Williams, I. (2007). Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity. Journal of Statistical Physics, 129(5-6):1081–1116.
- [Goodfellow et al., 2013] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In ICML.
- [Hastie et al., 2001] Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics.
- [Hinton et al., 2012] Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine.
- [Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.
- [LeCun et al., 1998a] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324.
- [LeCun et al., 1998b] LeCun, Y., Bottou, L., Orr, G., and Muller, K. (1998b). Efficient backprop. In Neural Networks: Tricks of the trade. Springer.
- [Nair and Hinton, 2010] Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted boltzmann machines. In ICML.
- [Nakanishi and Takayama, 1997] Nakanishi, K. and Takayama, H. (1997). Mean-field theory for a spin-glass model of neural networks: TAP free energy and the paramagnetic to spin-glass transition. Journal of Physics A: Mathematical and General, 30:8085.
- [Saad, 2009] Saad, D. (2009). On-line learning in neural networks, volume 17. Cambridge University Press.
- [Saad and Solla, 1995] Saad, D. and Solla, S. A. (1995). Exact solution for on-line learning in multilayer neural networks. Physical Review Letters, 74(21):4337.
- [Saxe et al., 2014] Saxe, A. M., McClelland, J. L., and Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR.
- [Weston et al., 2014] Weston, J., Chopra, S., and Adams, K. (2014). #tagspace: Semantic embeddings from hashtags. In EMNLP.
- [Wigner, 1958] Wigner, E. P. (1958). On the Distribution of the Roots of Certain Symmetric Matrices. The Annals of Mathematics, 67:325–327.
