The Loss Surfaces of Multilayer Networks

JMLR Workshop and Conference Proceedings, pp. 192-204, 2015.

Abstract:

We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to ex...

Introduction
  • The authors study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity; the resulting Hamiltonian is recalled after this list.
  • The authors empirically verify that the mathematical model exhibits behavior similar to that of the computer simulations, despite the presence of strong dependencies in real networks.
  • The authors conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality, as measured by the test error.
  • The authors prove that recovering the global minimum becomes harder as the network size increases, and that this is in practice irrelevant, as the global minimum often leads to overfitting.
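For reference, the Hamiltonian in question is that of the H-spin spherical spin-glass model. In the paper's notation (up to constants), with Λ the number of distinct network weights and the couplings X_{i_1,...,i_H} i.i.d. standard Gaussian, it reads

$$
\mathcal{H}_{\Lambda,H}(\tilde{w}) \;=\; \frac{1}{\Lambda^{(H-1)/2}} \sum_{i_1,\dots,i_H=1}^{\Lambda} X_{i_1,\dots,i_H}\,\tilde{w}_{i_1}\tilde{w}_{i_2}\cdots\tilde{w}_{i_H},
\qquad \text{subject to}\quad \frac{1}{\Lambda}\sum_{i=1}^{\Lambda}\tilde{w}_i^{2} = 1.
$$

Under assumptions i)-iii), the rescaled loss of a depth-H fully decoupled network is modeled by this random function on the sphere, so results about its critical points transfer to statements about the network's loss surface.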
Highlights
  • We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity
  • We found that stochastic gradient descent performed at least as well as simulated annealing, which indicates that becoming trapped in poor saddle points is not a problem in our experiments
  • For the spin-glasses, we see that for small values of Λ we obtain poor local minima in many experiments, while for larger values of Λ the distribution becomes increasingly concentrated around the energy barrier, where local minima have high quality (see the numerical sketch after this list)
  • This paper establishes a connection between the neural network and the spin-glass model
  • We show that under certain assumptions, the loss function of the fully decoupled large-size neural network of depth H has a landscape similar to that of the Hamiltonian of the H-spin spherical spin-glass model
  • We empirically demonstrate that both models studied here are highly similar in real settings, despite the presence of variable dependencies in real networks
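To make the Λ experiments above concrete, here is a minimal, self-contained numpy sketch (not the authors' code): it draws a random H = 3 spherical spin-glass, runs projected gradient descent on the sphere from several random starts, and reports how the normalized energies of the recovered critical points spread for different Λ. The sizes, step count, and learning rate are arbitrary illustrative choices.

```python
import numpy as np

H = 3  # spin-glass order (plays the role of network depth in the paper's analogy)

def energy(X, w, n):
    # H_{n,3}(w) = n^{-(H-1)/2} * sum_{ijk} X_ijk w_i w_j w_k
    return np.einsum('ijk,i,j,k->', X, w, w, w) / n ** ((H - 1) / 2)

def gradient(X, w, n):
    # Gradient of the cubic form above with respect to w.
    g = (np.einsum('ijk,j,k->i', X, w, w)
         + np.einsum('ijk,i,k->j', X, w, w)
         + np.einsum('ijk,i,j->k', X, w, w))
    return g / n ** ((H - 1) / 2)

def local_minimum_energy(X, n, steps=1500, lr=0.05, seed=0):
    # Projected gradient descent on the sphere ||w||^2 = n (the model's constraint).
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)
    w *= np.sqrt(n) / np.linalg.norm(w)
    for _ in range(steps):
        g = gradient(X, w, n)
        g -= (g @ w) / (w @ w) * w            # keep only the tangential component
        w -= lr * g
        w *= np.sqrt(n) / np.linalg.norm(w)   # retract back onto the sphere
    return energy(X, w, n) / n                # normalized energy, to compare with -E_inf

E_inf = 2 * np.sqrt((H - 1) / H)              # asymptotic threshold, ~1.633 for H = 3

for lam in (10, 25, 50):                      # Lambda: number of spins / distinct weights
    rng = np.random.default_rng(lam)
    X = rng.standard_normal((lam, lam, lam))  # i.i.d. Gaussian couplings
    es = [local_minimum_energy(X, lam, seed=s) for s in range(10)]
    print(f"Lambda={lam:3d}  mean={np.mean(es):+.3f}  "
          f"std={np.std(es):.3f}  -E_inf={-E_inf:.3f}")
```

As Λ grows, the spread of the recovered normalized energies should shrink, which is the qualitative effect the highlights describe.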
Methods
  • The theoretical part of the paper considers the problem of training the neural network, whereas the empirical results focus on its generalization properties.
Results
  • The authors observe that the left tails for all Λ touch a barrier that is hard to penetrate, and that as Λ increases the values concentrate around −E∞ (the formula for E∞ is recalled after this list).
  • This concentration result had long been predicted but was not proved until [Auffinger et al., 2010].
  • The variance decreases as the network size grows.
  • This is clearly captured in Figures 8 and 9 in the Supplementary material.
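For context, the threshold E∞ above comes from the spin-glass literature. Following [Auffinger et al., 2010] and the paper's notation, for the H-spin spherical model

$$
E_\infty(H) \;=\; 2\sqrt{\frac{H-1}{H}},
$$

and as Λ grows the low-index critical values of the Hamiltonian concentrate in a band lower-bounded by the (scaled) ground-state energy −Λ E_0(H) and upper-bounded by −Λ E_∞(H); the histograms referred to above show the normalized energies of the recovered minima concentrating near −E∞(H).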
Conclusion
  • This paper establishes a connection between the neural network and the spin-glass model.
  • The authors show that under certain assumptions, the loss function of the fully decoupled large-size neural network of depth H has a landscape similar to that of the Hamiltonian of the H-spin spherical spin-glass model.
  • The authors empirically demonstrate that both models studied here are highly similar in real settings, despite the presence of variable dependencies in real networks.
  • To the best of the authors' knowledge, this work is one of the first efforts in the literature to shed light on the theory of neural network optimization.
Tables
  • Table 1: Pearson correlation between training and test loss for different numbers of hidden units
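The quantity in Table 1 is an ordinary Pearson correlation coefficient between paired training and test losses. A minimal sketch of how such a value could be computed follows; the loss values below are made-up placeholders, not the paper's data.

```python
import numpy as np

# Hypothetical paired losses: one (train, test) pair per trained network
# with a fixed number of hidden units (placeholder values).
train_loss = np.array([0.081, 0.094, 0.077, 0.102, 0.088])
test_loss  = np.array([0.112, 0.125, 0.109, 0.131, 0.118])

# Pearson correlation coefficient between the two loss sequences.
r = np.corrcoef(train_loss, test_loss)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```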
Reference
  • [Amit et al., 1985] Amit, D. J., Gutfreund, H., and Sompolinsky, H. (1985). Spin-glass models of neural networks. Phys. Rev. A, 32:1007–1018.
  • [Auffinger and Ben Arous, 2013] Auffinger, A. and Ben Arous, G. (2013). Complexity of random smooth functions on the high-dimensional sphere. arXiv:1110.5872.
  • [Auffinger et al., 2010] Auffinger, A., Ben Arous, G., and Cerny, J. (2010). Random matrices and complexity of spin glasses. arXiv:1003.1129.
  • [Baldi and Hornik, 1989] Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53–58.
  • [Bottou, 1998] Bottou, L. (1998). Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press.
  • [Bray and Dean, 2007] Bray, A. J. and Dean, D. S. (2007). The statistics of critical points of Gaussian fields on large-dimensional spaces. Physical Review Letters.
  • [Dauphin et al., 2014] Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS.
  • [De la Pena and Gine, 1999] De la Pena, V. H. and Gine, E. (1999). Decoupling: from dependence to independence: randomly stopped processes, U-statistics and processes, martingales and beyond. Probability and its Applications. Springer.
  • [Denil et al., 2013] Denil, M., Shakibi, B., Dinh, L., Ranzato, M., and Freitas, N. D. (2013). Predicting parameters in deep learning. In NIPS.
  • [Denton et al., 2014] Denton, E., Zaremba, W., Bruna, J., LeCun, Y., and Fergus, R. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS.
  • [Dotsenko, 1995] Dotsenko, V. (1995). An Introduction to the Theory of Spin Glasses and Neural Networks. World Scientific Lecture Notes in Physics.
  • [Fyodorov and Williams, 2007] Fyodorov, Y. V. and Williams, I. (2007). Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity. Journal of Statistical Physics, 129(5-6):1081–1116.
  • [Goodfellow et al., 2013] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In ICML.
  • [Hastie et al., 2001] Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics.
  • [Hinton et al., 2012] Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine.
  • [Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.
  • [LeCun et al., 1998a] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324.
  • [LeCun et al., 1998b] LeCun, Y., Bottou, L., Orr, G., and Muller, K. (1998b). Efficient backprop. In Neural Networks: Tricks of the Trade. Springer.
  • [Nair and Hinton, 2010] Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted Boltzmann machines. In ICML.
  • [Nakanishi and Takayama, 1997] Nakanishi, K. and Takayama, H. (1997). Mean-field theory for a spin-glass model of neural networks: TAP free energy and the paramagnetic to spin-glass transition. Journal of Physics A: Mathematical and General, 30:8085.
  • [Saad, 2009] Saad, D. (2009). On-line learning in neural networks, volume 17. Cambridge University Press.
  • [Saad and Solla, 1995] Saad, D. and Solla, S. A. (1995). Exact solution for on-line learning in multilayer neural networks. Physical Review Letters, 74(21):4337.
  • [Saxe et al., 2014] Saxe, A. M., McClelland, J. L., and Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR.
  • [Weston et al., 2014] Weston, J., Chopra, S., and Adams, K. (2014). #TagSpace: Semantic embeddings from hashtags. In EMNLP.
  • [Wigner, 1958] Wigner, E. P. (1958). On the Distribution of the Roots of Certain Symmetric Matrices. The Annals of Mathematics, 67:325–327.