
Understanding the difficulty of training deep feedforward neural networks

AISTATS (2010): 249-256

Cited by 12508 | Views 605
EI

Abstract

Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms...

Introduction
  • Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures.
  • The authors study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1
  • Based on these considerations, the authors propose a new initialization scheme that brings substantially faster convergence (a code sketch of this scheme follows this list).
  • Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions, one may need deep architectures
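As referenced above, here is a minimal sketch of the normalized initialization the paper proposes, assuming a fully connected layer with fan_in inputs and fan_out outputs; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def normalized_init(fan_in, fan_out, rng=None):
    """Draw weights from U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))],
    the normalized initialization proposed in the paper, aimed at keeping
    activation and back-propagated gradient variances roughly constant across
    layers (layer-to-layer Jacobian singular values close to 1)."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Example: weight matrix for a 784 -> 500 layer with a tanh-like activation.
W = normalized_init(784, 500, rng=np.random.default_rng(0))
print(W.shape, float(W.min()), float(W.max()))
```

For comparison, the commonly used heuristic analyzed in the paper draws each weight from U[-1/sqrt(fan_in), 1/sqrt(fan_in)], which the authors show lets activation and gradient variances shrink across layers.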
Highlights
  • Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures
  • Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions, one may need deep architectures
  • Here instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old multilayer neural networks
  • We have found that the logistic regression or conditional log-likelihood cost function (−log P(y|x) coupled with softmax outputs) worked much better than the quadratic cost which was traditionally used to train feedforward neural networks (Rumelhart et al., 1986)
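To make the cost comparison above concrete, here is a small illustrative sketch (not the authors' code), assuming one-hot targets and raw pre-softmax scores z; applying the quadratic cost to the softmax outputs is a simplifying choice for illustration, not necessarily the paper's exact setup.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll_cost(z, y_onehot):
    """Conditional log-likelihood cost: -log P(y|x) with softmax outputs."""
    p = softmax(z)
    return -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))

def quadratic_cost(z, y_onehot):
    """Traditional quadratic (squared-error) cost, here on the softmax outputs."""
    p = softmax(z)
    return 0.5 * np.mean(np.sum((p - y_onehot) ** 2, axis=1))

# Tiny example: 2 samples, 3 classes, targets are classes 0 and 2.
z = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
y = np.eye(3)[[0, 2]]
print(nll_cost(z, y), quadratic_cost(z, y))
```

The log-likelihood cost keeps a large gradient on confidently wrong outputs, whereas the quadratic cost can produce plateaus, which is one reason it trains less well.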
Methods
  • Experiments with the Sigmoid

    The sigmoid non-linearity has already been shown to slow down learning because of its non-zero mean, which induces important singular values in the Hessian (LeCun et al., 1998b).
  • The graph shows the means and standard deviations of these activations
  • These statistics along with histograms are computed at different times during learning, by looking at activation values for a fixed set of 300 test examples.
  • The authors can see at the end of training that the histogram of activation values is very different from that seen with the hyperbolic tangent (Figure 4)
  • Whereas the latter yields modes of the activations distribution mostly at the extremes or around 0, the softsign network has modes of activations around its knees.
  • These are the areas where there is substantial non-linearity but where the gradients flow well (a small illustrative sketch of these activations and the monitored statistics follows this list)
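The following sketch, with illustrative layer widths, weight scaling, and function names of my own choosing, shows the three activations discussed and how per-layer activation means and standard deviations over a fixed batch could be monitored, in the spirit of the statistics described above.

```python
import numpy as np

# The three activations compared in the paper.
sigmoid  = lambda x: 1.0 / (1.0 + np.exp(-x))   # non-zero mean (0.5 at the origin)
tanh     = np.tanh                              # zero-centred, saturates at +/-1
softsign = lambda x: x / (1.0 + np.abs(x))      # zero-centred, saturates more softly

def layer_activation_stats(x, weights, act):
    """Forward-propagate a fixed batch x through the linear layers in `weights`
    and record the mean/std of each layer's activations, as in the monitoring
    over a fixed set of test examples described above."""
    stats, h = [], x
    for W in weights:
        h = act(h @ W)
        stats.append((float(h.mean()), float(h.std())))
    return stats

# Toy example: a 5-layer net initialized with the common U[-1/sqrt(fan_in), 1/sqrt(fan_in)] heuristic.
rng = np.random.default_rng(0)
widths = [100, 80, 80, 80, 80, 80]
weights = [rng.uniform(-1/np.sqrt(n_in), 1/np.sqrt(n_in), size=(n_in, n_out))
           for n_in, n_out in zip(widths[:-1], widths[1:])]
x = rng.normal(size=(300, widths[0]))           # a fixed batch of 300 examples, as in the text
for name, act in [("sigmoid", sigmoid), ("tanh", tanh), ("softsign", softsign)]:
    print(name, layer_activation_stats(x, weights, act))
```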
Conclusion
  • The final consideration that the authors care about is the success of training with different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes.
  • The authors optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set the authors obtained 50.47% with a depth five hyperbolic tangent network with normalized initialization.
  • These results illustrate the effect of the choice of activation and initialization.
  • On Shapeset-3 × 2, because of the task difficulty, the authors observe important saturations during learning, which might explain why the normalized initialization or the softsign effects are more visible
Tables
  • Table1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005
Contributions
  • Finds that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation
  • Finds that a new non-linearity that saturates less can often be beneficial
  • Proposes a new initialization scheme that brings substantially faster convergence
  • Focuses on analyzing what may be going wrong with good old multilayer neural networks
  • Evaluates the effect of the choice of activation function and initialization procedure on how activations and gradients vary across layers and during training
References
  • Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2, 1–127. Also published as a book, Now Publishers, 2009.
  • Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. NIPS 19 (pp. 153–160). MIT Press.
  • Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 157–166.
  • Bergstra, J., Desjardins, G., Lamblin, P., & Bengio, Y. (2009). Quadratic polynomials learn better image features (Technical Report 1337). Département d'Informatique et de Recherche Opérationnelle, Université de Montréal.
  • Bradley, D. (2009). Learning in modular systems. Doctoral dissertation, The Robotics Institute, Carnegie Mellon University.
  • Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML 2008.
  • Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS 2009 (pp. 153–160).
  • Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
  • Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Technical Report). University of Toronto.
  • Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009). Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10, 1–40.
  • Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. ICML 2007.
  • LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
  • LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998b). Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag.
  • Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. NIPS 21 (pp. 1081–1088).
  • Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. NIPS 19.
  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
  • Solla, S. A., Levin, E., & Fleisher, M. (1988). Accelerated learning in layered neural networks. Complex Systems, 2, 625–639.
  • Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. ICML 2008.
  • Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. ICML 2008 (pp. 1168–1175). New York, NY, USA: ACM.
  • Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114–128.
Authors
Xavier Glorot
Yoshua Bengio