Layer Normalization

CoRR abs/1607.06450, 2016.

Keywords:
subjectivity/objectivity classification, training time, recurrent neural network, statistics, batch normalization
TL;DR:
We introduced layer normalization to speed up the training of neural networks

Abstract:

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
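For reference, the statistics described above are computed over the H hidden units of a layer for a single training case; in the paper's notation,

    \mu^l = \frac{1}{H} \sum_{i=1}^{H} a_i^l, \qquad
    \sigma^l = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left( a_i^l - \mu^l \right)^2 },

and each unit keeps its own adaptive gain $g_i$ and bias $b_i$, applied after the normalization but before the non-linearity, giving $h_i = f\big(\tfrac{g_i}{\sigma}\,(a_i - \mu) + b_i\big)$.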


Introduction
  • Deep neural networks trained with some version of Stochastic Gradient Descent have been shown to substantially outperform previous approaches on various supervised learning tasks in computer vision [Krizhevsky et al, 2012] and speech processing [Hinton et al, 2012].
  • Batch normalization [Ioffe and Szegedy, 2015] has been proposed to reduce training time by including additional normalization stages in deep neural networks.
  • Consider the l-th hidden layer in a deep feed-forward neural network, and let $a^l$ be the vector of summed inputs to the neurons in that layer.
  • The method normalizes the summed inputs to each hidden unit over the training cases.
  • For the i-th summed input in the l-th layer, batch normalization rescales the summed inputs according to their variances under the data distribution: $\bar{a}_i^l = \frac{g_i^l}{\sigma_i^l}\,(a_i^l - \mu_i^l)$, where $\mu_i^l$ and $\sigma_i^l$ are the expectation and standard deviation of $a_i^l$ over the training data (the two statistics are contrasted in the sketch below).
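  The following NumPy sketch contrasts where the two methods compute their statistics. It is illustrative only: the array shapes and epsilon are arbitrary, and the learned gain/bias terms and batch normalization's running averages are omitted.

    import numpy as np

    def batch_norm(a, eps=1e-5):
        # Batch normalization: one mean/std per hidden unit, estimated
        # across the mini-batch dimension (axis 0).
        mu = a.mean(axis=0, keepdims=True)
        sigma = a.std(axis=0, keepdims=True)
        return (a - mu) / (sigma + eps)

    def layer_norm(a, eps=1e-5):
        # Layer normalization: one mean/std per training case, estimated
        # across all hidden units of the layer (axis 1).
        mu = a.mean(axis=1, keepdims=True)
        sigma = a.std(axis=1, keepdims=True)
        return (a - mu) / (sigma + eps)

    a = np.random.randn(4, 8)        # 4 training cases, 8 summed inputs each
    print(batch_norm(a).shape)       # (4, 8), each column normalized over the batch
    print(layer_norm(a).shape)       # (4, 8), each row normalized over the layer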
Highlights
  • Deep neural networks trained with some version of Stochastic Gradient Descent have been shown to substantially outperform previous approaches on various supervised learning tasks in computer vision [Krizhevsky et al, 2012] and speech processing [Hinton et al, 2012]
  • We refer the reader to the appendix for a description of how layer normalization is applied to GRU
  • We show how layer normalization compares with batch normalization on the well-studied permutation invariant MNIST classification problem
  • We introduced layer normalization to speed up the training of neural networks
  • We provided a theoretical analysis that compared the invariance properties of layer normalization with batch normalization and weight normalization
  • We showed that layer normalization is invariant to per-training-case feature shifting and scaling (a small numerical check follows this list)
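  As a quick numerical check of that last claim (a sketch with arbitrary random inputs, not code from the paper), re-scaling and shifting all the summed inputs of a single training case leaves its layer-normalized output essentially unchanged:

    import numpy as np

    def layer_norm(a, eps=1e-5):
        # Normalize each training case (row) by its own mean and std over hidden units.
        mu = a.mean(axis=-1, keepdims=True)
        sigma = a.std(axis=-1, keepdims=True)
        return (a - mu) / (sigma + eps)

    x = np.random.randn(3, 16)       # 3 training cases, 16 summed inputs each
    scaled = x.copy()
    scaled[0] = 7.0 * x[0] + 3.0     # re-scale and shift one training case
    # mu and sigma are computed from the case itself, so the scale and shift cancel
    # (up to the small epsilon used for numerical stability).
    print(np.allclose(layer_norm(x)[0], layer_norm(scaled)[0], atol=1e-4))  # True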
Methods
  • Table 3 (excerpt): skip-thoughts results on SICK(r), SICK(ρ), SICK(MSE), MR, CR, SUBJ, and MPQA, comparing the original model of Kiros et al [2015] (SICK(r) = 0.848) with Ours, Ours + LN, and Ours + LN†; see Table 3 for the full results.

  • In the skip-thoughts model, a sentence is encoded with an encoder RNN and decoder RNNs are used to predict the surrounding sentences. Kiros et al [2015] showed that this model can produce generic sentence representations that perform well on several tasks without being fine-tuned.
  • Training this model is time-consuming, requiring several days of training to produce meaningful results.
Results
  • The authors perform experiments with layer normalization on 6 tasks, with a focus on recurrent neural networks: image-sentence ranking, question-answering, contextual language modelling, generative modelling, handwriting sequence generation and MNIST classification.
  • The best-performing models are evaluated on 5 separate test sets, each containing 1000 images and 5000 captions, for which the mean results are reported
  • Both models use Adam [Kingma and Ba, 2014] with the same initial hyperparameters and both models are trained using the same architectural choices as used in Vendrov et al [2016].
Conclusion
  • The authors introduced layer normalization to speed up the training of neural networks.
  • The authors provided a theoretical analysis that compared the invariance properties of layer normalization with batch normalization and weight normalization.
  • The authors showed that layer normalization is invariant to per training-case feature shifting and scaling.
  • The authors showed that recurrent neural networks benefit the most from the proposed method, especially for long sequences and small mini-batches.
Summary
  • Deep neural networks trained with some version of Stochastic Gradient Descent have been shown to substantially outperform previous approaches on various supervised learning tasks in computer vision [Krizhevsky et al, 2012] and speech processing [Hinton et al, 2012].
  • The summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence, so applying batch normalization to RNNs appears to require different statistics for different time-steps (see the sketch after this summary).
  • This paper introduces layer normalization, a simple normalization method to improve the training speed for various neural network models.
  • We show that layer normalization works well for RNNs and improves both the training time and the generalization performance of several existing RNN models.
  • The authors show that initializing the gain parameter in the recurrent batch normalization layer to 0.1 makes a significant difference in the final performance of the model.
  • Applying either weight normalization or batch normalization using expected statistics is equivalent to having a different parameterization of the original feed-forward neural network.
  • Layer normalization is invariant to re-scaling of individual training cases, because the normalization scalars μ and σ in Eq (3) only depend on the current input data.
  • We compare how the model output changes between updating the gain parameters in the normalized GLM and updating the magnitude of the equivalent weights under original parameterization during learning.
  • We show that Riemannian metric along the magnitude of the incoming weights for the standard GLM is scaled by the norm of its input, whereas learning the gain parameters for the batch normalized and layer normalized models depends only on the magnitude of the prediction error.
  • We perform experiments with layer normalization on 6 tasks, with a focus on recurrent neural networks: image-sentence ranking, question-answering, contextual language modelling, generative modelling, handwriting sequence generation and MNIST classification.
  • We apply layer normalization to the recently proposed order-embeddings model of Vendrov et al [2016] for learning a joint embedding space of images and sentences.
  • In order to compare layer normalization to the recently proposed recurrent batch normalization [Cooijmans et al, 2016], we train a unidirectional attentive reader model on the CNN corpus, both introduced by Hermann et al [2015].
  • We let the model with layer normalization train for a total of a month, resulting in further performance gains across all but one task.
  • After 200 epochs, the baseline model converges to a variational log likelihood of 82.36 nats on the test data and the layer normalization model obtains 82.09 nats.
  • Figure 5 shows that layer normalization converges to a comparable log likelihood as the baseline model but is much faster.
  • We showed that recurrent neural networks benefit the most from the proposed method, especially for long sequences and small mini-batches.
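  The per-time-step application mentioned above can be sketched for a plain RNN cell as follows. This is a simplification rather than the paper's exact models (the experiments use LSTM/GRU variants); the weight shapes, tanh non-linearity, and epsilon are assumptions.

    import numpy as np

    def layer_norm(a, gain, bias, eps=1e-5):
        # Normalize one vector of summed inputs with its own mean and std,
        # then apply the learned per-unit gain and bias.
        mu = a.mean()
        sigma = a.std()
        return gain * (a - mu) / (sigma + eps) + bias

    def ln_rnn_step(x_t, h_prev, W_xh, W_hh, gain, bias):
        # The same normalization is applied at every time step, so no
        # separate per-time-step statistics need to be stored.
        a_t = W_xh @ x_t + W_hh @ h_prev   # summed inputs at this step
        return np.tanh(layer_norm(a_t, gain, bias))

    hidden, inputs = 32, 8
    rng = np.random.default_rng(0)
    W_xh = 0.1 * rng.normal(size=(hidden, inputs))
    W_hh = 0.1 * rng.normal(size=(hidden, hidden))
    gain, bias = np.ones(hidden), np.zeros(hidden)

    h = np.zeros(hidden)
    for x_t in rng.normal(size=(10, inputs)):   # a length-10 input sequence
        h = ln_rnn_step(x_t, h, W_xh, W_hh, gain, bias)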
Tables
  • Table1: Invariance properties under the normalization methods
  • Table2: Average results across 5 test splits for caption and image retrieval. R@K is Recall@K (high is good). Mean r is the mean rank (low is good). Sym corresponds to the symmetric baseline while OE indicates order-embeddings
  • Table3: Skip-thoughts results. The first two evaluation columns indicate Pearson and Spearman correlation, the third is mean squared error and the remaining indicate classification accuracy. Higher is better for all evaluations except MSE. Our models were trained for 1M iterations with the exception of (†) which was trained for 1 month (approximately 1.7M iterations)
Related work
  • Batch normalization has been previously extended to recurrent neural networks [Laurent et al, 2015, Amodei et al, 2015, Cooijmans et al, 2016]. The previous work [Cooijmans et al, 2016] suggests the best performance of recurrent batch normalization is obtained by keeping independent normalization statistics for each time-step. They also show that initializing the gain parameter in the recurrent batch normalization layer to 0.1 makes a significant difference in the final performance of the model. Our work is also related to weight normalization [Salimans and Kingma, 2016]. In weight normalization, instead of the variance, the L2 norm of the incoming weights is used to normalize the summed inputs to a neuron (a brief sketch of this parameterization follows this paragraph). Applying either weight normalization or batch normalization using expected statistics is equivalent to having a different parameterization of the original feed-forward neural network. Re-parameterization in the ReLU network was studied in Path-normalized SGD [Neyshabur et al, 2015]. Our proposed layer normalization method, however, is not a re-parameterization of the original neural network. The layer normalized model therefore has different invariance properties than the other methods, which we study in the following section.
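  For comparison, a minimal sketch of the weight normalization parameterization described above (variable names and sizes are assumptions; see Salimans and Kingma [2016] for the full method):

    import numpy as np

    def weight_norm_summed_input(v, g, x):
        # Weight normalization re-parameterizes the incoming weights of a neuron
        # as w = g * v / ||v||_2, so the summed input is normalized by the L2 norm
        # of v rather than by a variance estimate.
        w = g * v / np.linalg.norm(v)
        return w @ x

    v = np.random.randn(16)   # direction parameters for one neuron
    g = 1.0                   # learned magnitude (gain)
    x = np.random.randn(16)   # input to the neuron
    print(weight_norm_summed_input(v, g, x))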
Funding
  • This research was funded by grants from NSERC, CFI, and Google
Reference
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.
  • Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al. Large scale distributed deep networks. In NIPS, 2012.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
  • Cesar Laurent, Gabriel Pereyra, Philemon Brakel, Ying Zhang, and Yoshua Bengio. Batch normalized recurrent neural networks. arXiv preprint arXiv:1510.01378, 2015.
  • Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.
  • Tim Cooijmans, Nicolas Ballas, Cesar Laurent, and Aaron Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.
  • Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.
  • Behnam Neyshabur, Ruslan R. Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In NIPS, 2015.
  • Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 1998.
  • Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. In ICLR, 2016.
  • The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frederic Bastien, Justin Bayer, Anatoly Belikov, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
  • Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, 2015.
  • Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In NIPS, 2015.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015.
  • Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In SemEval-2014, 2014.
  • Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, 2005.
  • Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In ACM SIGKDD, 2004.
  • Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL, 2004.
  • Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 2005.
  • Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
  • Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In AISTATS, 2011.
  • Marcus Liwicki and Horst Bunke. IAM-OnDB: An on-line English sentence database acquired from handwritten text on a whiteboard. In ICDAR, 2005.
  • Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.