A Simple Way to Initialize Recurrent Networks of Rectified Linear Units

CoRR, 2015.

Keywords: frame error rate, Hessian-free optimization, long-term dependency, neural network, recurrent network

Abstract:

Learning long term dependencies in recurrent networks is difficult due to vanishing and exploding gradients. To overcome this difficulty, researchers have developed sophisticated optimization techniques and network architectures. In this paper, we propose a simpler solution that uses recurrent neural networks composed of rectified linear units. [...]

Introduction
  • Recurrent neural networks (RNNs) are very powerful dynamical systems and they are the natural way of using neural networks to map an input sequence to an output sequence, as in speech recognition and machine translation, or to predict the next term in a sequence, as in language modeling.
  • The most successful technique to date is the Long Short-Term Memory (LSTM) recurrent neural network, which uses stochastic gradient descent but changes the hidden units in such a way that the backpropagated gradients are much better behaved [16].
  • The LSTM's gates are logistic units with their own learned weights on connections coming from the input and from the memory cells at the previous time-step.
Highlights
  • Recurrent neural networks (RNNs) are very powerful dynamical systems and they are the natural way of using neural networks to map an input sequence to an output sequence, as in speech recognition and machine translation, or to predict the next term in a sequence, as in language modeling
  • Further developments of the HF approach look promising [35, 25] but are much harder to implement than popular simple methods such as stochastic gradient descent with momentum [34] or adaptive learning rates for each weight that depend on the history of its gradients [5, 14]
  • A second aim of this paper is to explore whether rectified linear units can be made to work well in Recurrent neural networks and whether the ease of optimizing them in feedforward nets transfers to Recurrent neural networks
  • We demonstrate that, with the right initialization of the weights, Recurrent neural networks composed of rectified linear units are relatively easy to train and are good at modeling long-range dependencies (a minimal code sketch of this initialization follows this list)
  • The results using the standard scanline ordering of the pixels show that this problem is so difficult that standard Recurrent neural networks fail to work, even with rectified linear units, whereas the IRNN achieves 3% test error rate which is better than most off-the-shelf linear classifiers [21]
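The initialization the highlights describe is simple enough to show directly. The following is a minimal NumPy sketch, not the authors' code: the recurrent weight matrix starts as the identity, the biases start at zero, and the hidden units are ReLUs; the small Gaussian used here for the input-to-hidden weights is an illustrative assumption.

```python
import numpy as np

def init_irnn(input_size, hidden_size, rng):
    """IRNN-style initialization: identity recurrent matrix, zero bias.
    The small-Gaussian input weights are an illustrative assumption."""
    W_hh = np.eye(hidden_size)                                  # recurrent weights = identity
    W_xh = rng.normal(0.0, 0.001, (hidden_size, input_size))    # small random input weights
    b_h = np.zeros(hidden_size)                                 # zero hidden bias
    return W_hh, W_xh, b_h

def irnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One recurrent step with ReLU hidden units:
    h_t = max(0, W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.maximum(0.0, W_hh @ h_prev + W_xh @ x_t + b_h)

rng = np.random.default_rng(0)
W_hh, W_xh, b_h = init_irnn(input_size=4, hidden_size=8, rng=rng)
h = np.abs(rng.normal(size=8))                      # any non-negative hidden state
h_next = irnn_step(h, np.zeros(4), W_hh, W_xh, b_h)
assert np.allclose(h, h_next)                       # with zero input, the state is copied unchanged
```

At initialization the update is an exact copy of the previous hidden state whenever the input contribution is zero, which is the LSTM-like no-decay behavior the Results section below refers to.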
Results
  • The authors demonstrate that, with the right initialization of the weights, RNNs composed of rectified linear units are relatively easy to train and are good at modeling long-range dependencies.
  • Their performance on test data is comparable with LSTMs, both for toy problems involving very long-range temporal structures and for real tasks like predicting the next word in a very large corpus of text.
  • This is the same behavior as LSTMs when their forget gates are set so that there is no decay, which makes it easy to learn very long-range temporal dependencies.
  • The authors compared IRNNs with LSTMs on a large language modeling task.
  • The authors compare IRNNs against LSTMs, RNNs that use tanh units and RNNs that use ReLUs with random Gaussian initialization.
  • It is observed that setting a higher initial forget gate bias for LSTMs can give better results for long term dependency problems.
  • The adding problem is a toy task, designed to examine the power of recurrent models in learning long-term dependencies [16, 15] (a data-generation sketch appears after this list).
  • The authors fixed the hidden states to have 100 units for all of the networks (LSTMs, RNNs and IRNNs).
  • The results using the standard scanline ordering of the pixels show that this problem is so difficult that standard RNNs fail to work, even with ReLUs, whereas the IRNN achieves 3% test error rate which is better than most off-the-shelf linear classifiers [21].
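For concreteness, here is one common formulation of the adding problem mentioned above: each input sequence has two channels, a stream of random values and a marker channel with exactly two entries set to 1, and the target is the sum of the two marked values. The sketch below generates such data; the sequence length, batch size, and value range are illustrative choices rather than the paper's exact protocol.

```python
import numpy as np

def make_adding_batch(batch_size, seq_len, rng):
    """Generate (inputs, targets) for the adding problem.
    inputs:  (batch, seq_len, 2) -- channel 0 holds uniform random values,
             channel 1 marks exactly two positions with 1.
    targets: (batch,) -- the sum of the two marked values."""
    values = rng.uniform(0.0, 1.0, size=(batch_size, seq_len))
    markers = np.zeros((batch_size, seq_len))
    targets = np.empty(batch_size)
    for i in range(batch_size):
        a, b = rng.choice(seq_len, size=2, replace=False)   # two distinct marked positions
        markers[i, a] = markers[i, b] = 1.0
        targets[i] = values[i, a] + values[i, b]
    inputs = np.stack([values, markers], axis=-1)
    return inputs, targets

rng = np.random.default_rng(0)
x, y = make_adding_batch(batch_size=32, seq_len=150, rng=rng)
print(x.shape, y.shape)   # (32, 150, 2) (32,)
```

Because the two marked values can sit far apart in the sequence, a recurrent model must carry information across many time steps, which is what makes this task a probe of long-term dependencies.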
Conclusion
  • As LSTMs have more parameters per time step, the authors compared them with an IRNN that had 4 layers and the same number of hidden units per layer.
  • For the speech task, the authors show not only that IRNNs work much better than RNNs composed of tanh units, but also that initialization with the full identity matrix is suboptimal when long-range effects are not needed (a sketch of the scaled-identity variant follows this list).
  • In general, on the speech recognition task the IRNN outperforms the RNN that uses tanh units and is comparable to the LSTM, though the authors do not rule out the possibility that, with very careful tuning of hyperparameters, the relative performance of LSTMs or IRNNs might change.
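The scaled-identity variant discussed above can be sketched with an off-the-shelf ReLU RNN. Below is a minimal example assuming PyTorch; the scale factor, layer sizes, input dimensionality, and the small standard deviation for the input weights are illustrative assumptions, not the paper's tuned hyperparameters.

```python
import torch
import torch.nn as nn

def make_scaled_irnn(input_size, hidden_size, scale=0.01):
    """Single-layer ReLU RNN whose recurrent matrix starts as scale * identity.
    scale = 1.0 recovers the full-identity IRNN; a small scale lets the
    network forget long-range effects more quickly."""
    rnn = nn.RNN(input_size, hidden_size, nonlinearity="relu", batch_first=True)
    with torch.no_grad():
        rnn.weight_hh_l0.copy_(scale * torch.eye(hidden_size))   # scaled identity
        rnn.weight_ih_l0.normal_(mean=0.0, std=0.001)            # small random input weights (assumption)
        rnn.bias_hh_l0.zero_()
        rnn.bias_ih_l0.zero_()
    return rnn

rnn = make_scaled_irnn(input_size=39, hidden_size=100, scale=0.01)   # sizes are illustrative
x = torch.randn(8, 200, 39)        # (batch, time, features)
outputs, h_n = rnn(x)
print(outputs.shape)               # torch.Size([8, 200, 100])
```

With scale = 1.0 this is the full-identity initialization used for the long-range tasks; shrinking the scale is the forgetting mechanism noted above for tasks where long-range effects are not needed.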
Tables
  • Table 1: Best hyperparameters found for the adding problems after grid search. lr is the learning rate, gc is the gradient clipping threshold, and fb is the forget gate bias. N/A means no hyperparameter combination gave a good result
  • Table 2: Best hyperparameters found for the pixel-by-pixel MNIST problems after grid search. lr is the learning rate, gc is the gradient clipping threshold, and fb is the forget gate bias
  • Table 3: Performance of recurrent methods on the 1 billion word benchmark
  • Table 4: Frame error rates of recurrent methods on the TIMIT phone recognition task
Funding
  • Proposes a simpler solution that uses recurrent neural networks composed of rectified linear units
  • Finds that our solution is comparable to a standard implementation of LSTMs on our four benchmarks: two toy problems involving long-range temporal structures, a large language modeling problem and a benchmark speech recognition problem
  • A second aim of this paper is to explore whether ReLUs can be made to work well in RNNs and whether the ease of optimizing them in feedforward nets transfers to RNNs
  • With the right initialization of the weights, RNNs composed of rectified linear units are relatively easy to train and are good at modeling long-range dependencies
  • Finds that for tasks that exhibit fewer long-range dependencies, scaling the identity matrix by a small scalar is an effective mechanism to forget long-range effects
References
  • [1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [2] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, and P. Koehn. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013.
  • [3] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing, 2012.
  • [4] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. A. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
  • [5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
  • [6] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 2000.
  • [7] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. The Journal of Machine Learning Research, 2003.
  • [8] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • [9] A. Graves. Generating sequences with recurrent neural networks. arXiv, 2013.
  • [10] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning, 2014.
  • [11] A. Graves, N. Jaitly, and A.-R. Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013.
  • [12] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.
  • [13] A. Graves, A.-R. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
  • [14] G. Hinton. Lecture 6.5, RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • [15] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks, 2001.
  • [16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
  • [17] N. Jaitly. Exploring Deep Learning Methods for Discovering Features in Speech Signals. PhD thesis, University of Toronto, 2014.
  • [18] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
  • [20] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning, 2012.
  • [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • [22] T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, 2014.
  • [23] J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning, 2010.
  • [24] J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-free optimization. In ICML, 2011.
  • [25] J. Martens and I. Sutskever. Training deep and recurrent neural networks with Hessian-free optimization. Neural Networks: Tricks of the Trade, 2012.
  • [26] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. A. Ranzato. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753, 2014.
  • [27] V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, 2010.
  • [28] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.
  • [29] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
  • [30] D. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • [31] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
  • [32] R. Socher, J. Bauer, C. D. Manning, and A. Y. Ng. Parsing with compositional vector grammars. In ACL, 2013.
  • [33] D. Sussillo and L. F. Abbott. Random walk initialization for training very deep networks. arXiv preprint arXiv:1412.6558, 2015.
  • [34] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, 2013.
  • [35] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning, pages 1017–1024, 2011.
  • [36] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
  • [37] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. arXiv preprint arXiv:1412.7449, 2014.
  • [38] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.
  • [39] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
  • [40] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, and J. Dean. On rectified linear units for speech processing. In IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.