Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

CoRR, 2014.

Keywords:
long short-term memory, challenging task, neural networks, signal modeling, variable length

Abstract:

In this paper we compare different types of recurrent units in recurrent neural networks (RNNs). In particular, we focus on more sophisticated units that implement a gating mechanism, such as a long short-term memory (LSTM) unit and a recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on the tasks of polyphonic music modeling and raw speech signal modeling.

Introduction
  • Recurrent neural networks have recently shown promising results in many machine learning tasks, especially when input and/or output are of variable length [see, e.g., Graves, 2012].
  • One interesting observation the authors make from these recent successes is that almost none of them were achieved with a vanilla recurrent neural network.
  • Rather, it was a recurrent neural network with sophisticated recurrent hidden units, such as long short-term memory units [Hochreiter and Schmidhuber, 1997], that was used in those successful applications.
  • The RNN handles the variable-length sequence by having a recurrent hidden state whose activation at each time is dependent on that of the previous time.
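As a minimal sketch of that recurrence (illustrative NumPy code, not the authors' Theano implementation; weight names and dimensions are assumptions), a traditional tanh unit computes its hidden state from the current input and the previous hidden state:

    import numpy as np

    def vanilla_rnn_step(x_t, h_prev, W, U, b):
        """One step of a traditional tanh recurrent unit: the new hidden state
        depends on the current input and on the previous hidden state."""
        return np.tanh(W @ x_t + U @ h_prev + b)

    # Illustrative sizes; these are assumptions, not the paper's configuration.
    input_dim, hidden_dim = 4, 8
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    b = np.zeros(hidden_dim)

    # A variable-length sequence is handled by carrying the hidden state forward.
    sequence = [rng.normal(size=input_dim) for _ in range(5)]
    h = np.zeros(hidden_dim)
    for x_t in sequence:
        h = vanilla_rnn_step(x_t, h, W, U, b)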
Highlights
  • Recurrent neural networks have recently shown promising results in many machine learning tasks, especially when input and/or output are of variable length [see, e.g., Graves, 2012]
  • The long short-term memory-recurrent neural network performed best on the Ubisoft A dataset, while the gated recurrent unit-recurrent neural network performed best on Ubisoft B
  • In this paper we empirically evaluated recurrent neural networks (RNNs) with three widely used recurrent units: (1) a traditional tanh unit, (2) a long short-term memory (LSTM) unit, and (3) a recently proposed gated recurrent unit (GRU)
  • Our evaluation focused on the task of sequence modeling on a number of datasets including polyphonic music data and raw speech signal data
  • The evaluation clearly demonstrated the superiority of the gated units, both the long short-term memory unit and the gated recurrent unit, over the traditional tanh unit
  • In order to understand better how a gated unit helps learning, and to separate out the contribution of each component of the gating mechanism (for instance, the individual gates in the long short-term memory unit or the gated recurrent unit), more thorough experiments will be required in the future
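To make the gating mechanism referred to above concrete, here is a minimal sketch of a single LSTM step in NumPy. It is not the authors' implementation (the paper's models were built with Theano), and the parameter names, shapes and random initialization are illustrative assumptions only:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, h_prev, c_prev, P):
        """One LSTM step: input (i), forget (f) and output (o) gates control how
        the memory cell c is updated and exposed as the hidden state h."""
        i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])        # input gate
        f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])        # forget gate
        o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])        # output gate
        c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])  # candidate cell content
        c = f * c_prev + i * c_tilde   # additive memory-cell update
        h = o * np.tanh(c)             # hidden state exposed to the rest of the network
        return h, c

    # Illustrative sizes and random parameters (assumptions for the sketch).
    n_in, n_h = 4, 8
    rng = np.random.default_rng(0)
    P = {k: rng.normal(scale=0.1, size=(n_h, n_in if k.startswith("W") else n_h))
         for k in ["Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wc", "Uc"]}
    P.update({k: np.zeros(n_h) for k in ["bi", "bf", "bo", "bc"]})
    h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), P)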
Methods
  • Experiments Setting

    4.1 Tasks and Datasets

    The authors compare the LSTM unit, GRU and tanh unit in the task of sequence modeling.
  • Sequence modeling aims at learning a probability distribution over sequences, as in Eq. (3), by maximizing the log-likelihood of the model given a set of $N$ training sequences: $\max_{\theta} \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \log p\left(x_t^n \mid x_1^n, \ldots, x_{t-1}^n; \theta\right)$, where $\theta$ is the set of model parameters and $T_n$ is the length of the $n$-th sequence.
  • For the polyphonic music modeling, the authors use four polyphonic music datasets from [Boulanger-Lewandowski et al., 2012]: Nottingham, JSB Chorales, MuseData and Piano-midi.
  • These datasets contain sequences in which each symbol is, respectively, a 93-, 96-, 105-, and 108-dimensional binary vector.
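The objective above can be sketched in code as follows. This is an illustrative NumPy sketch, not the paper's Theano implementation: it assumes a per-dimension Bernoulli output for the binary music symbols and a hypothetical recurrent transition step_fn (which could be a tanh, LSTM or GRU step):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sequence_nll(sequence, h0, step_fn, V, c, eps=1e-8):
        """Negative log-likelihood of one sequence of binary vectors.
        Each symbol x_t is predicted from the hidden state summarizing x_1 ... x_{t-1},
        with every output dimension treated as an independent Bernoulli variable."""
        h, nll = h0, 0.0
        for x_t in sequence:
            p = sigmoid(V @ h + c)  # p(x_t[d] = 1 | x_1, ..., x_{t-1}) for each dimension d
            nll -= np.sum(x_t * np.log(p + eps) + (1 - x_t) * np.log(1 - p + eps))
            h = step_fn(x_t, h)     # advance the recurrent state with the observed symbol
        return nll

Training then maximizes the average log-likelihood over the N training sequences, i.e. it minimizes the mean of sequence_nll across the training set; Table 2 reports such average negative log-probabilities.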
Results
  • Results and Analysis

    Table 2 lists all the results from the experiments. In the case of the polyphonic music datasets, the GRU-RNN outperformed all the others (LSTM-RNN and tanh-RNN) on all the datasets except for the Nottingham.
  • In the case of the music datasets (Fig. 2), the authors see that the GRU-RNN makes faster progress in terms of both the number of updates and actual CPU time.
  • Considering the Ubisoft datasets (Fig. 3), it is clear that although the computational requirement for each update of the tanh-RNN is much smaller than that of the other models, it did not make much progress per update and eventually stopped making any progress at a much worse level
Conclusion
  • The most prominent feature shared between these units is the additive component of their update from t to t + 1, which is lacking in the traditional recurrent unit.
  • The evaluation clearly demonstrated the superiority of the gated units, both the LSTM unit and the GRU, over the traditional tanh unit.
  • This was more evident with the more challenging task of raw speech signal modeling.
  • In order to understand better how a gated unit helps learning, and to separate out the contribution of each component of the gating mechanism (for instance, the individual gates in the LSTM unit or the GRU), more thorough experiments will be required in the future
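As a minimal sketch of this additive component (illustrative NumPy code with assumed parameter names; biases are omitted for brevity, and this is not the authors' implementation), the GRU keeps the previous state and adds new content in a proportion set by its update gate, rather than overwriting the state as the traditional tanh unit does:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, h_prev, P):
        """One GRU step. Unlike the traditional unit, which overwrites its state with
        h_t = tanh(W x_t + U h_{t-1}), the GRU forms a convex combination of the
        previous state and a candidate state, controlled by the update gate."""
        z = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev)              # update gate
        r = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev)              # reset gate
        h_tilde = np.tanh(P["Wh"] @ x_t + P["Uh"] @ (r * h_prev))  # candidate state
        return (1.0 - z) * h_prev + z * h_tilde                    # additive, gated update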
Tables
  • Table1: The sizes of the models tested in the experiments
  • Table2: The average negative log-probabilities of the training and test sets
Funding
  • We acknowledge the support of the following agencies for research funding and computing support: NSERC, Calcul Quebec, Compute Canada, the Canada Research Chairs and CIFAR
Reference
  • D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
  • Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
  • Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In Proc. ICASSP 38, 2013.
  • J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281–305, 2012.
  • J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.
  • N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML'12). ACM, 2012. URL http://icml.cc/discuss/2012/590.html.
  • K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
  • I. J. Goodfellow, D. Warde-Farley, P. Lamblin, V. Dumoulin, M. Mirza, R. Pascanu, J. Bergstra, F. Bastien, and Y. Bengio. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013.
  • A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence. Springer, 2012.
  • A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
  • A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP'2013, pages 6645–6649. IEEE, 2013.
  • G. Hinton. Neural networks for machine learning. Coursera, video lectures, 2012.
  • S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. URL http://www7.informatik.tu-muenchen.de/~Ehochreit.
  • S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proc. ICML'2011. ACM, 2011.
  • R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML'13). ACM, 2013. URL http://icml.cc/2013/.
  • I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215, 2014.