Learning Longer Memory in Recurrent Neural Networks

International Conference on Learning Representations (ICLR), 2015.

Abstract:

The recurrent neural network is a powerful model that learns temporal patterns in sequential data. For a long time, it was believed that recurrent networks are difficult to train using simple optimizers, such as stochastic gradient descent, due to the so-called vanishing gradient problem. In this paper, we show that learning longer-term patterns in real data, such as in natural language, is perfectly possible using gradient descent, simply by introducing a structural constraint on the recurrent weight matrix. We evaluate the model on language modeling tasks, where it obtains performance similar to the much more complex Long Short Term Memory (LSTM) networks.

Introduction
  • Models of sequential data, such as natural language, speech and video, are the core of many machine learning applications.
  • Models based on neural networks have been very successful recently, obtaining state-of-the-art performance in automatic speech recognition (Dahl et al., 2012), language modeling (Mikolov, 2012) and video classification (Simonyan & Zisserman, 2014).
  • These models are mostly based on two families of neural networks: feedforward neural networks and recurrent neural networks.
  • While feedforward models with a fixed-size context window work well in practice, fixing the window size makes long-term dependencies harder to learn, and extending the window comes only at the cost of a linear increase in the number of parameters, as the short sketch below illustrates.
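The linear growth mentioned above is easy to see by counting the first-layer weights of a window-based feedforward model; the sizes below are hypothetical and purely for illustration.

```python
# Illustrative only: in a fixed-window feedforward language model, the first
# hidden layer sees the concatenation of N word embeddings, so its weight
# count grows linearly with the window size N. Sizes are hypothetical.
def first_layer_weights(window_size: int, embed_dim: int, hidden_dim: int) -> int:
    return window_size * embed_dim * hidden_dim

for n in (2, 4, 8):
    print(n, first_layer_weights(n, embed_dim=100, hidden_dim=200))
# prints: 2 40000, 4 80000, 8 160000 -- doubling the window doubles the weights
```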
Highlights
  • Models of sequential data, such as natural language, speech and video, are the core of many machine learning applications
  • We have shown that learning longer-term patterns in real data using recurrent networks is perfectly doable with standard stochastic gradient descent, just by introducing a structural constraint on the recurrent weight matrix
  • The model can be interpreted as having a quickly changing hidden layer that focuses on short-term patterns and a slowly updating context layer that retains longer-term information (see the sketch after this list)
  • Empirical comparison of the Structurally Constrained Recurrent Network (SCRN) to the Long Short Term Memory (LSTM) recurrent network shows very similar behavior in two language modeling tasks, with similar gains over the simple recurrent network (SRN) when all models are tuned for the best accuracy
  • The Structurally Constrained Recurrent Network shines in cases where the size of the models is constrained, and with a similar number of parameters it often outperforms the Long Short Term Memory by a large margin
  • Our model greatly simplifies the analysis and implementation of recurrent networks that are capable of learning longer-term patterns
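As a concrete illustration of the fast-hidden/slow-context split described in the highlights, here is a minimal NumPy sketch of one SCRN step, assuming the formulation in which the context units are linear and their self-recurrence is fixed to a scalar close to one; all names, sizes and initializations are illustrative rather than the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class SCRNCell:
    """One step of a structurally constrained recurrent net (illustrative sketch)."""

    def __init__(self, n_in, n_hidden, n_context, n_out, alpha=0.95, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1
        self.alpha = alpha  # decay of the context units, fixed close to 1
        self.B = rng.normal(0, scale, (n_context, n_in))      # input   -> context
        self.A = rng.normal(0, scale, (n_hidden, n_in))       # input   -> hidden
        self.P = rng.normal(0, scale, (n_hidden, n_context))  # context -> hidden
        self.R = rng.normal(0, scale, (n_hidden, n_hidden))   # hidden  -> hidden
        self.U = rng.normal(0, scale, (n_out, n_hidden))      # hidden  -> output
        self.V = rng.normal(0, scale, (n_out, n_context))     # context -> output

    def step(self, x, h_prev, s_prev):
        # Slow context units: no nonlinearity, self-recurrence fixed to alpha * I,
        # so they act as a leaky average of the (projected) input.
        s = (1.0 - self.alpha) * (self.B @ x) + self.alpha * s_prev
        # Fast hidden units: SRN-style update that also receives the context.
        h = sigmoid(self.P @ s + self.A @ x + self.R @ h_prev)
        # Output distribution computed from both hidden and context states.
        y = softmax(self.U @ h + self.V @ s)
        return y, h, s

# Toy usage: feed a short sequence of one-hot "word" vectors through the cell.
cell = SCRNCell(n_in=20, n_hidden=8, n_context=4, n_out=20)
h, s = np.zeros(8), np.zeros(4)
for w in (3, 7, 1, 3):
    x = np.zeros(20)
    x[w] = 1.0
    y, h, s = cell.step(x, h, s)
print(y.shape, round(float(y.sum()), 6))  # (20,) 1.0
```

Because the context recurrence is fixed near identity, gradients flowing through the context units decay much more slowly than through the fully connected hidden units, which is what lets them retain longer-term information.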
Methods
  • The authors evaluate the model on language modeling tasks using two datasets, reporting results as perplexity (see the sketch after this list). The first dataset is the Penn Treebank Corpus, which consists of 930K words in the training set.
  • The second dataset, which is moderately sized, is called Text8.
  • It is composed of a pre-processed version of the first 100 million characters of a Wikipedia dump.
  • The authors split it into a training part and a development set, which they use to report performance.
  • To simplify reproducibility of the results, the authors released both the SCRN code and the scripts that construct the datasets.
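For completeness, here is a tiny illustrative helper (not the authors' code) that computes per-word perplexity from the probabilities a model assigns to the target words of a held-out set.

```python
import numpy as np

def perplexity(target_probs):
    """Per-word perplexity from the probabilities a language model assigned
    to each target word of a held-out set (illustrative helper)."""
    p = np.asarray(target_probs, dtype=np.float64)
    return float(np.exp(-np.mean(np.log(p))))

# Example: a model that assigns probability 0.01 to every word has perplexity 100.
print(perplexity([0.01] * 5))  # -> ~100.0
```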
Results
  • Results on the Penn Treebank Corpus: the authors first report results using both small and moderately sized models.
  • Table 1 shows that the SCRN outperforms the SRN architecture even with far fewer parameters (see the parameter-count sketch after this list)
  • This can be seen by comparing the performance of an SCRN with 40 hidden and 10 contextual units to an SRN with 300 hidden units.
  • The authors' second experiment involves the Text8 corpus, which is significantly larger than the Penn Treebank
  • As this dataset contains various articles from Wikipedia, longer-term information plays a bigger role than in the previous experiments.
  • Such a model is much better than an SRN with 300 hidden units
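To make the parameter comparisons concrete, the rough counts below tally only the input and recurrent weights of each model's hidden layer (ignoring the output softmax and biases), which is also where the 4x LSTM-versus-SRN factor noted in Table 1 comes from; the vocabulary size and the accounting itself are simplifying assumptions, not the authors' exact bookkeeping.

```python
# Rough hidden-layer parameter counts: only input and recurrent weights are
# tallied; output softmax and biases are ignored. Vocabulary size is an
# assumed approximation. Illustrative, not the authors' exact accounting.

def srn_hidden_params(vocab, hidden):
    return vocab * hidden + hidden * hidden          # input + recurrent weights

def lstm_hidden_params(vocab, hidden):
    return 4 * (vocab * hidden + hidden * hidden)    # four gates -> the 4x in Table 1

def scrn_hidden_params(vocab, hidden, context):
    # B (input->context), A (input->hidden), P (context->hidden), R (hidden->hidden);
    # the context self-recurrence is a fixed scalar, so it adds no learned weights.
    return vocab * context + vocab * hidden + context * hidden + hidden * hidden

V = 10_000  # approximate Penn Treebank vocabulary size
print(f"SRN  h=300      : {srn_hidden_params(V, 300):,}")       # 3,090,000
print(f"SCRN h=40, c=10 : {scrn_hidden_params(V, 40, 10):,}")   # 502,000
print(f"LSTM h=300      : {lstm_hidden_params(V, 300):,}")      # 12,360,000
```

Under these assumptions, the small SCRN uses roughly a sixth of the hidden-layer weights of the 300-unit SRN while, per Table 1, still outperforming it.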
Conclusion
  • The authors have shown that learning longer-term patterns in real data using recurrent networks is perfectly doable with standard stochastic gradient descent, just by introducing a structural constraint on the recurrent weight matrix.
  • The SCRN shines in cases where the size of the models is constrained, and with a similar number of parameters it often outperforms the LSTM by a large margin.
  • This can be especially useful when the amount of training data is practically unlimited and even models with thousands of hidden neurons severely underfit the training datasets.
  • The authors published code that allows the experiments described in this paper to be reproduced.
Tables
  • Table1: Results on Penn Treebank Corpus: n-gram baseline, simple recurrent nets (SRN), long short term memory RNNs (LSTM) and structurally constrained recurrent nets (SCRN). Note that LSTM models have 4x more parameters than SRNs for the same size of hidden layer
  • Table2: Perplexity on the test set of the Penn Treebank Corpus with and without learning the weights of the contextual features (a sketch of the learned-weights variant follows this list). Note that in these experiments we used a hierarchical softmax
  • Table3: Structurally constrained recurrent nets: perplexity for various sizes of the contextual layer, reported on the development set of the Text8 dataset
  • Table4: Comparison of various recurrent network architectures, evaluated on the development set of Text8
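Table 2 contrasts a fixed context decay with learned contextual weights. The snippet below sketches one plausible parametrization of the learned variant, where each context unit gets its own decay in (0, 1) obtained by squashing a trainable parameter; this is an assumption for illustration, not the authors' exact implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_step(x, s_prev, B, alpha):
    # alpha is either a fixed scalar decay shared by all context units, or a
    # vector of per-unit decays in (0, 1) derived from trainable parameters.
    return (1.0 - alpha) * (B @ x) + alpha * s_prev

rng = np.random.default_rng(0)
n_in, n_context = 50, 10
B = rng.normal(0.0, 0.1, (n_context, n_in))
x = rng.normal(size=n_in)
s0 = np.zeros(n_context)

s_fixed = context_step(x, s0, B, alpha=0.95)            # fixed decay variant
a = rng.normal(size=n_context)                          # trainable raw parameters
s_learned = context_step(x, s0, B, alpha=sigmoid(a))    # per-unit learned decays
```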
Contributions
  • Shows that learning longer-term patterns in real data, such as in natural language, is perfectly possible using gradient descent
  • Evaluates our model on language modeling tasks on benchmark datasets, and obtains similar performance to the much more complex Long Short Term Memory networks
  • Proposes a simple modification of the SRN to partially solve the vanishing gradient problem
  • Demonstrates that constraining a part of the recurrent matrix to be close to identity can drive some hidden units, called context units, to behave like a cache model that captures long-term information, similar to the topic of a text
  • Shows that our model can obtain competitive performance compared to the state-of-the-art sequence prediction model, LSTM, on language modeling datasets
References
  • Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.
  • Bengio, Yoshua, Boulanger-Lewandowski, Nicolas, and Pascanu, Razvan. Advances in optimizing recurrent networks. In ICASSP, 2013.
  • Dahl, George E, Yu, Dong, Deng, Li, and Acero, Alex. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30–42, 2012.
  • Elman, Jeffrey L. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
  • Goodman, Joshua. Classes for fast maximum entropy training. In Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP’01). 2001 IEEE International Conference on, volume 1, pp. 561–564. IEEE, 2001a.
  • Goodman, Joshua T. A bit of progress in language modeling. Computer Speech & Language, 15(4): 403–434, 2001b.
  • Graves, Alex and Schmidhuber, Juergen. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 545–552, 2009.
  • Graves, Alex and Schmidhuber, Jurgen. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
  • Hochreiter, Sepp. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02): 107–116, 1998.
  • Hochreiter, Sepp and Schmidhuber, Jurgen. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997.
  • Jaeger, Herbert, Lukosevicius, Mantas, Popovici, Dan, and Siewert, Udo. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3):335–352, 2007.
  • Jordan, Michael I. Attractor dynamics and parallelism in a connectionist sequential machine. Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 531–546, 1987.
  • Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv preprint arXiv:1503.01007, 2015.
  • Koehn, Philipp, Hoang, Hieu, Birch, Alexandra, Callison-Burch, Chris, Federico, Marcello, Bertoldi, Nicola, Cowan, Brooke, Shen, Wade, Moran, Christine, Zens, Richard, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180. Association for Computational Linguistics, 2007.
  • Kuhn, Roland and De Mori, Renato. A cache-based natural language model for speech recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(6):570–583, 1990.
  • LeCun, Yann, Bottou, Leon, Orr, Genevieve, and Muller, Klaus. Efficient backprop. Neural Networks: Tricks of the Trade, pp. 546–546, 1998.
  • Mikolov, Tomas. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.
  • Mikolov, Tomas and Zweig, Geoffrey. Context dependent recurrent neural network language model. In SLT, pp. 234–239, 2012.
  • Mikolov, Tomas, Kombrink, Stefan, Burget, Lukas, Cernocky, JH, and Khudanpur, Sanjeev. Extensions of recurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pp. 5528–5531. IEEE, 2011.
  • Mozer, Michael C. A focused back-propagation algorithm for temporal pattern recognition. Complex systems, 3(4):349–381, 1989.
  • Mozer, Michael C. Neural net architectures for temporal sequence processing. In Santa Fe Institute Studies in The Sciences of Complexity, volume 15, pp. 243–243. Addison-Wesley Publishing Co., 1993.
  • Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807– 814, 2010.
  • Pachitariu, Marius and Sahani, Maneesh. Regularization and nonlinearities for neural language models: when are they needed? arXiv preprint arXiv:1301.5650, 2013.
  • Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning internal representations by error propagation. Technical report, DTIC Document, 1985.
  • Simonyan, Karen and Zisserman, Andrew. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp. 568–576, 2014.
  • Sundermeyer, Martin, Schluter, Ralf, and Ney, Hermann. LSTM neural networks for language modeling. In INTERSPEECH, 2012.
  • Werbos, Paul J. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.
  • Williams, Ronald J and Zipser, David. Gradient-based learning algorithms for recurrent networks and their computational complexity. Back-propagation: Theory, architectures and applications, pp. 433–486, 1995.
  • Young, Steve, Evermann, Gunnar, Gales, Mark, Hain, Thomas, Kershaw, Dan, Liu, Xunying, Moore, Gareth, Odell, Julian, Ollason, Dave, Povey, Dan, et al. The HTK book, volume 2. Entropic Cambridge Research Laboratory Cambridge, 1997.
  • Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.