Simple Recurrent Units for Highly Parallelizable Recurrence

EMNLP, pp. 4470-4481, 2018.

Keywords:
Stanford Question Answering Dataset, recurrent network, recurrent unit, Long Short-term Memory, bits per character

Abstract:

Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation […]

Highlights
Results
  • Figure 3 shows validation performance relative to training time for SRU, cuDNN LSTM and the CNN model.
  • Our 5-layer model obtains an average improvement of 0.7 test BLEU and an improvement of 0.5 BLEU when comparing the best result achieved by each model across three runs.
  • The 8-layer SRU model achieves validation and test bits per character (BPC) of 1.21, outperforming previous best reported results of LSTM, QRNN and recurrent highway networks (RHN).
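Bits per character is the model's average character-level cross-entropy expressed in base 2. A minimal helper for converting a natural-log cross-entropy into BPC (the function name and the example loss value below are illustrative, not taken from the paper):

```python
import math

def bpc_from_nll(nll_nats: float) -> float:
    """Convert an average per-character negative log-likelihood (in nats) to bits per character."""
    return nll_nats / math.log(2)

# An average cross-entropy of roughly 0.839 nats per character corresponds to ~1.21 BPC,
# the figure quoted for the 8-layer model above.
print(round(bpc_from_nll(0.839), 2))  # 1.21
```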
Conclusion
  • This work presents Simple Recurrent Unit (SRU), a scalable recurrent architecture that operates as fast as feed-forward and convolutional units.
  • The authors confirm the effectiveness of SRU on multiple natural language tasks ranging from classification to translation.
  • Trading capacity with layers: SRU achieves high parallelization by simplifying the hidden-to-hidden dependency (see the sketch after this list).
  • This simplification is likely to reduce the representational power of a single layer and should be balanced to avoid performance loss.
  • The authors' empirical results on various tasks confirm this hypothesis.
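To make the simplified hidden-to-hidden dependency concrete, the sketch below shows the general structure of such a light recurrence in PyTorch (illustrative only; it is not the paper's exact cell or its fused CUDA implementation). In an LSTM the matrix multiplications at step t consume the previous hidden state, so they cannot start before step t-1 finishes; here they consume only the inputs, so they can be batched over all time steps, and the remaining sequential loop is purely element-wise.

```python
import torch

def light_recurrence(x, W, Wf, bf, c0):
    """Illustrative light recurrent layer (a sketch, not the exact SRU cell).

    x:  (seq_len, batch, d_in) input sequence
    W:  (d_in, d_hid)          candidate projection
    Wf: (d_in, d_hid)          forget-gate projection
    bf: (d_hid,)               forget-gate bias
    c0: (batch, d_hid)         initial state
    """
    # The heavy matrix multiplications depend only on the inputs, so they are
    # computed for all time steps in one batched call.
    u = x @ W                         # (seq_len, batch, d_hid) candidate values
    f = torch.sigmoid(x @ Wf + bf)    # (seq_len, batch, d_hid) forget gates

    # The remaining loop is purely element-wise and therefore cheap to fuse.
    c, states = c0, []
    for t in range(x.size(0)):
        c = f[t] * c + (1.0 - f[t]) * u[t]
        states.append(c)
    return torch.stack(states)        # (seq_len, batch, d_hid)
```

Because the per-step work is element-wise, every hidden dimension of a step can be updated independently and the loop can be fused into a single kernel, which is what allows such a unit to run at the speed of feed-forward and convolutional layers.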
Summary
  • Introduction:

    Recurrent neural networks (RNN) are at the core of state-of-the-art approaches for a large number of natural language tasks, including machine translation (Cho et al., 2014; Bahdanau et al., 2015; Jean et al., 2015; Luong et al., 2015), language modeling (Zaremba et al., 2014; Gal and Ghahramani, 2016; Zoph and Le, 2016), opinion mining (Irsoy and Cardie, 2014), and situated language understanding (Mei et al., 2016; Misra et al., 2017; Suhr et al., 2018; Suhr and Artzi, 2018).
  • The difficulty of scaling recurrent networks arises from the time dependence of state computation.
  • In common architectures, such as Long Short-term Memory (LSTM; Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRU; Cho et al., 2014), the computation of each step is suspended until the complete execution of the previous step.
  • Recent translation models instead use only non-recurrent components, such as attention and convolution, to scale model training (Gehring et al., 2017; Vaswani et al., 2017).
  • Results:

    Figure 3 shows validation performance relative to training time for SRU, cuDNN LSTM and the CNN model.
  • Our 5-layer model obtains an average improvement of 0.7 test BLEU and an improvement of 0.5 BLEU when comparing the best result achieved by each model across three runs.
  • The 8-layer SRU model achieves validation and test bits per character (BPC) of 1.21, outperforming previous best reported results of LSTM, QRNN and recurrent highway networks (RHN).
  • Conclusion:

    This work presents Simple Recurrent Unit (SRU), a scalable recurrent architecture that operates as fast as feed-forward and convolutional units.
  • The authors confirm the effectiveness of SRU on multiple natural language tasks ranging from classification to translation.
  • Trading capacity with layers: SRU achieves high parallelization by simplifying the hidden-to-hidden dependency (a stacking sketch follows this summary).
  • This simplification is likely to reduce the representational power of a single layer and should be balanced to avoid performance loss.
  • The authors' empirical results on various tasks confirm this hypothesis.
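As a rough illustration of trading capacity with layers, the sketch below stacks several simplified recurrent layers with residual connections between them (illustrative PyTorch under assumed gate and connection choices, not the released SRU implementation): depth restores expressive power while every layer stays parallelizable over time.

```python
import torch
import torch.nn as nn

class LightRecurrentLayer(nn.Module):
    """One simplified recurrent layer; a sketch, not the exact SRU cell."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, 2 * d)   # candidate and forget gate, batched over time

    def forward(self, x):                 # x: (seq_len, batch, d)
        u, f = self.proj(x).chunk(2, dim=-1)
        f = torch.sigmoid(f)
        c, out = torch.zeros_like(x[0]), []
        for t in range(x.size(0)):        # only element-wise work per step
            c = f[t] * c + (1.0 - f[t]) * u[t]
            out.append(c)
        return torch.stack(out)

class LightRecurrentStack(nn.Module):
    """Depth regains the capacity lost by simplifying each individual layer."""
    def __init__(self, d, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(LightRecurrentLayer(d) for _ in range(num_layers))

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)              # residual connection between layers
        return x

# e.g. an 8-layer stack, mirroring the depth of the character-level model mentioned above
model = LightRecurrentStack(d=512, num_layers=8)
# model(torch.randn(100, 16, 512)).shape -> (100, 16, 512)
```

The residual connections and the specific gate layout are assumptions made for the sketch; the point is that each layer is cheap, so layers can be added without giving up parallelism.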
Tables
  • Table 1: Test accuracies on classification benchmarks (Section 4.1). The first block presents the best reported results of various methods. The second block compares SRU and other baselines given the same setup. For the SST dataset, we report average results of 5 runs. For the other datasets, we perform 3 independent trials of 10-fold cross validation (3×10 runs). The last column compares the wall-clock time (in seconds) to finish 100 epochs on the SST dataset.
  • Table 2: Exact match (EM) and F1 scores of various models on SQuAD (Section 4.2). We also report the total processing time per epoch and the time spent in RNN computations. SRU outperforms other models, and is more than five times faster than cuDNN LSTM. (The EM and F1 metrics are sketched briefly after this table list.)
  • Table 3: English→German translation results (Section 4.3). We perform 3 independent runs for each configuration. We select the best epoch based on the valid BLEU score for each run, and report the average over the three runs.
  • Table 4: Validation and test BPCs of different recurrent models on the Enwik8 dataset. The last column presents the training time per epoch. For SRU with projection, we set the projection dimension to 512.
  • Table 5: Ablation analysis on SQuAD. Components are successively removed and the EM scores are averaged over 4 runs.
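For context on the Table 2 metrics: exact match checks whether the predicted answer string equals a reference answer after normalization, and F1 is the harmonic mean of token-level precision and recall against the reference. A simplified sketch (lowercasing only; the official SQuAD script additionally strips articles and punctuation):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Simple Recurrent Unit", "Simple Recurrent Unit"))   # False
print(round(f1("the Simple Recurrent Unit", "Simple Recurrent Unit"), 2))  # 0.86
```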
Related work
  • Improving on common architectures for sequence processing has recently received significant attention (Greff et al., 2017; Balduzzi and Ghifary, 2016; Miao et al., 2016; Zoph and Le, 2016; Lee et al., 2017). One area of research involves incorporating word-level convolutions (i.e., n-gram filters) into recurrent computation (Lei et al., 2015; Bradbury et al., 2017; Lei et al., 2017). For example, Quasi-RNN (Bradbury et al., 2017) proposes to alternate convolutions and a minimalist recurrent pooling function and achieves significant speed-up over LSTM. While Bradbury et al. (2017) focus on the speed advantages of the network, Lei et al. (2017) study the theoretical characteristics of such computation and possible extensions. Their results suggest that simplified recurrence retains strong modeling capacity through layer stacking. This finding motivates the design of SRU for both high parallelization and representational power. SRU also relates to IRNN (Le et al., 2015), which uses an identity diagonal matrix to initialize hidden-to-hidden connections. SRU uses point-wise multiplication for hidden connections, which is equivalent to using a diagonal weight matrix. This can be seen as a constrained version of diagonal initialization.
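The equivalence noted above is easy to verify numerically: scaling a state vector element-wise by a weight vector gives the same result as multiplying it by the diagonal matrix built from that vector. A small NumPy check with arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(5)       # per-dimension recurrent weights
c = rng.standard_normal(5)       # previous state

pointwise = v * c                # point-wise hidden connection
diagonal = np.diag(v) @ c        # the equivalent diagonal weight-matrix form

assert np.allclose(pointwise, diagonal)
```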
Reference
  • Fabio Anselmi, Lorenzo Rosasco, Cheston Tan, and Tomaso A. Poggio. 2015. Deep convolutional networks are hierarchical kernel machines. CoRR, abs/1508.01084.
  • Jeremy Appleyard, Tomás Kociský, and Phil Blunsom. 2016. Optimizing performance of recurrent neural networks on GPUs. CoRR, abs/1604.01946.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.
  • David Balduzzi and Muhammad Ghifary. 2016. Strongly-typed recurrent neural networks. In International Conference on Machine Learning.
  • James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-recurrent neural networks. In Proceedings of the International Conference on Learning Representations.
  • Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Yining Chen, Sorcha Gilroy, Kevin Knight, and Jonathan May. 2018. Recurrent neural networks as weighted language recognizers. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2016. Hierarchical multiscale recurrent neural networks. CoRR, abs/1609.01704.
  • Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Amit Daniely, Roy Frostig, and Yoram Singer. 2016. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems.
  • Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. 2016. Persistent RNNs: Stashing recurrent weights on-chip. In International Conference on Machine Learning.
  • Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems.
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. In International Conference on Machine Learning.
  • Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
  • Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677.
  • Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9.
  • Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing.
  • Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Yoon Kim, Yacine Jernite, David A. Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.
  • Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations.
  • Oleksii Kuchaiev and Boris Ginsburg. 2017. Factorization tricks for LSTM networks. CoRR, abs/1703.10722.
  • Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. CoRR, abs/1504.00941.
  • Kenton Lee, Omer Levy, and Luke S. Zettlemoyer. 2017. Recurrent additive networks. CoRR, abs/1705.07393.
  • Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2015. Molding CNNs for text: Non-linear, non-consecutive convolutions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. 2017. Deriving neural architectures from sequence and graph kernels. In International Conference on Machine Learning.
  • Liangyou Li, Xiaofeng Wu, Santiago Cortes Vaillo, Jun Xie, Andy Way, and Qun Liu. 2014. The DCU-ICTCAS MT system at WMT 2014 on German-English translation task. In Proceedings of the Ninth Workshop on Statistical Machine Translation.
  • Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the International Conference on Computational Linguistics - Volume 1.
  • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Gábor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. CoRR, abs/1707.05589.
  • Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. CoRR, abs/1803.08240.
  • Yajie Miao, Jinyu Li, Yongqiang Wang, Shi-Xiong Zhang, and Yifan Gong. 2016. Simplifying long short-term memory acoustic models for fast training and decoding. In IEEE International Conference on Acoustics, Speech and Signal Processing.
  • Dipendra Misra, John Langford, and Yoav Artzi. 2017. Mapping instructions and visual observations to actions with reinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Asier Mujika, Florian Meier, and Angelika Steger. 2017. Fast-slow recurrent neural networks. In Advances in Neural Information Processing Systems.
  • Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Stephan Peitz, Joern Wuebker, Markus Freitag, and Hermann Ney. 2014. The RWTH Aachen German-English machine translation system for WMT 2014. In Proceedings of the Ninth Workshop on Statistical Machine Translation.
  • Hao Peng, Roy Schwartz, Sam Thomson, and Noah A. Smith. 2018. Rational recurrences. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.
  • Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538.
  • Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Advances in Neural Information Processing Systems.
  • Alane Suhr and Yoav Artzi. 2018. Situated mapping of sequential instructions to actions with single-step reward observation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Alane Suhr, Srinivasan Iyer, and Yoav Artzi. 2018. Learning to map context-dependent sentences to executable formal queries. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.
  • Sida Wang and Christopher Manning. 2013. Fast dropout training. In International Conference on Machine Learning.
  • Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation.
  • Huijia Wu, Jiajun Zhang, and Chengqing Zong. 2016a. An empirical exploration of skip connections for sequential tagging. In Proceedings of the International Conference on Computational Linguistics.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016b. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.
  • Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R. Salakhutdinov. 2016c. On multiplicative integration with recurrent neural networks. In Advances in Neural Information Processing Systems.
  • Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR, abs/1409.2329.
  • Yingjie Zhang and Byron C. Wallace. 2017. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. In Proceedings of the International Joint Conference on Natural Language Processing.
  • Yuchen Zhang, Jason D. Lee, and Michael I. Jordan. 2016. ℓ1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning.
  • Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. In Proceedings of the International Joint Conference on Artificial Intelligence.
  • Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2017. Recurrent highway networks. In International Conference on Machine Learning.
  • Barret Zoph and Quoc V. Le. 2016. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578.