Sequence to Sequence Learning with Neural Networks.

Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014: 3104–3112

Cited by: 14668 | Views: 697

Abstract

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. […]

Introduction
  • Deep Neural Networks (DNNs) are extremely powerful machine learning models that achieve excellent performance on difficult problems such as speech recognition [13, 7] and visual object recognition [19, 6, 21, 20].
  • If there exists a parameter setting of a large DNN that achieves good results, supervised backpropagation will find these parameters and solve the problem.
Highlights
  • Deep Neural Networks (DNNs) are extremely powerful machine learning models that achieve excellent performance on difficult problems such as speech recognition [13, 7] and visual object recognition [19, 6, 21, 20].
  • Large Deep Neural Networks can be trained with supervised backpropagation whenever the labeled training set has enough information to specify the network's parameters.
  • Our best results are obtained with an ensemble of Long Short-Term Memory (LSTM) networks that differ in their random initializations and in the random order of minibatches.
  • We showed that a large deep Long Short-Term Memory with a limited vocabulary can outperform a standard SMT-based system whose vocabulary is unlimited on a large-scale MT task.
  • We were initially convinced that the Long Short-Term Memory would fail on long sentences due to its limited memory, and other researchers reported poor performance on long sentences with a model similar to ours [5, 2, 26].
  • LSTMs trained on the reversed dataset had little difficulty translating long sentences.
Methods
  • The evaluated configurations include a single forward LSTM (beam size 12), a single reversed LSTM (beam size 12), and ensembles of 2 or 5 reversed LSTMs with beam sizes of 1, 2, and 12; their test BLEU scores are reported in Table 1 (a minimal beam-search sketch follows this list).
  • Table 2 additionally compares against the baseline system [29] (33.30 test BLEU), the system of Cho et al [5], and the state of the art [9].
  • The LSTM is within 0.5 BLEU points of the previous state of the art by rescoring the 1000-best list of the baseline system.
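
A minimal beam-search sketch, illustrating how the beam sizes compared above enter the decoding procedure. This is not the authors' GPU decoder; log_probs_fn, bos, and eos are hypothetical stand-ins for the (possibly ensembled) LSTM decoder's next-token log-probabilities and its special start/end tokens.

    def beam_search(log_probs_fn, bos, eos, beam_size=12, max_len=50):
        """Left-to-right beam search over a next-token scorer.

        log_probs_fn(prefix) returns a dict mapping candidate next tokens to
        log-probabilities; for an ensemble, these would be averaged model scores.
        """
        beams = [([bos], 0.0)]  # (token prefix, cumulative log-probability)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                for token, logp in log_probs_fn(prefix).items():
                    candidates.append((prefix + [token], score + logp))
            # Keep only the beam_size highest-scoring partial hypotheses.
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = []
            for prefix, score in candidates[:beam_size]:
                (finished if prefix[-1] == eos else beams).append((prefix, score))
            if not beams:
                break
        return max(finished + beams, key=lambda c: c[1])

With beam_size=1 this reduces to greedy decoding; Table 1 compares beams of size 1, 2, and 12 for single and ensembled LSTMs.
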
Results
  • The authors computed the BLEU scores by running multi-bleu.pl on the tokenized predictions and ground truth (a minimal invocation is sketched after this list).
  • This way of evaluating the BLEU score is consistent with [5] and [2], and reproduces the 33.3 score of [29].
  • While the decoded translations of the LSTM ensemble do not beat the state of the art, this is the first time that a pure neural translation system has outperformed a phrase-based SMT baseline on a large-scale MT task by a sizeable margin.
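
The evaluation step described above can be reproduced roughly as follows; the file names are hypothetical, and multi-bleu.pl is assumed to be the Moses-style script that reads tokenized hypotheses from stdin and takes the tokenized reference file as an argument.

    import subprocess

    # Hypothetical file names for the tokenized predictions and references.
    with open("predictions.tok.fr") as hypotheses:
        report = subprocess.run(
            ["perl", "multi-bleu.pl", "reference.tok.fr"],
            stdin=hypotheses,
            capture_output=True,
            text=True,
            check=True,
        ).stdout
    print(report)  # prints the BLEU score line computed on the tokenized text
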
Conclusion
  • The authors showed that a large deep LSTM with a limited vocabulary can outperform a standard SMT-based system whose vocabulary is unlimited on a large-scale MT task.
  • The success of the simple LSTM-based approach on MT suggests that it should do well on many other sequence learning problems, provided they have enough training data.
  • While the authors were unable to train a standard RNN on the non-reversed translation problem, they believe that a standard RNN should be trainable when the source sentences are reversed.
  • The authors were surprised by the ability of the LSTM to correctly translate very long sentences.
  • LSTMs trained on the reversed dataset had little difficulty translating long sentences.
Summary
  • Deep Neural Networks (DNNs) are extremely powerful machine learning models that achieve excellent performance on difficult problems such as speech recognition [13, 7] and visual object recognition [19, 6, 21, 20].
  • The second LSTM is essentially a recurrent neural network language model [28, 23, 30] except that it is conditioned on the input sequence.
  • A useful property of the LSTM is that it learns to map an input sentence of variable length into a fixed-dimensional vector representation.
  • We used two different LSTMs: one for the input sequence and another for the output sequence, because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously [18] (a minimal sketch of this two-LSTM architecture follows this list).
  • While the decoded translations of the LSTM ensemble do not beat the state of the art, this is the first time that a pure neural translation system has outperformed a phrase-based SMT baseline on a large-scale MT task by a sizeable margin.
  • The simplest and most effective way of applying a Feedforward Neural Network Language Model (NNLM) [3] to an MT task is by rescoring the n-best lists of a strong MT baseline [22], which reliably improves translation quality.
  • Examples of this work include Auli et al [1], who combine an NNLM with a topic model of the input sentence, which improves rescoring performance.
  • Devlin et al [8] followed a similar approach, but they incorporated their NNLM into the decoder of an MT system and used the decoder’s alignment information to provide the NNLM with the most useful words in the input sentence.
  • Although Cho et al [5] used an LSTM-like RNN architecture to map sentences into vectors and back, their primary focus was on integrating their neural network into an SMT system.
  • Bahdanau et al [2] attempted direct translations with a neural network that used an attention mechanism to overcome the poor performance on long sentences experienced by Cho et al [5] and achieved encouraging results.
  • Pouget-Abadie et al [26] attempted to address the memory problem of Cho et al [5] by translating pieces of the source sentence in a way that produces smooth translations, which is similar to a phrase-based approach.
  • End-to-end training is the focus of Hermann et al [12], whose model represents the inputs and outputs by feedforward networks and maps them to similar points in space.
  • The success of our simple LSTM-based approach on MT suggests that it should do well on many other sequence learning problems, provided they have enough training data.
  • LSTMs trained on the reversed dataset had little difficulty translating long sentences.
  • While we were unable to train a standard RNN on the non-reversed translation problem, we believe that a standard RNN should be trainable when the source sentences are reversed.
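
A minimal sketch of the two-LSTM architecture summarized above: one LSTM encodes the (reversed) variable-length source sentence into a fixed-dimensional state, and a second LSTM, conditioned on that state, acts as a language model over the target sentence. This is an illustrative PyTorch sketch under assumed layer sizes, not the authors' original deep multi-GPU implementation.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        """Two separate LSTMs: an encoder for the source and a decoder for the target."""

        def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512, layers=2):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb_dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
            # One LSTM reads the (reversed) source sentence...
            self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=layers, batch_first=True)
            # ...and a second LSTM, initialized with the encoder's final state,
            # acts as a conditional language model over the target sentence.
            self.decoder = nn.LSTM(emb_dim, hidden_dim, num_layers=layers, batch_first=True)
            self.proj = nn.Linear(hidden_dim, tgt_vocab)

        def forward(self, src_tokens, tgt_tokens):
            # The source sentence, whatever its length, is compressed into the
            # encoder's final (hidden, cell) state: a fixed-dimensional representation.
            _, state = self.encoder(self.src_emb(src_tokens))
            # The decoder predicts the target sequence conditioned on that state
            # (teacher forcing during training).
            dec_out, _ = self.decoder(self.tgt_emb(tgt_tokens), state)
            return self.proj(dec_out)  # logits over the target vocabulary

    # Example with hypothetical vocabulary sizes and batch shapes.
    model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
    src = torch.randint(0, 1000, (4, 7))   # 4 source sentences of 7 token ids each
    tgt = torch.randint(0, 1200, (4, 9))   # corresponding target prefixes
    logits = model(src, tgt)               # shape: (4, 9, 1200)

Using separate encoder and decoder LSTMs, rather than one shared network, is what makes it natural to train on multiple language pairs at negligible extra cost, as noted in the summary.
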
Tables
  • Table 1: The performance of the LSTM on the WMT'14 English to French test set (ntst14). Note that an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.
  • Table 2: Methods that use neural networks together with an SMT system on the WMT'14 English to French test set (ntst14).
  • Table 3: A few examples of long translations produced by the LSTM alongside the ground truth translations. The reader can verify that the translations are sensible using Google Translate.
Related work
  • There is a large body of work on applications of neural networks to machine translation. So far, the simplest and most effective way of applying an RNN Language Model (RNNLM) [23] or a Feedforward Neural Network Language Model (NNLM) [3] to an MT task is by rescoring the n-best lists of a strong MT baseline [22], which reliably improves translation quality (a minimal rescoring sketch follows this section).

    More recently, researchers have begun to look into ways of including information about the source language into the NNLM. Examples of this work include Auli et al [1], who combine an NNLM with a topic model of the input sentence, which improves rescoring performance. Devlin et al [8] followed a similar approach, but they incorporated their NNLM into the decoder of an MT system and used the decoder’s alignment information to provide the NNLM with the most useful words in the input sentence. Their approach was highly successful and it achieved large improvements over their baseline.
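
A minimal sketch of the n-best rescoring strategy described in this section; neural_score and the interpolation weight are hypothetical stand-ins rather than the cited systems' actual scoring functions.

    def rescore_nbest(nbest, neural_score, weight=0.5):
        """Re-rank an n-best list from a baseline SMT system with a neural model.

        nbest: list of (candidate translation, baseline score) pairs,
               e.g. a 1000-best list produced by the baseline decoder.
        neural_score: callable returning the neural model's log-probability of a
               candidate (for a conditional model, given the source sentence).
        """
        rescored = [
            (candidate, (1 - weight) * base + weight * neural_score(candidate))
            for candidate, base in nbest
        ]
        # Best candidate first, according to the interpolated score.
        return sorted(rescored, key=lambda item: item[1], reverse=True)
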
Contributions
  • Presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure
  • Shows that a straightforward application of the Long Short-Term Memory architecture can solve general sequence to sequence problems
  • Introduced many short-term dependencies that made the optimization problem much simpler.
  • Found it extremely valuable to reverse the order of the words of the input sentence. This way, a is in close proximity to α, b is fairly close to β, and so on, a fact that makes it easy for SGD to "establish communication" between the input and the output. This simple data transformation greatly boosts the performance of the LSTM (see the sketch after this list).
  • Reports the accuracy of these translation methods, presents sample translations, and visualizes the resulting sentence representations.
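
The input-reversal transformation described in the last two items amounts to a one-line preprocessing step: reverse the word order of every source sentence and leave the targets untouched. A sketch with a hypothetical toy pair:

    def reverse_source(pairs):
        """Reverse each source sentence; targets are unchanged.

        This places the first source words close to the first target words,
        the short-term dependencies credited with simplifying optimization.
        """
        return [(list(reversed(src)), tgt) for src, tgt in pairs]

    # Hypothetical toy pair: source "a b c" aligned to target "alpha beta gamma".
    pairs = [(["a", "b", "c"], ["alpha", "beta", "gamma"])]
    print(reverse_source(pairs))  # [(['c', 'b', 'a'], ['alpha', 'beta', 'gamma'])]
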
References
  • [1] M. Auli, M. Galley, C. Quirk, and G. Zweig. Joint language and translation modeling with recurrent neural networks. In EMNLP, 2013.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, pages 1137–1155, 2003.
  • [4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
  • [5] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [6] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, 2012.
  • [7] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing - Special Issue on Deep Learning for Speech and Language Processing, 2012.
  • [8] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul. Fast and robust neural network joint models for statistical machine translation. In ACL, 2014.
  • [9] N. Durrani, B. Haddow, P. Koehn, and K. Heafield. Edinburgh's phrase-based machine translation systems for WMT-14. In WMT, 2014.
  • [10] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • [11] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.
  • [12] K. M. Hermann and P. Blunsom. Multilingual distributed representations without word alignment. In ICLR, 2014.
  • [13] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.
  • [14] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, Institut für Informatik, Technische Universität München, 1991.
  • [15] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. 2001.
  • [16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
  • [17] S. Hochreiter and J. Schmidhuber. LSTM can solve hard long time lag problems. 1997.
  • [18] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [20] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
  • [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • [22] T. Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.
  • [23] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048, 2010.
  • [24] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
  • [25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.
  • [26] J. Pouget-Abadie, D. Bahdanau, B. van Merrienboer, K. Cho, and Y. Bengio. Overcoming the curse of sentence length for neural machine translation using automatic segmentation. arXiv preprint arXiv:1409.1257, 2014.
  • [27] A. Razborov. On small depth threshold circuits. In Proc. 3rd Scandinavian Workshop on Algorithm Theory, 1992.
  • [28] D. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • [29] H. Schwenk. University Le Mans. http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/, 2014. [Online; accessed 03-September-2014].
  • [30] M. Sundermeyer, R. Schluter, and H. Ney. LSTM neural networks for language modeling. In INTERSPEECH, 2010.
  • [31] P. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 1990.