Language Model Prior for Low Resource Neural Machine Translation

Christos Baziotis

EMNLP 2020, pp. 7622–7634.

DOI: https://doi.org/10.18653/V1/2020.EMNLP-MAIN.615

Abstract:

The scarcity of large parallel corpora is an important obstacle for neural machine translation. A common solution is to exploit the knowledge of language models (LM) trained on abundant monolingual data. In this work, we propose a novel approach to incorporate a LM as prior in a neural translation model (TM). Specifically, we add a regularization term that pushes the output distributions of the TM to be probable under the LM prior, while allowing the TM to deviate from the LM when needed. Since the LM is used only at training time, the method does not affect decoding speed. Experiments on two low-resource machine translation datasets show clear improvements, even with limited monolingual data (3M sentences).
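One plausible way to write such a training objective, based only on the description on this page (λ is a weight balancing the two terms, and the KL term compares the LM and TM output distributions at each target position t; this is a sketch, not necessarily the paper's exact formulation):

    \mathcal{L}(\theta) = \underbrace{-\sum_{t} \log p_{\mathrm{TM}}(y_t \mid y_{<t}, x; \theta)}_{\text{translation loss}}
                        + \lambda \sum_{t} D_{\mathrm{KL}}\big( p_{\mathrm{LM}}(\cdot \mid y_{<t}) \,\big\|\, p_{\mathrm{TM}}(\cdot \mid y_{<t}, x; \theta) \big)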

Introduction
  • Neural machine translation (NMT) (Sutskever et al, 2014; Bahdanau et al, 2015; Vaswani et al, 2017) relies heavily on large parallel corpora (Koehn and Knowles, 2017) and needs careful hyperparameter tuning, in order to work in low-resource settings (Sennrich and Zhang, 2019).
  • Language models (LM) trained on target-side monolingual data have been used for years as priors in statistical machine translation (SMT) (Brown et al, 1993) via the noisy channel model (see the decomposition after this list).
  • This approach has been adapted to NMT with the neural noisy channel (Yu et al, 2017; Yee et al, 2019).
  • Neural noisy channel models face a computational challenge, because they model the “reverse translation probability” p(x|y)
  • They require multiple passes over the source sentence x as they generate the target sentence y, or rely on sophisticated architectures to reduce the number of passes.
  • Since the LM is part of the network, it has to be used during inference, which places a computational constraint on its size
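For reference, the noisy channel model mentioned above is the standard Bayes decomposition of the translation probability, in which the target-side LM p(y) plays the role of the prior:

    \hat{y} = \arg\max_{y} p(y \mid x)
            = \arg\max_{y} \frac{p(x \mid y)\, p(y)}{p(x)}
            = \arg\max_{y} \underbrace{p(x \mid y)}_{\text{reverse TM}} \; \underbrace{p(y)}_{\text{LM prior}}

Because the reverse model p(x | y) must score the source given each candidate target, decoding with this factorization is what makes neural noisy channel models computationally expensive.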
Highlights
  • Neural machine translation (NMT) (Sutskever et al, 2014; Bahdanau et al, 2015; Vaswani et al, 2017) relies heavily on large parallel corpora (Koehn and Knowles, 2017) and needs careful hyperparameter tuning, in order to work in low-resource settings (Sennrich and Zhang, 2019)
  • By increasing the temperature τ we expose extra information to the translation model (TM), because we reveal more of the low-probability words that the language model (LM) considers similar to the predicted word (see the sketch after this list)
  • We present a simple approach for incorporating knowledge from monolingual data into NMT
  • We use an LM trained on target-side monolingual data to regularize the output distributions of a TM
  • This method is more efficient than alternative approaches that use pretrained LMs, because the LM is not required during inference
  • We avoid the translation errors introduced by LM-fusion, because the TM is able to deviate from the prior when needed
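The sketch below illustrates how such an LM-prior regularizer can be implemented in PyTorch (the framework used, per the references). It is a minimal illustration under the assumptions stated in the comments (a KL penalty between temperature-softened distributions, added to the usual cross-entropy); tensor and argument names are hypothetical and this is not the authors' code.

    # Minimal sketch of an LM-prior regularizer (assumed formulation, not the
    # authors' code). `tm_logits` and `lm_logits` are hypothetical tensors of
    # shape (batch, seq_len, vocab); `gold` holds the reference token ids.
    import torch.nn.functional as F

    def lm_prior_loss(tm_logits, lm_logits, gold, pad_id, tau=2.0, lam=0.5):
        # Standard token-level cross-entropy (the usual translation loss).
        nll = F.cross_entropy(
            tm_logits.reshape(-1, tm_logits.size(-1)), gold.reshape(-1),
            ignore_index=pad_id)

        # Temperature-softened distributions: a larger tau reveals more of the
        # low-probability words that the LM considers plausible.
        tm_log_probs = F.log_softmax(tm_logits / tau, dim=-1)
        lm_probs = F.softmax(lm_logits.detach() / tau, dim=-1)  # LM stays frozen

        # KL(LM || TM) per target position: pushes the TM's output distribution
        # towards the LM prior, but only as a soft penalty, so the TM can still
        # deviate from the prior when the parallel data requires it.
        kl = F.kl_div(tm_log_probs, lm_probs, reduction="none").sum(-1)
        mask = gold.ne(pad_id).float()
        kl = (kl * mask).sum() / mask.sum()

        # The tau**2 rescaling follows the knowledge-distillation convention
        # (Hinton et al., 2015); lam balances the two terms.
        return nll + lam * (tau ** 2) * kl

Because the LM only provides soft targets during training, it can be arbitrarily large and is simply dropped at decoding time.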
Results
  • In all methods, the authors use LMs trained on the same amount of monolingual data (3M sentences).
  • The proposed approach yields gains of up to +1.8 BLEU over the strongest baseline, “Base+LS” (in DE→EN and EN→DE)
  • This shows that the proposed approach yields clear improvements, even with limited monolingual data (3M).
  • Penalizing confidence helps only up to a point, as shown by the performance gap between “Base+LS” and “Base+prior”.
  • The authors explore this further (§ 5)
Conclusion
  • The authors present a simple approach for incorporating knowledge from monolingual data into NMT.
  • The authors use an LM trained on target-side monolingual data to regularize the output distributions of a TM
  • This method is more efficient than alternative approaches that use pretrained LMs, because the LM is not required during inference.
  • The authors empirically show that, although this method works only by changing the training objective, it achieves better results than alternative LM-fusion techniques
  • It yields consistent performance gains even with modest monolingual data (3M sentences) across all translation directions.
  • This makes it useful for low-resource languages, where parallel and monolingual data are scarce
Tables
  • Table1: Dataset statistics after preprocessing
  • Table2: Hyperparameters of the TMs and LMs
  • Table3: Perplexity scores for LMs trained on each language’s monolingual data, computed on a small heldout validation set per language
  • Table4: BLEU scores of each model. Mean and stdev of 3 runs reported. The top section contains the main results, where all methods use LMs trained on the same amount of data (3M). The bottom section compares different configurations of the LM-prior. Underlined scores denote gains over the “Base + Prior (3M)” model
  • Table5: Hyperparameters of RNN-based TMs and LMs
  • Table6: Perplexity (PPL ↓) scores for LMs trained on each language’s monolingual data, computed on a small held-out validation set per language
  • Table7: BLEU scores of each RNN-NMT method. Mean and standard deviation of 3 runs reported
Related work
  • Most recent related work considers large pretrained models, either via transfer learning or feature-fusion. Zhu et al (2020); Clinchant et al (2019); Imamura and Sumita (2019) explore combinations of using BERT as initialization for NMT, or adding BERT’s representations as extra features. Yang et al (2019) address the problem of catastrophic forgetting while transferring BERT in high-resource settings, with a sophisticated fine-tuning approach. In concurrent work, Chen et al (2019) propose knowledge distillation using BERT for various text generation tasks, including NMT, by incentivizing sequence-to-sequence models to “look into the future”. However, our work addresses a different problem (low-resource NMT) and has a different motivation. Also, we consider auto-regressive LMs as priors, which have a clear interpretation, unlike BERT, which is not strictly an LM and requires bidirectional context. Note that large pretrained LMs, such as BERT or GPT-2, have not yet achieved in NMT the transformative results that we observe in natural language understanding tasks (e.g., the GLUE benchmark (Wang et al, 2019)).
Funding
  • This work was conducted within the scope of the Research and Innovation Action Gourmet, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299
  • It was also supported by the UK Engineering and Physical Sciences Research Council fellowship grant EP/S001271/1 (MTStretch)
  • It was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (http://www.csd3.cam.ac.uk/), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/P020259/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk)
Study subjects and analysis
We use the official WMT-2017 and 2018 test sets as the development and test set, respectively. As monolingual data for English and German we use the News Crawls 2016 articles (Bojar et al, 2016) and for Turkish we concatenate all the available News Crawls data from 2010-2018, which contain 3M sentences. For English and German we subsample 3M sentences to match the Turkish data, as well as 30M to measure the effect of stronger LMs

Reference
  • Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Proceedings of the Advances in Neural Information Processing Systems, pages 2654–2662, Montreal, Quebec, Canada.
  • Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA.
  • Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, and Alexandros Potamianos. 2019. SEQ3: Differentiable sequence-to-sequence-to-sequence autoencoder for unsupervised abstractive sentence compression. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 673–681, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
  • Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the Conference on Machine Translation, pages 131–198, Berlin, Germany.
  • Ondrej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the conference on machine translation (WMT). In Proceedings of the Conference on Machine Translation, pages 272–303, Belgium, Brussels.
  • Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
  • Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541, Philadelphia, PA, USA.
  • Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu. 2019. Distilling the knowledge of BERT for text generation. arXiv preprint arXiv:1911.03829.
  • Jang Hyun Cho and Bharath Hariharan. 2019. On the efficacy of knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4794–4802.
  • Stephane Clinchant, Kweon Woo Jung, and Vassilina Nikoulina. 2019. On the use of BERT for neural machine translation. In Proceedings of the Workshop on Neural Generation and Translation, pages 108–117, Stroudsburg, PA, USA.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Minnesota.
  • Tobias Domhan and Felix Hieber. 2017. Using target-side monolingual data for neural machine translation through multi-task learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1500–1505, Copenhagen, Denmark.
  • Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11(Jul):2001–2049.
  • Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy. PMLR.
  • Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
  • Serhii Havrylov and Ivan Titov. 2017. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Proceedings of the Advances in Neural Information Processing Systems, pages 2149–2159.
  • Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the Workshop on Neural Machine Translation and Generation, pages 18–24, Melbourne, Australia.
  • Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Kenji Imamura and Eiichiro Sumita. 2019. Recycling a pre-trained BERT encoder for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 23–31, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In Proceedings of the International Conference on Learning Representations, Toulon, France.
  • Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas, USA.
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA.
  • Philipp Koehn. 2010. Statistical Machine Translation, 1st edition. Cambridge University Press, New York, NY, USA.
  • Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the Workshop on Neural Machine Translation, pages 28–39, Vancouver, Canada.
  • Julia Kreutzer, Joost Bastings, and Stefan Riezler. 2019. Joey NMT: A minimalist NMT toolkit for novices. In Proceedings of EMNLP-IJCNLP 2019: System Demonstrations.
  • Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 66–71, Brussels, Belgium.
  • Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal.
  • Yishu Miao and Phil Blunsom. 2016. Language as a Latent Variable: Discrete Generative Models for Sentence Compression. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 319–328, Austin, Texas.
  • Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and Jan Cernocky. 2011. RNNLM - recurrent neural network language modeling toolkit. In Proceedings of the ASRU Workshop, pages 196–201.
  • Rafael Muller, Simon Kornblith, and Geoffrey E. Hinton. 2019. When does label smoothing help? In Proceedings of the Advances in Neural Information Processing Systems, pages 4696–4705.
  • Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of self-attention.
  • Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alche-Buc, E. Fox, and R. Garnett, editors, Proceedings of the Advances in Neural Information Processing Systems, pages 8024–8035.
  • Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Conference on Machine Translation, pages 186–191, Brussels, Belgium.
  • Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, pages 157–163, Valencia, Spain.
  • Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Prajit Ramachandran, Peter Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 383–391, Copenhagen, Denmark.
  • Shuo Ren, Zhirui Zhang, Shujie Liu, Ming Zhou, and Shuai Ma. 2019. Unsupervised neural machine translation with SMT as posterior regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 241–248.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 86–96, Berlin, Germany.
  • Rico Sennrich and Biao Zhang. 2019. Revisiting low-resource neural machine translation: A case study. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 211–221, Florence, Italy.
  • Claude E. Shannon and Warren Weaver. 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana, IL.
  • Ivan Skorokhodov, Anton Rykachevskiy, Dmitry Emelyanenko, Sergey Slotin, and Anton Ponkratov. 2018. Semi-supervised neural machine translation with language models. In Proceedings of the AMTA Workshop on Technologies for MT of Low Resource Languages, pages 37–44, Boston, MA. Association for Machine Translation in the Americas.
  • Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. 2018. Cold fusion: Training seq2seq models together with language models. In Proceedings of Interspeech, pages 387–391.
  • Felix Stahlberg, James Cross, and Veselin Stoyanov. 2018. Simple fusion: Return of the language model. In Proceedings of the Conference on Machine Translation, pages 204–211, Belgium, Brussels.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, pages 5998–6008, Long Beach, CA, USA.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
  • Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. 2020. On layer normalization in the transformer architecture.
  • Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Yong Yu, Weinan Zhang, and Lei Li. 2019. Towards making the most of BERT in neural machine translation.
  • Kyra Yee, Yann Dauphin, and Michael Auli. 2019. Simple and effective noisy channel modeling for neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, pages 5700–5705, Hong Kong, China.
  • Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomas Kocisky. 2017. The neural noisy channel.
  • Lei Yu, Laurent Sartran, Wojciech Stokowiec, Wang Ling, Lingpeng Kong, Phil Blunsom, and Chris Dyer. 2019. Putting machine translation in context with the noisy channel model.
  • Jiacheng Zhang, Yang Liu, Huanbo Luan, Jingfang Xu, and Maosong Sun. 2017. Prior knowledge integration for neural machine translation using posterior regularization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1514–1523, Vancouver, Canada.
  • Sanqiang Zhao, Raghav Gupta, Yang Song, and Denny Zhou. 2019. Extreme language model compression with optimal subwords and shared projections.
  • Chunting Zhou, Jiatao Gu, and Graham Neubig. 2020. Understanding knowledge distillation in non-autoregressive machine translation. In International Conference on Learning Representations.
  • Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. 2020. Incorporating BERT into neural machine translation. In International Conference on Learning Representations.
RNN model details
  • Model configuration: We employ the attentional encoder-decoder architecture (Bahdanau et al., 2015), using the “global” attention mechanism (Luong et al., 2015). The recurrent cells are implemented using Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) units, with a bidirectional LSTM encoder and a unidirectional LSTM decoder. We also tie the embedding and the output (projection) layers of the decoders (Press and Wolf, 2017; Inan et al., 2017), and apply layer normalization (Ba et al., 2016) to the last decoder representation, before the softmax (a rough sketch of these two details follows below).
  • Training details: We did not do any hyperparameter tuning, but selected hyperparameter values based on Sennrich and Zhang (2019), while also trying to keep approximately the same number of parameters as the Transformer-based counterparts. Table 5 lists all the model hyperparameters. All models were optimized with Adam (Kingma and Ba, 2015), with a learning rate of 0.0002 and mini-batches of 2000 tokens.
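The snippet below is a rough sketch (not the authors' implementation) of the two decoder details mentioned above: tying the embedding and output-projection weights, and applying layer normalization to the last decoder representation before the softmax. Class and argument names are illustrative.

    import torch.nn as nn

    class TiedOutputHead(nn.Module):
        # Illustrative decoder output layer with weight tying and layer norm.
        def __init__(self, vocab_size, hidden_size, emb_size):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_size)
            # Bridge the LSTM state down to the embedding size so the output
            # projection can share its weight matrix with the embedding.
            self.bridge = nn.Linear(hidden_size, emb_size)
            self.norm = nn.LayerNorm(emb_size)  # applied before the softmax
            self.out = nn.Linear(emb_size, vocab_size, bias=False)
            self.out.weight = self.embed.weight  # weight tying

        def forward(self, decoder_state):
            # decoder_state: (batch, hidden_size) from the last decoder step.
            return self.out(self.norm(self.bridge(decoder_state)))  # vocab logits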