Incorporating BERT into Neural Machine Translation

ICLR, 2020.

Abstract:

The recently proposed BERT (Devlin et al., 2019) has shown great power on a variety of natural language understanding tasks, such as text classification and reading comprehension. However, how to effectively apply BERT to neural machine translation (NMT) has not been sufficiently explored. While BERT is more commonly used for fine-tuning rather than as contextual embeddings for downstream language understanding tasks, …
Introduction
  • Pre-training techniques such as ELMo (Peters et al., 2018), GPT/GPT-2 (Radford et al., 2018; 2019), BERT (Devlin et al., 2019), the cross-lingual language model (Lample & Conneau, 2019), XLNet (Yang et al., 2019b) and RoBERTa (Liu et al., 2019) have attracted increasing attention in the machine learning and natural language processing communities.
  • These models are first pre-trained on large amounts of unlabeled data to capture rich representations of the input, and are then applied to downstream tasks either by providing context-aware embeddings of an input sequence (Peters et al., 2018) or by initializing the parameters of the downstream model for fine-tuning (Devlin et al., 2019).
  • The authors use Transformer as the basic architecture of their model.
Highlights
  • Recently, pre-training techniques such as ELMo (Peters et al., 2018), GPT/GPT-2 (Radford et al., 2018; 2019), BERT (Devlin et al., 2019), the cross-lingual language model (Lample & Conneau, 2019), XLNet (Yang et al., 2019b) and RoBERTa (Liu et al., 2019) have attracted increasing attention in the machine learning and natural language processing communities
  • A Neural Machine Translation (NMT) model usually consists of an encoder that maps an input sequence to hidden representations and a decoder that decodes these hidden representations to generate a sentence in the target language
  • Inspired by dropout (Srivastava et al., 2014) and drop-path (Larsson et al., 2017), which regularize network training, we propose a drop-net trick to ensure that the features output by BERT and by the conventional encoder are both fully utilized
  • Our approach is a more effective way of leveraging the features from the pre-trained model: (1) the output features of the pre-trained model are fused into all layers of the NMT module, ensuring that the pre-trained features are fully exploited; (2) we use an attention model to bridge the NMT module and the pre-trained BERT features, so that the NMT module can adaptively determine how to leverage the features from BERT (see the sketch after this list)
  • With our proposed BERT-fused model, we achieve 38.27, 35.62, 36.02 and 33.20 BLEU scores on the four tasks, setting state-of-the-art results on these tasks
  • We propose an effective approach, the BERT-fused model, to combine BERT and NMT, where BERT is leveraged by the encoder and decoder through attention models
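
For illustration, here is a minimal PyTorch sketch of how one encoder layer could fuse self-attention over the NMT hidden states with attention over fixed BERT features, including the drop-net trick described above. It is a sketch under assumptions, not the authors' released implementation: the class and argument names (BertFusedEncoderLayer, p_net, bert_dim), the linear projection of BERT outputs to the model dimension, and the post-norm residual layout are illustrative choices.

```python
import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    """One encoder layer that averages self-attention and BERT-encoder attention, with drop-net."""

    def __init__(self, d_model=512, bert_dim=768, nhead=8, dim_ff=2048, p_net=1.0):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.bert_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.bert_proj = nn.Linear(bert_dim, d_model)  # assumed projection of BERT features to d_model
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(),
                                 nn.Linear(dim_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.p_net = p_net  # drop-net rate

    def forward(self, x, bert_out):
        h_s, _ = self.self_attn(x, x, x)          # attention over the NMT encoder states
        b = self.bert_proj(bert_out)
        h_b, _ = self.bert_attn(x, b, b)          # attention over the (fixed) BERT output
        if self.training:
            u = torch.rand(()).item()
            if u < self.p_net / 2:                # with prob p/2: keep only the self-attention branch
                fused = h_s
            elif u < self.p_net:                  # with prob p/2: keep only the BERT-attention branch
                fused = h_b
            else:                                 # with prob 1 - p: average the two branches
                fused = 0.5 * (h_s + h_b)
        else:                                     # at inference, always use the average
            fused = 0.5 * (h_s + h_b)
        x = self.norm1(x + fused)
        return self.norm2(x + self.ffn(x))

# Usage: layer = BertFusedEncoderLayer(); layer(torch.rand(2, 7, 512), torch.rand(2, 9, 768))
```
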
Results
  • The results of IWSLT translation tasks are reported in Table 2.
  • The authors' proposed BERT-fused model can improve the BLEU scores of the five tasks by 1.88, 1.47, 2.4, 1.9 and 2.8 points respectively, demonstrating the effectiveness of the method.
  • The authors can see that introducing contextual information from an additional encoder can boost the sentence-level baselines, but the improvement is limited (0.33 for En→De and 0.31 for De→En).
  • Combining the BERT-fused model and document-level information, the authors eventually achieve 31.02 BLEU for En→De and 36.69 for De→En. The authors also perform a significance test between sentence-level and document-level translation.
Conclusion
  • Comparison with ELMo: as introduced in Section 2, ELMo (Peters et al., 2018) provides context-aware embeddings for the encoder in order to capture richer information about the input sequence.
  • The authors propose an effective approach, the BERT-fused model, to combine BERT and NMT, where BERT is leveraged by the encoder and decoder through attention models.
  • There are some contemporary works that leverage knowledge distillation to combine pre-trained models with NMT (Yang et al., 2019a; Chen et al., 2019), which is a direction worth exploring.
Tables
  • Table 1: Preliminary explorations on IWSLT’14 English→German translation
  • Table 2: BLEU scores of all IWSLT tasks (Transformer vs. BERT-fused)
  • Table 3: BLEU scores of WMT’14 translation
  • Table 4: BLEU scores of document-level translation (En→De, De→En)
  • Table 5: BLEU scores of WMT’16 Ro→En
  • Table 6: Ablation study on IWSLT’14 En→De
  • Table 7: BLEU scores of unsupervised NMT
  • Table 8: More ablation study on IWSLT’14 En→De
  • Table 9: More ablation study on IWSLT’14 De→En
  • Table 10: BLEU scores of IWSLT translation tasks (En→De, De→En, En→Es)
  • Table 11: Previous results on IWSLT’14 De→En
  • Table 12: BLEU scores of IWSLT’14 En←De by back translation (BT)
  • Table 13: Comparison of inference time (seconds); ‘+’ denotes the relative increase in inference time
Funding
  • Proposes a new algorithm named BERT-fused model, which first uses BERT to extract representations for an input sequence; these representations are then fused with each layer of the encoder and decoder of the NMT model through attention mechanisms (see the sketch after this list)
  • Uses BERT to provide context-aware embeddings for the NMT model, and finds that this strategy outperforms using BERT to initialize the NMT model
  • Proposes a new algorithm, the BERT-fused model, which exploits the representations from BERT by feeding them into all layers rather than using them as input embeddings only
  • Introduces the background of NMT and reviews current pre-training techniques
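
To make the first step of this recipe concrete, the sketch below shows how fixed BERT representations for a source sentence could be extracted with the HuggingFace transformers library before being fed to the fusion layers. The checkpoint name is an assumption for illustration, not necessarily the model used in the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained BERT; the BERT module stays frozen in the BERT-fused model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # illustrative checkpoint
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
bert.eval()

src_sentences = ["Ein Beispielsatz für die Quellseite ."]
with torch.no_grad():
    batch = tokenizer(src_sentences, return_tensors="pt", padding=True)
    H_B = bert(**batch).last_hidden_state  # (batch, src_len, 768): features fused into every NMT layer

print(H_B.shape)
```
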
Reference
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015. URL https://arxiv.org/pdf/1409.0473v7.pdf.
  • Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 1–61, Florence, Italy, August 2019. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W19-5301.
  • Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, pp. 57, 2014.
  • Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu. Distilling the knowledge of BERT for text generation. arXiv preprint arXiv:1911.03829, 2019.
  • Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pp. 3079–3087, 2015.
  • Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. Latent alignment and variational attention. In Advances in Neural Information Processing Systems, pp. 9712–9724, 2018.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019. URL https://arxiv.org/pdf/1810.04805.pdf.
  • Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. Classical structured prediction losses for sequence to sequence learning. NAACL, 2018.
  • Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Artificial Intelligence and Statistics, pp. 153–160, 2009.
  • Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1243–1252. JMLR.org, 2017.
  • Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.
  • Marcin Junczys-Dowmunt and Roman Grundkiewicz. MS-UEdin submission to the WMT2018 APE shared task: Dual-source transformer for automatic post-editing. In EMNLP 2018 Third Conference on Machine Translation (WMT18), 2018.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. NeurIPS, 2019.
  • Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755, 2018.
  • Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. ICLR, 2017. URL https://arxiv.org/pdf/1605.07648.pdf.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. Document-level neural machine translation with hierarchical attention networks. arXiv preprint arXiv:1809.01576, 2018.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.
  • Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In EMNLP 2018 Third Conference on Machine Translation (WMT18), 2018.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, Volume 2, pp. 371–376, 2016a. URL http://www.statmt.org/wmt16/pdf/W16-2323.pdf.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. ACL, 2016b. URL https://aclweb.org/anthology/P16-1009.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. ACL, 2016c.
  • David So, Quoc Le, and Chen Liang. The evolved transformer. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5877–5886, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/so19a.html.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5926–5936, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/song19d.html.
  • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
  • Yiren Wang, Yingce Xia, Tianyu He, Fei Tian, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Multi-agent dual learning. ICLR, 2019.
  • Dirk Weissenborn, Douwe Kiela, Jason Weston, and Kyunghyun Cho. Contextualized role interaction for neural machine translation, 2019. URL https://openreview.net/forum?id=ryx3_iAcY7.
  • Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkVhlh09tX.
  • Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Lai Jian-Huang, and Tie-Yan Liu. Learning to teach with dynamic loss functions. In Advances in Neural Information Processing Systems, pp. 6466–6477, 2018.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. Tied transformers: Neural machine translation with shared encoder and decoder. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5466–5473, 2019.
  • Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Yong Yu, Weinan Zhang, and Lei Li. Towards making the most of BERT in neural machine translation. arXiv preprint arXiv:1908.05672, 2019a. URL https://arxiv.org/pdf/1908.05672.pdf.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019b.
  • The IWSLT’14 English-to-German data and model configuration are introduced in Section A.1. For the training strategy, we use Adam (Kingma & Ba, 2014) to optimize the network with β1 = 0.9, β2 = 0.98 and weight decay = 0.0001. The learning rate scheduler is inverse sqrt, where warmup-init-lr = 1e-7, warmup-updates = 4000 and max-lr = 0.0005 (see the sketch below).
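
As an illustration of the schedule just described, the following minimal PyTorch sketch implements linear warmup from warmup-init-lr to max-lr over 4,000 updates followed by inverse-square-root decay, paired with Adam using the reported hyperparameters. It is a sketch of the described settings, not the authors' training script, and the stand-in model exists only to make the example runnable.

```python
import math
import torch

WARMUP_INIT_LR, MAX_LR, WARMUP_UPDATES = 1e-7, 5e-4, 4000

def inverse_sqrt_lr(step: int) -> float:
    """Linear warmup to MAX_LR, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < WARMUP_UPDATES:
        return WARMUP_INIT_LR + (MAX_LR - WARMUP_INIT_LR) * step / WARMUP_UPDATES
    return MAX_LR * math.sqrt(WARMUP_UPDATES / step)

model = torch.nn.Linear(8, 8)  # stand-in for the NMT parameters
optimizer = torch.optim.Adam(model.parameters(), lr=MAX_LR,
                             betas=(0.9, 0.98), weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: inverse_sqrt_lr(step) / MAX_LR)
# Call scheduler.step() once per update to follow the schedule.
```
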
  • We leverage one Transformer model with the GELU activation function to handle translation in both directions, where each language is associated with a language tag. The embedding dimension, FFN layer dimension and number of layers are 1024, 4096 and 6, respectively. BERT is initialized with the pre-trained XLM model provided by Lample & Conneau (2019).
  • We use XLM to initialize the model for the WMT’14 English→German translation task, whose training corpus is relatively large. We eventually obtain 28.09 after 90 epochs, which still underperforms the baseline of 29.12 that we obtained. A similar problem is also reported in https://github.com/facebookresearch/XLM/issues/32. We leave the improvement of supervised NMT with XLM as future work.
  • Junczys-Dowmunt & Grundkiewicz (2018) proposed a new way to handle multiple attention models. Instead of using Eqn. (2), the input is processed by self-attention, encoder-decoder attention and BERT-decoder attention sequentially. Formally, $s^l_t = \mathrm{attn}_S(s^{l-1}_t, S^{l-1}_{<t+1}, S^{l-1}_{<t+1})$; $s^l_t = \mathrm{attn}_E(s^l_t, H^L_E, H^L_E)$; $s^l_t = \mathrm{attn}_B(s^l_t, H_B, H_B)$. (5)
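
For illustration, a minimal PyTorch sketch of such a layer, applying self-attention, encoder-decoder attention and BERT-decoder attention one after another as in Eqn. (5), could look as follows. Module names, the projection of BERT features to the model dimension, and the post-norm residual layout are assumptions rather than the exact implementation compared against in the paper.

```python
import torch
import torch.nn as nn

class SequentialAttnDecoderLayer(nn.Module):
    """Decoder layer with three attention blocks applied sequentially (cf. Eqn. (5))."""

    def __init__(self, d_model=512, bert_dim=768, nhead=8, dim_ff=2048):
        super().__init__()
        self.attn_s = nn.MultiheadAttention(d_model, nhead, batch_first=True)  # self-attention
        self.attn_e = nn.MultiheadAttention(d_model, nhead, batch_first=True)  # encoder-decoder attention
        self.attn_b = nn.MultiheadAttention(d_model, nhead, batch_first=True)  # BERT-decoder attention
        self.bert_proj = nn.Linear(bert_dim, d_model)  # assumed projection of BERT features
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(),
                                 nn.Linear(dim_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, s, h_enc, h_bert, causal_mask=None):
        # s_t^l = attn_S(s_t^{l-1}, S_{<t+1}^{l-1}, S_{<t+1}^{l-1})
        x, _ = self.attn_s(s, s, s, attn_mask=causal_mask)
        s = self.norms[0](s + x)
        # s_t^l = attn_E(s_t^l, H_E^L, H_E^L)
        x, _ = self.attn_e(s, h_enc, h_enc)
        s = self.norms[1](s + x)
        # s_t^l = attn_B(s_t^l, H_B, H_B)
        b = self.bert_proj(h_bert)
        x, _ = self.attn_b(s, b, b)
        s = self.norms[2](s + x)
        # Position-wise feed-forward with residual connection
        return self.norms[3](s + self.ffn(s))
```
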
  • (2) We also compare the results of our approach with ensemble methods. To get an M-model ensemble, we independently train M models with different random seeds (M ∈ Z+). We ensemble both standard Transformers and our BERT-fused models, which are denoted as M-model ensemble (standard) and M-model ensemble (BERT-fused), respectively. Please note that when we aggregate multiple BERT-fused models, we only need to store one replica of the BERT model because the BERT part is not optimized.
  • 2. We also compare our results to ensemble methods. Indeed, ensembling significantly boosts the baseline by more than one point. However, even with an ensemble of four models, the BLEU score is still lower than that of our BERT-fused model (30.18 vs. 30.45), which shows the effectiveness of our method.
  • 1. In BT, the monolingual data from the target side is leveraged. In our proposed approach, we use a BERT model of the source language, which indirectly leverages the monolingual data from the source side. In this way, our approach and BT are complementary to each other. In Section 5.4, we have already verified that our method can further improve the results of standard BT on Romanian-to-English translation.
  • 2. To use BT, we have to train a reversed translation model and then back-translate the monolingual data, which is time-consuming due to the decoding process. With the BERT-fused model, we only need to download a pre-trained BERT model, incorporate it into our model and continue training. Besides, the BERT module is fixed during training.