Non-Autoregressive Machine Translation with Auxiliary Regularization

AAAI, 2019.

Abstract:

As a new neural machine translation approach, Non-Autoregressive machine Translation (NAT) has attracted attention recently due to its high efficiency in inference. However, the high efficiency has come at the cost of not capturing the sequential dependency on the target side of translation, which causes NAT to suffer from two kinds of translation errors: repeated translation and incomplete translation.

Introduction
  • Neural Machine Translation (NMT) based on deep neural networks has made rapid progress over recent years (Cho et al 2014; Bahdanau, Cho, and Bengio 2014; Wu et al 2016; Vaswani et al 2017; Hassan et al 2018).
  • This high inference efficiency comes at the cost of largely sacrificed translation quality, since the intrinsic sequential dependency within the natural language sentence is abandoned (a minimal decoding sketch follows this list).
  • To mitigate such performance degradation, previous work has tried different ways of inserting intermediate discrete variables into the basic NAT model, so as to incorporate some lightweight sequential information into the non-autoregressive decoder.
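To make the contrast concrete, here is a minimal sketch (not the authors' implementation) of greedy autoregressive decoding versus one-shot non-autoregressive decoding; `encode`, `at_decoder`, and `nat_decoder` are hypothetical stand-ins introduced only for illustration.

```python
import torch

# Hypothetical stand-ins: encode(), at_decoder(), and nat_decoder() are
# illustrative interfaces, not components from the paper.

def autoregressive_decode(encode, at_decoder, src, max_len, bos_id, eos_id):
    """Greedy AT decoding: each target token is conditioned on all previously
    generated tokens, so decoding is inherently sequential."""
    memory = encode(src)                                  # source representations
    ys = [bos_id]
    for _ in range(max_len):
        logits = at_decoder(torch.tensor([ys]), memory)   # (1, len(ys), vocab)
        next_id = logits[0, -1].argmax().item()
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys

def non_autoregressive_decode(encode, nat_decoder, src, tgt_len):
    """NAT decoding: all target positions are predicted in one parallel pass,
    which is fast but drops the target-side sequential dependency."""
    memory = encode(src)
    logits = nat_decoder(memory, tgt_len)                 # (1, tgt_len, vocab)
    return logits.argmax(dim=-1)[0].tolist()
```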
Highlights
  • Neural Machine Translation systems are typically implemented in an encoder-decoder framework, in which the encoder network feeds representations of the source-side sentence x into the decoder network to generate the tokens of the target sentence y
  • The improvements over the baseline model are not brought merely by the stronger teacher models. In both WMT tasks, our model achieves better performance even with a weakened autoregressive translation (AT) teacher model that is on par with the teacher models used in previous works (e.g., 27.12 of NAT-REG vs. 25.48 of NAT-IR and 22.41 of NAT-FT on WMT14 De-En)
  • Case study: We present several translation examples sampled from the IWSLT14 De-En dataset in Table 2, including the source sentence, the target reference, the translation given by the teacher model (AT), by the basic NAT model with sequence-level distillation (NAT-BASE), and by our NAT model with the two regularization terms (NAT-REG)
  • We proposed two simple regularization strategies to improve the performance of non-autoregressive machine translation (NAT) models, the similarity regularization and the reconstruction regularization, which have been shown to be effective at addressing two major problems of NAT models, i.e., repeated translation and incomplete translation, respectively, leading to quite strong performance with fast decoding speed (a hedged sketch of both terms follows this list)
  • We plan to break the upper bound of the autoregressive teacher model and obtain better performance than the autoregressive Neural Machine Translation model, which is possible since Non-Autoregressive Translation models have no gap between training and inference (i.e., no exposure bias problem as in autoregressive sequence generation (Ranzato et al 2015))
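As a rough illustration of how such auxiliary terms can be attached to the NAT training objective, here is a minimal sketch assuming a cosine-similarity penalty on neighboring decoder states (against repeated translation) and a source-reconstruction cross-entropy term (against incomplete translation); the module names, the rescaling, and the weights alpha and beta are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def similarity_regularization(hidden):
    """Penalize similar neighboring decoder states to discourage repeated
    translation. hidden: (batch, tgt_len, dim). The (cos + 1) / 2 rescaling
    to [0, 1] is our choice, not necessarily the paper's."""
    cos = F.cosine_similarity(hidden[:, :-1], hidden[:, 1:], dim=-1)  # (batch, tgt_len-1)
    return ((cos + 1.0) / 2.0).mean()

def reconstruction_regularization(backward_decoder, hidden, src_tokens):
    """Reconstruct the source sentence from the decoder states so that no
    source content is dropped, discouraging incomplete translation.
    backward_decoder is a hypothetical module returning (batch, src_len, vocab) logits."""
    logits = backward_decoder(src_tokens, hidden)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           src_tokens.reshape(-1))

def nat_reg_loss(translation_loss, hidden, backward_decoder, src_tokens,
                 alpha=1.0, beta=1.0):
    """NAT translation loss plus the two auxiliary terms; alpha and beta are
    illustrative weights, not the paper's tuned values."""
    return (translation_loss
            + alpha * similarity_regularization(hidden)
            + beta * reconstruction_regularization(backward_decoder, hidden, src_tokens))
```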
Results
  • The authors report all the results in Table 1, from which the following conclusions can be drawn:
  • 1. On all the benchmark datasets except for IWSLT16 En-De, NAT-REG achieves the best translation quality.
  • 2. The improvements over the baseline model are not brought merely by the stronger teacher models.
  • In both WMT tasks, the model achieves better performance even with a weakened AT teacher model that is on par with the teacher models used in previous works (e.g., 27.12 of NAT-REG vs. 25.48 of NAT-IR and 22.41 of NAT-FT on WMT14 De-En).
  • It is clear that the proposal to replace hard-to-optimize discrete variables with simple regularization terms helps obtain better NAT models
Conclusion
  • While the two regularization strategies were proposed to improve NAT models, they may be generally applicable to other sequence generation models.
  • Exploring such potential would be an interesting direction for future research.
  • The authors plan to break the upper bound of the autoregressive teacher model and obtain better performance than the autoregressive NMT model, which is possible since NAT models have no gap between training and inference (i.e., no exposure bias problem as in autoregressive sequence generation (Ranzato et al 2015))
Tables
  • Table 1: The test set performances of AT and NAT models in BLEU score. NAT-FT, NAT-IR and LT denote the baseline methods in (Gu et al 2018), (Lee, Mansimov, and Cho 2018) and (Kaiser et al 2018), respectively. NAT-REG is our proposed NAT with the two regularization terms, and Transformer (NAT-REG) is correspondingly our AT teacher model. ‘Weak Teacher’ or ‘WT’ refers to the NAT trained with a weakened teacher comparable to prior works. All AT models are decoded with a beam size of 4. ‘†’ denotes baselines from our reproduction. ‘–’ denotes the same numbers as above/below
  • Table 2: Translation examples from the IWSLT14 De-En task. The AT result is decoded with a beam size of 4 and NAT results are generated by re-scoring 9 candidates (see the rescoring sketch after this list). Italic fonts indicate translation pieces where the NAT has the issue of incomplete translation, and bold fonts indicate the issue of repeated translation
  • Table 3: Ablation study on the IWSLT14 De-En dev set. Results are BLEU scores with the teacher rescoring 9 candidates
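The "re-scoring 9 candidates" setup in Tables 2 and 3 is commonly realized by decoding several NAT hypotheses with different target lengths and letting the AT teacher pick the most probable one; the sketch below follows that reading, and `nat_model.translate` / `at_teacher.score` are hypothetical interfaces, not the paper's code.

```python
import torch

def rescore_candidates(nat_model, at_teacher, src, predicted_len, num_candidates=9):
    """Decode one NAT hypothesis per candidate target length around the
    predicted length, then keep the hypothesis with the highest AT-teacher
    log-probability. All interfaces here are illustrative."""
    half = num_candidates // 2
    lengths = [max(1, predicted_len + offset) for offset in range(-half, half + 1)]
    hypotheses = [nat_model.translate(src, tgt_len=L) for L in lengths]
    # Teacher score: total log-probability of each hypothesis given the source.
    scores = torch.tensor([at_teacher.score(src, hyp) for hyp in hypotheses])
    return hypotheses[scores.argmax().item()]
```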
Reference
  • Anonymous. 2019. Hint-based training for non-autoregressive translation. Submitted to the International Conference on Learning Representations, under review.
  • Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734. Doha, Qatar: Association for Computational Linguistics.
  • Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
  • Gu, J.; Bradbury, J.; Xiong, C.; Li, V. O.; and Socher, R. 2018. Non-autoregressive neural machine translation. In International Conference on Learning Representations.
  • Hassan, H.; Aue, A.; Chen, C.; Chowdhary, V.; Clark, J.; Federmann, C.; Huang, X.; Junczys-Dowmunt, M.; Lewis, W.; Li, M.; et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.
  • He, D.; Xia, Y.; Qin, T.; Wang, L.; Yu, N.; Liu, T.; and Ma, W.-Y. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, 820–828.
  • Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Kaiser, L.; Bengio, S.; Roy, A.; Vaswani, A.; Parmar, N.; Uszkoreit, J.; and Shazeer, N. 2018. Fast decoding in sequence models using discrete latent variables. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2395–2404.
  • Kim, Y., and Rush, A. M. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1317–1327.
  • Kingma, D. P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; and Welling, M. 2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 4743–4751.
  • Lee, J.; Mansimov, E.; and Cho, K. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.
  • Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
  • Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.
  • Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
  • Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–1725.
  • Tu, Z.; Liu, Y.; Lu, Z.; Liu, X.; and Li, H. 2016a. Context gates for neural machine translation. arXiv preprint arXiv:1608.06043.
  • Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016b. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 76–85.
  • Tu, Z.; Liu, Y.; Shang, L.; Liu, X.; and Li, H. 2017. Neural machine translation with reconstruction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 3097–3103.
  • van den Oord, A.; Li, Y.; Babuschkin, I.; Simonyan, K.; Vinyals, O.; Kavukcuoglu, K.; van den Driessche, G.; Lockhart, E.; Cobo, L.; Stimberg, F.; Casagrande, N.; Grewe, D.; Noury, S.; Dieleman, S.; Elsen, E.; Kalchbrenner, N.; Zen, H.; Graves, A.; King, H.; Walters, T.; Belov, D.; and Hassabis, D. 2018. Parallel WaveNet: fast high-fidelity speech synthesis. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 3918–3926.
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
  • Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; Klingner, J.; Shah, A.; Johnson, M.; Liu, X.; Kaiser, Ł.; Gouws, S.; Kato, Y.; Kudo, T.; Kazawa, H.; Stevens, K.; Kurian, G.; Patil, N.; Wang, W.; Young, C.; Smith, J.; Riesa, J.; Rudnick, A.; Vinyals, O.; Corrado, G.; Hughes, M.; and Dean, J. 2016. Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.