Depth Growing for Neural Machine Translation

ACL (1), pp. 5558-5563, 2019.

Cited by: 8 | DOI: https://doi.org/10.18653/v1/p19-1558
Other Links: academic.microsoft.com|dblp.uni-trier.de|arxiv.org
We propose a new training strategy with three specially designed components, including cross-module residual connection, hierarchical encoder-decoder attention and deep-shallow decoding, to construct and train deep neural machine translation models

Abstract:

While very deep neural networks have shown effectiveness for computer vision and text classification applications, how to increase the network depth of neural machine translation (NMT) models for better translation quality remains a challenging problem. Directly stacking more blocks onto the NMT model results in no improvement or even a drop in performance. In this work, the authors propose an effective two-stage approach with three specially designed components to construct deeper NMT models, which yields significant improvements over strong Transformer baselines on the WMT14 English→German and English→French translation tasks.

Introduction
Highlights
  • Neural machine translation, which is built upon deep neural networks, has made rapid progress in recent years (Bahdanau et al., 2014; Sutskever et al., 2014; Sennrich et al., 2015; He et al., 2016a; Sennrich et al., 2016a; Xia et al., 2017; Wang et al., 2019) and achieved significant improvements in translation quality (Hassan et al., 2018)
  • We propose a new training strategy with three specially designed components, including cross-module residual connection, hierarchical encoder-decoder attention and deep-shallow decoding, to construct and train deep neural machine translation models (see the encoder sketch after this list)
  • We show that our approach can effectively construct deeper models with significantly better performance than the state-of-the-art Transformer baseline
  • Although only empirical studies on the Transformer are presented in this paper, our proposed strategy is a general approach that can be universally applied to arbitrary model architectures, including LSTM and CNN
  • We will further explore an efficient strategy that can jointly train all modules of the deep model with minimal increase in training complexity
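To make the structure concrete, below is a minimal PyTorch sketch of a "grown" encoder with a pre-trained bottom module of N blocks, a newly added top module of M blocks, and a cross-module residual connection. The class name DepthGrownEncoder, the hyperparameter defaults, and the use of nn.TransformerEncoderLayer are illustrative assumptions, not the authors' released implementation; deep-shallow decoding is not shown.

```python
import torch
import torch.nn as nn


class DepthGrownEncoder(nn.Module):
    """Encoder with a pre-trained bottom module and a newly grown top module.

    Illustrative sketch only: names and defaults are assumptions, not the
    authors' released code.
    """

    def __init__(self, d_model=512, nhead=8, n_bottom=6, m_top=2, dropout=0.1):
        super().__init__()

        def block():
            return nn.TransformerEncoderLayer(
                d_model, nhead, dim_feedforward=2048, dropout=dropout
            )

        self.bottom = nn.ModuleList(block() for _ in range(n_bottom))  # stage-1, pre-trained
        self.top = nn.ModuleList(block() for _ in range(m_top))        # stage-2, newly grown

    def forward(self, x):
        # x: (seq_len, batch, d_model) token embeddings with positional encoding added
        h = x
        for layer in self.bottom:
            h = layer(h)
        h1 = h            # bottom-module representation
        h = x + h1        # cross-module residual connection
        for layer in self.top:
            h = layer(h)
        h2 = h            # top-module representation
        return h1, h2     # both are exposed for the hierarchical decoder attention
```

The decoder can be grown in the same way, with its bottom blocks attending to h1 and its top blocks attending to h2 (a sketch of that hierarchical attention appears later on this page).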
Methods
  • Datasets: The authors conduct experiments to evaluate the effectiveness of the proposed method on two widely adopted benchmark datasets: the WMT14 English→German translation (En→De) and the WMT14 English→French translation (En→Fr).
  • Architecture: The basic encoder-decoder framework the authors use is the strong Transformer model.
  • The dropout rate is 0.3 for En→De and 0.1 for En→Fr. The authors set the number of encoder/decoder blocks for the bottom module as N = 6 following the common practice, and set the number of stacked blocks of the top module as M = 2.
  • The authors' models are implemented based on the PyTorch implementation of the Transformer, and the code can be found in the supplementary materials; a sketch of the second-stage setup is given below
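The bottom module is initialized from a pre-trained 6-block Transformer, and the future-work note above lists jointly training all modules as an open problem, so the second stage presumably updates only the newly grown top blocks. The sketch below illustrates such a stage-2 setup, reusing the hypothetical DepthGrownEncoder from the earlier sketch; the checkpoint path is hypothetical and the Adam settings are the usual Transformer defaults, assumed rather than taken from the paper.

```python
import torch

# Hypothetical second-stage setup: reuse a pre-trained 6-block encoder as the
# bottom module, freeze it, and optimize only the newly grown top blocks.
model = DepthGrownEncoder(n_bottom=6, m_top=2, dropout=0.3)  # 0.3 for En->De, 0.1 for En->Fr

state = torch.load("pretrained_6block_encoder.pt")  # hypothetical checkpoint file
model.bottom.load_state_dict(state)                 # initialize the bottom module

for p in model.bottom.parameters():                 # keep the pre-trained part fixed
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),  # i.e. only the top module
    lr=1e-4, betas=(0.9, 0.98), eps=1e-9,                # assumed standard Transformer settings
)
```

In a full system the decoder would be grown and frozen in the same way; only the encoder side is shown here.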
Results
  • The authors compare their method (Ours) with Transformer baselines of 6 blocks (6B) and 8 blocks (8B), and with a 16-block Transformer with transparent attention (Transparent Attn (16B)) (Bapna et al., 2018).
  • The authors reproduce a 6-block Transformer baseline, which has better performance than that reported in Vaswani et al. (2017), and use it to initialize the bottom module of their model.
  • From the results in Table 1, the authors see that the proposed approach enables effective training of deeper networks and achieves significantly better performance than the baselines.
  • The improvements are statistically significant with p < 0.01 in paired bootstrap sampling (Koehn, 2004); an illustrative sketch of this test is given below
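For reference, a paired bootstrap significance test in the style of Koehn (2004) can be sketched as follows. The use of sacrebleu and the function name are assumptions; the authors do not state which implementation they used.

```python
import random
import sacrebleu


def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=0):
    """Paired bootstrap resampling (Koehn, 2004) on corpus BLEU.

    sys_a, sys_b, refs: lists of sentence strings of equal length.
    Returns the approximate p-value for "sys_a is not better than sys_b".
    """
    rng = random.Random(seed)
    n = len(refs)
    losses = 0
    for _ in range(n_samples):
        sample = [rng.randrange(n) for _ in range(n)]  # resample sentence indices with replacement
        a = [sys_a[i] for i in sample]
        b = [sys_b[i] for i in sample]
        r = [refs[i] for i in sample]
        if sacrebleu.corpus_bleu(a, [r]).score <= sacrebleu.corpus_bleu(b, [r]).score:
            losses += 1
    return losses / n_samples


# p < 0.01 means sys_a outperforms sys_b in more than 99% of the resamples.
```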
Conclusion
  • The authors propose a new training strategy with three specially designed components, including cross-module residual connection, hierarchical encoder-decoder attention and deep-shallow decoding, to construct and train deep NMT models.
  • Although only empirical studies on the Transformer are presented in this paper, the proposed strategy is a general approach that can be universally applied to arbitrary model architectures, including LSTM and CNN.
  • The authors will further explore an efficient strategy that can jointly train all modules of the deep model with minimal increase in training complexity
Summary
  • Introduction:

    Neural machine translation, which is built upon deep neural networks, has made rapid progress in recent years (Bahdanau et al., 2014; Sutskever et al., 2014; Sennrich et al., 2015; He et al., 2016a; Sennrich et al., 2016a; Xia et al., 2017; Wang et al., 2019) and achieved significant improvements in translation quality (Hassan et al., 2018).
  • The NMT models are generally constructed with up to 6 encoder and decoder blocks, both in state-of-the-art research and in the champion systems of machine translation competitions.
  • The LSTM-based models usually stack 4 (Stahlberg et al., 2018) or 6 (Chen et al., 2018) blocks, and the state-of-the-art Transformer models are equipped with a 6-block encoder and decoder (Vaswani et al., 2017; Junczys-Dowmunt, 2018; Edunov et al., 2018).
  • Increasing the NMT model depth by directly stacking more blocks results in no improvement or a performance drop (Figure 1), and can even lead to optimization failure (Bapna et al., 2018)
  • Methods:

    Datasets: The authors conduct experiments to evaluate the effectiveness of the proposed method on two widely adopted benchmark datasets: the WMT14 English→German translation (En→De) and the WMT14 English→French translation (En→Fr).
  • Architecture: The basic encoder-decoder framework the authors use is the strong Transformer model.
  • The dropout rate is 0.3 for En→De and 0.1 for En→Fr. The authors set the number of encoder/decoder blocks for the bottom module as N = 6 following the common practice, and set the number of stacked blocks of the top module as M = 2.
  • The authors' models are implemented based on the PyTorch implementation of the Transformer, and the code can be found in the supplementary materials
  • Results:

    The authors compare their method (Ours) with Transformer baselines of 6 blocks (6B) and 8 blocks (8B), and with a 16-block Transformer with transparent attention (Transparent Attn (16B)) (Bapna et al., 2018).
  • The authors reproduce a 6-block Transformer baseline, which has better performance than that reported in Vaswani et al. (2017), and use it to initialize the bottom module of their model.
  • From the results in Table 1, the authors see that the proposed approach enables effective training of deeper networks and achieves significantly better performance than the baselines.
  • The improvements are statistically significant with p < 0.01 in paired bootstrap sampling (Koehn, 2004)
  • Conclusion:

    The authors propose a new training strategy with three specially designed components, including cross-module residual connection, hierarchical encoder-decoder attention and deep-shallow decoding, to construct and train deep NMT models.
  • Although only empirical studies on the Transformer are presented in this paper, the proposed strategy is a general approach that can be universally applied to arbitrary model architectures, including LSTM and CNN.
  • The authors will further explore an efficient strategy that can jointly train all modules of the deep model with minimal increase in training complexity
Tables
  • Table 1: Test set performance on the WMT14 En→De and En→Fr translation tasks. '†' denotes performance figures reported in previous work
Funding
  • Proposes an effective two-stage approach with three specially designed components to construct deeper NMT models, which results in significant improvements over the strong Transformer baselines on the WMT14 English→German and English→French translation tasks
  • Explores the potential of leveraging deep neural networks for NMT and proposes a new approach to construct and train deeper NMT models
  • Evaluates our approach on two large-scale benchmark datasets, WMT14 English→German and English→French translations
  • (2) Hierarchical encoder-decoder attention: introduces a hierarchical encoder-decoder attention calculated with different contextual representations, as shown in Eqn. (2) of the paper, where h1 is used as the key and value for attn1 in the bottom module, and h2 for attn2 in the top module (see the sketch below)
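A minimal PyTorch sketch of this hierarchical encoder-decoder attention is given below, assuming the two-module encoder and decoder structure described above. The class name is hypothetical, and the rest of the decoder block (self-attention, feed-forward, layer norm) is omitted for brevity.

```python
import torch
import torch.nn as nn


class HierarchicalCrossAttention(nn.Module):
    """attn1 attends over the bottom encoder output h1, attn2 over the top output h2.

    Hypothetical sketch of the hierarchical encoder-decoder attention only;
    the surrounding decoder-block components are not shown.
    """

    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(d_model, nhead)  # used by the bottom decoder module
        self.attn2 = nn.MultiheadAttention(d_model, nhead)  # used by the top decoder module

    def forward(self, s_bottom, s_top, h1, h2):
        # s_bottom / s_top: decoder states of the bottom / top modules (seq, batch, d_model)
        # h1 / h2: encoder outputs of the bottom / top modules, used as keys and values
        c1, _ = self.attn1(query=s_bottom, key=h1, value=h1)
        c2, _ = self.attn2(query=s_top, key=h2, value=h2)
        return c1, c2
```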
References
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. 2018. Training deeper neural machine translation models with transparent attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3028–3033.
  • Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. 2018. The best of both worlds: Combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500.
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1243–1252. JMLR.org.
  • Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.
  • Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016a. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
  • Marcin Junczys-Dowmunt. 2018. Microsoft's submission to the WMT2018 news translation task: How I learned to stop worrying and love the data. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 425–430.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
  • Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.
  • Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387.
  • Felix Stahlberg, Adrià de Gispert, and Bill Byrne. 2018. The University of Cambridge's machine translation systems for WMT18. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 504–512.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Mingxuan Wang, Zhengdong Lu, Jie Zhou, and Qun Liu. 2017. Deep neural machine translation with linear associative unit. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145.
  • Yiren Wang, Yingce Xia, Tianyu He, Fei Tian, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019. Multi-agent dual learning. In International Conference on Learning Representations.
  • Lijun Wu, Fei Tian, Li Zhao, Jianhuang Lai, and Tie-Yan Liu. 2018. Word attention for sequence to sequence text understanding. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems, pages 1784–1794.
  • Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics, 4:371–383.
  • Zhi-Hua Zhou. 2012. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC.