# Dual Transfer Learning for Neural Machine Translation with Marginal Distribution Regularization

AAAI, 2018.

Abstract:

Neural machine translation (NMT) heavily relies on parallel bilingual data for training. Since large-scale, high-quality parallel corpora are usually costly to collect, it is appealing to exploit monolingual corpora to improve NMT. Inspired by the law of total probability, which connects the probability of a given target-side monolingual …

Introduction

- Machine translation aims at mapping a sentence from the source language space X into the target language space Y.
- While neural networks have led to better performance, the huge number of parameters in an NMT model, usually tens of millions, poses a major challenge: training heavily relies on large-scale parallel bilingual corpora.
- Neural machine translation systems are typically implemented based on an encoder-decoder neural network framework, which learns a conditional probability P(y|x) from a source language sentence x to a target language sentence y.
- Parallel bilingual corpora are usually quite limited in either quantity or coverage, making it appealing to exploit large-scale monolingual corpora to improve NMT
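The probabilistic connection the paper builds on can be written out explicitly. The following is a sketch in my own notation (the symbols q, K, and \hat{P} are assumptions inferred from the abstract and conclusion, not taken verbatim from the paper):

```latex
% Law of total probability over source sentences x:
P(y) = \sum_{x} P(x)\, P(y \mid x; \theta)

% Monte Carlo estimate via importance sampling from a proposal q(x \mid y)
% (e.g. a reverse translation model), with K sampled sources x^{(i)} \sim q(\cdot \mid y):
P(y) \approx \frac{1}{K} \sum_{i=1}^{K} \frac{P(x^{(i)})\, P(y \mid x^{(i)}; \theta)}{q(x^{(i)} \mid y)}

% A squared-gap regularizer against an empirical marginal \hat{P}(y)
% (e.g. a language model) over the monolingual corpus \mathcal{M}:
S(\theta) = \sum_{y \in \mathcal{M}} \Big[ \log \hat{P}(y)
  - \log \frac{1}{K} \sum_{i=1}^{K} \frac{P(x^{(i)})\, P(y \mid x^{(i)}; \theta)}{q(x^{(i)} \mid y)} \Big]^{2}
```

A well-trained translation model should make the sampled estimate agree with the marginal computed from monolingual data alone; the regularizer penalizes the squared disagreement.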

Highlights

- Machine translation aims at mapping a sentence from the source language space X into the target language space Y
- For the translation from English to French, our method outperforms the RNNSearch model with the MLE training objective by 2.93 BLEU points, and outperforms the strongest baseline, dual-NMT, by 0.79 points
- For the translation from German to English, our method outperforms the basic NMT model by 1.36 points, and outperforms dual-NMT by 0.3 points
- We have proposed a new method, dual transfer learning, to leverage monolingual corpora from a probabilistic perspective for neural machine translation
- The central idea is to exploit the probabilistic connection between the marginal distribution and the conditional distribution using the law of total probability
- We will enrich the theoretical study to better understand dual transfer learning with marginal distribution regularization

Methods

- Settings and Datasets: The authors evaluated the approach on two translation tasks: English→French (En→Fr) and German→English (De→En).
- For the English→French task, the authors used a subset of the bilingual corpus from WMT’14 for training, which contains 12M sentence pairs.
- The validation and test sets for English→French contain 6k and 3k sentence pairs, respectively.
- For the German→English task, the bilingual corpus is from the IWSLT 2014 evaluation campaign (Cettolo et al. 2014), containing about 153k sentence pairs for training, and 7k/6.5k sentence pairs for validation/test.
- Baseline Methods: The authors compared the approach with several strong baselines, including a well-known attention-based NMT system, RNNSearch (Bahdanau, Cho, and Bengio 2015), a deep LSTM structure, and several semi-supervised NMT models:
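The marginal distribution regularizer at the heart of the method matches a language-model estimate of log P(y) against an importance-sampled estimate of log Σ_x P(x)P(y|x). A minimal sketch of that computation, assuming toy interfaces (the function names, the proposal model, and the sampling API are illustrative placeholders, not the authors' code):

```python
import math

def marginal_regularizer(y, lm_logprob, sample_source, src_lm_logprob,
                         tm_logprob, proposal_logprob, K=2):
    """Squared gap between the language-model marginal log P(y) and an
    importance-sampled estimate of log sum_x P(x) P(y|x).

    sample_source(y) draws a candidate source x from a proposal q(x|y),
    e.g. a reverse translation model; every *_logprob callable returns a
    log-probability. All names here are assumed for illustration.
    """
    log_weights = []
    for _ in range(K):
        x = sample_source(y)
        # log importance weight: log P(x) + log P(y|x; theta) - log q(x|y)
        log_weights.append(src_lm_logprob(x) + tm_logprob(y, x)
                           - proposal_logprob(x, y))
    # log of the mean of the K importance weights (stable log-sum-exp)
    m = max(log_weights)
    log_est = m + math.log(sum(math.exp(w - m) for w in log_weights)) - math.log(K)
    return (lm_logprob(y) - log_est) ** 2
```

In training, this term would be added, with a tunable weight, to the usual MLE loss over bilingual pairs, so that target-side monolingual sentences also contribute gradients through the sampled sources.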

Results

- The authors report the experiment results in this subsection. Table 1 shows the results of the method and three semi-supervised baselines with the aligned network structure.
- For the translation from English to French, the method outperforms the RNNSearch model with the MLE training objective by 2.93 BLEU points, and outperforms the strongest baseline, dual-NMT, by 0.79 points.
- For the translation from German to English, the method outperforms the basic NMT model by 1.36 points, and outperforms dual-NMT by 0.3 points.
- Improvements brought by the algorithm are significant compared with the basic NMT model.
- These results demonstrate the effectiveness of the algorithm

Conclusion

- The authors have proposed a new method, dual transfer learning, to leverage monolingual corpora from a probabilistic perspective for neural machine translation.
- A data-dependent regularization term is introduced to guide the training procedure to satisfy the probabilistic connection.
- The authors will enrich the theoretical study to better understand dual transfer learning with marginal distribution regularization.
- The authors will investigate the limits of the approach as the size of the monolingual data and the sample size K increase


- Table 1: BLEU scores on En→Fr and De→En translation tasks. Δ means the improvement over the basic NMT model, which only used bilingual data for training. The basic model for En→Fr is the RNNSearch model (Bahdanau, Cho, and Bengio 2015), and for De→En is a two-layer LSTM model. Note that all the methods for the same task share the same model structure
- Table 2: Deep NMT systems’ performances on En→Fr translation

Related work

- Exploring monolingual data for machine translation has attracted intensive attention in recent years. The methods proposed for this purpose can be divided into three categories: (1) integrating a language model trained on monolingual data into the NMT model, (2) generating pseudo sentence pairs from monolingual data, and (3) jointly training source-to-target and target-to-source translation models by minimizing reconstruction errors of monolingual sentences.

In the first category, a language model trained separately on monolingual data is integrated into the NMT model.

[Figure: BLEU of the dual translation model on the En–De test set versus training time (hours), for varying sample sizes.]

(Gulcehre et al. 2015) trained language models independently with target-side monolingual sentences, and incorporated them into the neural network during decoding, either by rescoring the beam or by adding the recurrent hidden state of the language model to the decoder states. (Jean et al. 2015) also reported experiments on reranking NMT outputs with a 5-gram language model. These methods use monolingual data only to train language models that improve NMT decoding; they do not touch the training of the NMT model itself.
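The rescoring idea in this first category amounts to a log-linear interpolation of the translation model's score with a separately trained language model's score over an n-best list. A minimal sketch (the interpolation weight and the callable interfaces are assumptions for illustration, not details from the cited papers):

```python
def rerank_with_lm(candidates, nmt_logprob, lm_logprob, alpha=0.3):
    """Pick the candidate translation maximizing a log-linear combination
    of the NMT model score and a language model score.

    alpha is an assumed interpolation weight, in practice tuned on
    validation data; both callables return log-probabilities.
    """
    def score(y):
        return (1.0 - alpha) * nmt_logprob(y) + alpha * lm_logprob(y)
    return max(candidates, key=score)
```

With alpha = 0 this degenerates to plain NMT decoding; a positive alpha lets fluent but lower-scored hypotheses win the reranking.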

In the second category, monolingual data is translated using a translation model trained on bilingual sentence pairs, and each sentence is paired with its translation to form a pseudo-parallel corpus that enlarges the training data. Specifically, (Bertoldi and Federico 2009; Lambert et al. 2011) back-translated target-side monolingual data into source-side sentences to produce synthetic parallel data for phrase-based SMT. A similar approach has also been applied to NMT, and back-translated synthetic parallel data has been found to have a more general use in NMT than in SMT, with positive effects that go beyond domain adaptation (Sennrich, Haddow, and Birch 2016). (Ueffing, Haffari, and Sarkar 2007) iteratively translated source-side monolingual data and added the reliable translations to the training data in an SMT system, thus improving the translation model with its own translations. For these methods, there is no guarantee on the quality of the generated pseudo bilingual sentence pairs, which may limit the performance gain.
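The back-translation pipeline described above is simple to state: run a reverse (target-to-source) model over target-side monolingual text and concatenate the synthetic pairs with the real bilingual data. A minimal sketch, with `reverse_translate` standing in for a trained target-to-source model (the function names are assumed, not from the cited papers):

```python
def back_translate(monolingual_targets, reverse_translate):
    """Turn target-side monolingual sentences into synthetic
    (source, target) training pairs, as in back-translation for NMT."""
    return [(reverse_translate(y), y) for y in monolingual_targets]

def augment(bilingual_pairs, monolingual_targets, reverse_translate):
    """Concatenate real bilingual pairs with the synthetic pairs."""
    return bilingual_pairs + back_translate(monolingual_targets, reverse_translate)
```

Note that the target side of each synthetic pair is genuine text while the source side is machine output, which is why the quality caveat above applies only to the synthetic sources.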

Funding

- This research was partially supported by grants from the National Key Research and Development Program of China (Grant No.2016YFB1000904), and the National Natural Science Foundation of China (Grants No.61727809)

Reference

- Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. ICLR.
- Bertoldi, N., and Federico, M. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the fourth workshop on statistical machine translation, 182–189. Association for Computational Linguistics.
- Britz, D.; Goldie, A.; Luong, T.; and Le, Q. 2017. Massive exploration of neural machine translation architectures. ACL.
- Cettolo, M.; Niehues, J.; Stuker, S.; Bentivogli, L.; and Federico, M. 2014. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam.
- Cheng, Y.; Xu, W.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Semi-supervised learning for neural machine translation. arXiv preprint arXiv:1606.04596.
- Cochran, W. G. 1977. Sampling techniques. John Wiley.
- Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.
- Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Barrault, L.; Lin, H.C.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
- He, D.; Xia, Y.; Qin, T.; Wang, L.; Yu, N.; Liu, T.; and Ma, W.-Y. 2016a. Dual learning for machine translation. In Advances in Neural Information Processing Systems, 820–828.
- He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016b. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
- He, D.; Lu, H.; Xia, Y.; Qin, T.; Wang, L.; and Liu, T.-Y. 2017. Decoding with value networks for neural machine translation. In Advances in Neural Information Processing Systems.
- Hesterberg, T. C. 1988. Advances in importance sampling. Ph.D. Dissertation, Stanford University.
- Hesterberg, T. 1995. Weighted average importance sampling and defensive mixture distributions. Technometrics 37(2):185–194.
- Jean, S.; Firat, O.; Cho, K.; Memisevic, R.; and Bengio, Y. 2015. Montreal neural machine translation systems for wmt15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, 134–140.
- Kalchbrenner, N.; Espeholt, L.; Simonyan, K.; Oord, A. v. d.; Graves, A.; and Kavukcuoglu, K. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
- Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lambert, P.; Schwenk, H.; Servan, C.; and Abdul-Rauf, S. 2011. Investigations on translation model adaptation using monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, 284–293. Association for Computational Linguistics.
- Long, M.; Cao, Y.; Wang, J.; and Jordan, M. 2015. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, 97–105.
- Long, M.; Zhu, H.; Wang, J.; and Jordan, M. I. 2016. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, 136–144.
- Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. EMNLP.
- Mikolov, T.; Karafiat, M.; Burget, L.; Cernocky, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In Interspeech, volume 2, 3.
- Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, 311–318. Association for Computational Linguistics.
- Raina, R.; Battle, A.; Lee, H.; Packer, B.; and Ng, A. Y. 2007. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, 759–766. ACM.
- Sennrich, R.; Haddow, B.; and Birch, A. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- Sennrich, R.; Haddow, B.; and Birch, A. 2016. Improving neural machine translation models with monolingual data. In Annual Meeting of the Association for Computational Linguistics.
- Sundermeyer, M.; Schluter, R.; and Ney, H. 2012. Lstm neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
- Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 3104–3112.
- Tu, Z.; Liu, Y.; Shang, L.; Liu, X.; and Li, H. 2017. Neural machine translation with reconstruction. In AAAI, 3097–3103.
- Ueffing, N.; Haffari, G.; and Sarkar, A. 2007. Semi-supervised model adaptation for statistical machine translation. Machine Translation 21(2):77–94.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Xia, Y.; Bian, J.; Qin, T.; Yu, N.; and Liu, T.-Y. 2017a. Dual inference for machine learning. In Proceedings of the TwentySixth International Joint Conference on Artificial Intelligence, 3112–3118.
- Xia, Y.; Qin, T.; Chen, W.; Bian, J.; Yu, N.; and Liu, T.-Y. 2017b. Dual supervised learning. In International Conference on Machine Learning, 3789–3798.
- Xia, Y.; Tian, F.; Wu, L.; Lin, J.; Qin, T.; and Liu, T.-Y. 2017c. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems.
- Zaremba, W.; Sutskever, I.; and Vinyals, O. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
- Zeiler, M. D. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
- Zhou, J.; Cao, Y.; Wang, X.; Li, P.; and Xu, W. 2016. Deep recurrent models with fast-forward connections for neural machine translation. arXiv preprint arXiv:1606.04199.
