Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information

EMNLP 2020, pp. 2649–2663


Abstract

We investigate the following question for machine translation (MT): can we develop a single universal MT model to serve as the common seed and obtain derivative and improved models on arbitrary language pairs? We propose mRASP, an approach to pre-train a universal multilingual neural machine translation model. Our key idea in mRASP is its novel technique of random aligned substitution, which brings words and phrases with similar meanings across multiple languages closer in the representation space. We pre-train a mRASP model on 32 language pairs jointly with only public datasets. The model is then fine-tuned on downstream language pairs to obtain specialized MT models.
Introduction
Highlights
  • Pre-trained language models such as BERT have been highly effective for NLP tasks (Peters et al, 2018; Devlin et al, 2019; Radford et al, 2019; Conneau and Lample, 2019; Liu et al, 2019; Yang et al, 2019)
  • We propose multilingual Random Aligned Substitution Pre-training (mRASP), a method to pre-train a machine translation (MT) model for many languages, which can be used as a common initial model to fine-tune on arbitrary language pairs. mRASP improves translation performance compared to MT models trained directly on the downstream parallel data
  • We propose mRASP, an effective pre-training method that can be fine-tuned on any language pair in NMT
  • In extremely low-resource settings such as En-Be (Belarusian), where the data is too scarce to train an NMT model properly, initializing from the pre-trained model boosts performance
  • We propose a multilingual neural machine translation pre-training model
  • Extensive experiments are conducted on different scenarios, including low/medium/rich resource and exotic corpus, demonstrating the efficacy of mRASP
Method
  • The authors introduce the proposed mRASP and its training details.
  • Architecture: The authors adopt a standard Transformer-large architecture (Vaswani et al, 2017) with a 6-layer encoder and a 6-layer decoder.
  • A multilingual neural machine translation model learns a many-to-many mapping function f to translate from one language to another.
  • The pre-training loss maximizes the likelihood of the target sentence given the substituted source, where xi represents a sentence in language Li, θ denotes the parameters of mRASP, and C(xi) is the proposed alignment function, which randomly replaces words in xi with their counterparts in a different language (see the sketch after this list).
  • In the pre-training phase, the model jointly learns all the translation pairs.
  • For example, in the En→Fr pair “How are you? -> Comment vas tu?”, randomly chosen English source words are replaced with their dictionary translations in another language (e.g., “How” becomes its French translation “Comment”), while the target side “Comment vas tu?” is left unchanged.
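  • A minimal sketch of the pre-training objective, reconstructed from the definitions above rather than copied verbatim from the paper, in LaTeX notation:

        \mathcal{L}_{pre}(\theta) = -\sum_{i,j} \mathbb{E}_{(x^i, x^j) \sim \mathcal{D}_{i,j}} \left[ \log P_\theta\!\left( x^j \mid C(x^i) \right) \right]

  • The alignment function C(·) can be sketched in Python as follows; the dictionary, replacement probability, and sampling scheme are illustrative assumptions, not the paper's exact configuration:

        import random

        def random_aligned_substitution(tokens, bilingual_dict, replace_prob=0.3, rng=random):
            """Minimal sketch of the RAS alignment function C(x)."""
            out = []
            for tok in tokens:
                candidates = bilingual_dict.get(tok.lower())
                if candidates and rng.random() < replace_prob:
                    # Replace a source word with a randomly chosen dictionary translation,
                    # pulling words with similar meanings across languages closer together.
                    out.append(rng.choice(candidates))
                else:
                    out.append(tok)
            return out

        # Toy En->Fr example; the target side "Comment vas tu ?" is left unchanged.
        src = "How are you ?".split()
        en_fr_dict = {"how": ["comment"], "you": ["tu", "vous"]}  # toy dictionary
        print(random_aligned_substitution(src, en_fr_dict))
        # e.g. ['comment', 'are', 'tu', '?']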
Results
  • The authors first conduct experiments on the low-resource and medium-resource datasets, where multilingual translation usually obtains significant improvements.
  • As illustrated in Table 1, the authors obtain significant gains in all datasets.
  • In extremely low-resource settings such as En-Be (Belarusian), where the data is too scarce to train an NMT model properly, initializing from the pre-trained model boosts performance.
  • The authors obtain consistent improvements in low and medium resource datasets.
  • The authors observe that as the scale of the downstream dataset increases, the gap between the randomly initialized baseline and the pre-trained model narrows.
  • It is worth noting that, on the En→De benchmark, the authors obtain a gain of 1.0 BLEU points
Conclusion
  • The authors propose a multilingual neural machine translation pre-training model.
  • To bridge the semantic space between different languages, the authors incorporate word alignment into the pre-training model.
  • Extensive experiments are conducted on different scenarios, including low/medium/rich resource and exotic corpus, demonstrating the efficacy of mRASP.
  • The authors conduct a set of analytical experiments to quantify these effects, showing that the alignment information does bridge the gap between languages as well as boost performance.
  • The authors plan to pre-train on larger corpora to further boost performance
Tables
  • Table1: Fine-tuning performance in extremely low / low / medium resource machine translation settings. The numbers in parentheses indicate the size of the parallel corpus used for fine-tuning. Pre-training with mRASP and then fine-tuning on downstream MT tasks consistently improves over MT models trained directly on the bilingual parallel corpora
  • Table2: Fine-tuning performance for popular medium and rich resource MT tasks. For fair comparison, we report detokenized BLEU on WMT newstest18 for Zh→En and tokenized BLEU on WMT newstest14 for En→Fr and En→De (see the BLEU evaluation sketch after this table list). Note that unlike previous methods (except CTNMT), which do not improve in rich-resource settings, mRASP is again able to consistently improve downstream MT performance. This is the first verification that low-resource language pairs can be utilized to improve rich-resource MT
  • Table3: Fine-tuning MT performance on exotic language corpora. For a translation direction A→B, the four categories are: exotic pair, where A and B both occur in the pre-training corpus but no sentence pairs of (A, B) occur; exotic full, where no sentences in either A or B occur in pre-training; exotic source, where sentences from the target side B occur in pre-training but not from the source side A; and exotic target, where sentences from the source side A occur in pre-training but not from the target side B. Notice that pre-training with mRASP and fine-tuning on these exotic languages consistently obtains significant improvements in MT performance in each category
  • Table4: Comparison with previous pre-training models on WMT16 En-Ro. Following Liu et al (2020), we report detokenized BLEU. mRASP reaches comparable results on both En→Ro and Ro→En. By combining back-translation, the performance is further boosted by 2 BLEU points on Ro→En
  • Table5: Comprehensive comparison with mBART. mRASP outperforms mBART on MT for all but two language pairs
  • Table6: MT performance of mRASP with and without the RAS technique and the fine-tuning strategy. mRASP includes both the RAS technique and the fine-tuning strategy. “w/o ft” denotes “without fine-tuning”. We also report mRASP without fine-tuning and without RAS, to compare against mRASP without fine-tuning. Both RAS and fine-tuning prove effective and essential for mRASP
  • Table7: The MT performance of three language pairs with and without alignment information (mRASP w/o RAS) at pre-training phase. We see consistent performance gains for mRASP with RAS
  • Table8: Statistics of the dataset PC32 for pre-training. Each entry shows the number of parallel sentence pairs between English and other language X
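A minimal evaluation sketch for the two BLEU conventions mentioned in Tables 2 and 4, assuming the sacrebleu Python package and hypothetical file names rather than the authors' exact scripts:

    import sacrebleu

    # Detokenized BLEU (Post, 2018), as reported for Zh->En newstest18 and En-Ro:
    # score raw, detokenized system output against raw references.
    hyps = [line.strip() for line in open("newstest18.zh-en.hyp")]  # hypothetical paths
    refs = [line.strip() for line in open("newstest18.zh-en.ref")]
    print(sacrebleu.corpus_bleu(hyps, [refs]).score)

    # Tokenized BLEU, as reported for En->Fr and En->De newstest14, is conventionally
    # computed on pre-tokenized output (e.g. with Moses' multi-bleu.perl), so the two
    # kinds of scores are not directly comparable.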
Related Work
  • Multilingual NMT aims at taking advantage of multilingual data to improve NMT for all languages involved, and has been extensively studied in a number of papers such as Dong et al (2015); Johnson et al (2017); Lu et al (2018); Rahimi et al (2019); Tan et al (2019). The most related work to mRASP is Rahimi et al (2019), which performs extensive experiments in training massively multilingual NMT models. They show that multilingual many-to-many models are effective in low-resource settings. Inspired by their work, we believe that the translation quality of low-resource language pairs may improve when they are trained together with rich-resource ones. However, we differ in at least two aspects: a) Our goal is to find the best practice for a single language pair with multilingual pre-training. Multilingual NMT usually achieves inferior accuracy compared with its counterpart that trains an individual model for each language pair, when there are dozens of language pairs. b) Different from multilingual NMT, mRASP can obtain improvements on rich-resource language pairs, such as English-French.
Study Subjects and Analysis
language pairs: 32
Our key idea in mRASP is its novel technique of random aligned substitution, which brings words and phrases with similar meanings across multiple languages closer in the representation space. We pre-train a mRASP model on 32 language pairs jointly with only public datasets. The model is then fine-tuned on downstream language pairs to obtain specialized MT models

English-centric language pairs: 32
2.2 Pre-training Data. We collect 32 English-centric language pairs, resulting in 64 directed translation pairs in total. English serves as an anchor language bridging all other languages, as sketched below
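A minimal sketch of how 32 English-centric corpora yield 64 directed translation pairs; the <lang> token format is an illustrative assumption in the spirit of multilingual NMT systems, not necessarily the paper's exact markup:

    def make_directed_pairs(en_centric_corpora):
        """Turn English-centric bitext {lang: [(en_sent, x_sent), ...]} into
        directed training examples tagged with (assumed) language tokens."""
        examples = []
        for lang, pairs in en_centric_corpora.items():
            for en_sent, x_sent in pairs:
                examples.append((f"<en> {en_sent}", f"<{lang}> {x_sent}"))  # En -> X
                examples.append((f"<{lang}> {x_sent}", f"<en> {en_sent}"))  # X -> En
        return examples

    # 32 English-centric pairs -> 64 directed pairs, with English as the anchor.
    corpora = {"fr": [("How are you ?", "Comment vas tu ?")]}
    for src, tgt in make_directed_pairs(corpora):
        print(src, "->", tgt)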

pairs: 14
3.1 Experiment Settings. Datasets: We collect 14 pairs of parallel corpora to simulate different scenarios. Most of the En-X parallel datasets are taken from the pre-training phase, to avoid introducing new information

pairs: 3
For En-Af, we observe that the overlap between the two embedding spaces becomes larger. We also randomly plot the positions of three pairs of words, where each pair has the same meaning in different languages. The average cosine similarity is computed by simply adding the embeddings of all subwords constituting a word; see the sketch below
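A small sketch of the analysis described above: a word vector is obtained by adding the embeddings of its constituent subwords, and cross-lingual word pairs are compared with cosine similarity. The subword segmentation and embedding table here are toy assumptions:

    import numpy as np

    def word_vector(subwords, subword_emb):
        """Add the embeddings of the subwords constituting a word."""
        return np.sum([subword_emb[sw] for sw in subwords], axis=0)

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Toy embeddings standing in for the encoder's subword embedding table.
    rng = np.random.default_rng(0)
    subword_emb = {sw: rng.normal(size=8) for sw in ["bon", "jour", "hel", "lo"]}
    v_fr = word_vector(["bon", "jour"], subword_emb)  # "bonjour" = bon + jour
    v_en = word_vector(["hel", "lo"], subword_emb)    # "hello"   = hel + lo
    print(cosine(v_fr, v_en))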

References
  • Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. CoRR, abs/1602.01925.
  • Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8440–8451. Association for Computational Linguistics.
  • Alexis Conneau and Guillaume Lample. 2019. Crosslingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 7057–7067.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
  • Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1723–1732. The Association for Computer Linguistics.
  • Xavier Garcia, Pierre Foret, Thibault Sellam, and Ankur P. Parikh. 2020. A multilingual view of unsupervised machine translation. CoRR, abs/2002.02955.
  • Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. CoRR, abs/1611.04798.
  • Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415.
  • Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. 2019. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 2485–2494. Association for Computational Linguistics.
  • Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viegas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Linguistics, 5:339–351.
  • Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
  • Guillaume Lample, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. 2018b. Word translation without parallel data. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pretraining for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.
  • Bei Li, Yinqiao Li, Chen Xu, Ye Lin, Jiqiang Liu, Hui Liu, Ziyang Wang, Yuhao Zhang, Nuo Xu, Zeyang Wang, Kai Feng, Hexuan Chen, Tengbo Liu, Yanyang Li, Qiang Wang, Tong Xiao, and Jingbo Zhu. 2019. The niutrans machine translation systems for WMT19. In Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1, pages 257–266. Association for Computational Linguistics.
  • Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. CoRR, abs/2001.08210.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, and Jason Sun. 2018. A neural interlingua for multilingual machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 84–92. Association for Computational Linguistics.
  • Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.
  • Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 1–9. Association for Computational Linguistics.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543. ACL.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.
  • Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 186–191. Association for Computational Linguistics.
  • Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACLHLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 529–535. Association for Computational Linguistics.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  • Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019. Massively multilingual transfer for NER. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 151–164. Association for Computational Linguistics.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
  • Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and TieYan Liu. 2019. MASS: masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936. PMLR.
  • Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.
  • Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 1006– 1011. The Association for Computational Linguistics.
  • Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Weinan Zhang, Yong Yu, and Lei Li. 2020. Towards making the most of BERT in neural machine translation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 9378–9385. AAAI Press.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 5754–5764.