Multi-task Learning for Multilingual Neural Machine Translation

Hany Hassan Awadalla

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 1022-1034.

DOI: https://doi.org/10.18653/V1/2020.EMNLP-MAIN.75

Abstract:

While monolingual data has been shown to be useful in improving bilingual neural machine translation (NMT), effectively and efficiently leveraging monolingual data for Multilingual NMT (MNMT) systems is a less explored area. In this work, we propose a multi-task learning (MTL) framework that jointly trains the model with the translation task on bitext data and two denoising tasks on the monolingual data.

Introduction
Highlights
  • Multilingual Neural Machine Translation (MNMT), which leverages a single neural machine translation (NMT) model to handle the translation of multiple languages, has drawn research attention in recent years (Dong et al., 2015; Firat et al., 2016a; Ha et al., 2016; Johnson et al., 2017; Arivazhagan et al., 2019)
  • We introduce two denoising language modeling tasks to help improve the quality of the translation model: the masked language model (MLM) task and the denoising auto-encoding (DAE) task (an illustrative sketch of the two noising schemes follows this list)
  • We compare the performance of bilingual models (Bilingual) with multilingual models trained on bitext only, trained on both bitext and back-translated data (+BT), and trained with the proposed multi-task learning (+MTL)
  • We propose a multi-task learning framework that jointly trains the model with the translation task on bitext data, the masked language modeling task on the source-side monolingual data and the denoising auto-encoding task on the target-side monolingual data
  • We show that the proposed MTL approach can effectively improve the performance of MNMT on both high-resource and low-resource languages by a large margin, and can significantly improve the translation quality for zero-shot language pairs without bitext training data
  • For the dynamic noising ratio, we set the masking ratio for MLM to increase from 10% to 20% and the blanking ratio for DAE to increase from 20% to 40%
  • We would like to explore the most sample-efficient strategy to add a new language to a trained MNMT system
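The two denoising tasks can be pictured with a small code sketch. The snippet below is illustrative only: it assumes token-level noising, with random masking for the MLM task and blanking plus light local shuffling for the DAE task; the shuffling step, the helper names and the default ratios are assumptions rather than details taken from the paper.

    import random

    MASK, BLANK = "[MASK]", "[BLANK]"

    def mlm_noise(tokens, mask_ratio=0.10):
        """MLM input: replace a fraction of tokens with [MASK]; the model is
        trained to predict the original tokens at the masked positions."""
        noisy = list(tokens)
        n_mask = max(1, int(len(tokens) * mask_ratio))
        for i in random.sample(range(len(tokens)), n_mask):
            noisy[i] = MASK
        return noisy

    def dae_noise(tokens, blank_ratio=0.20, shuffle_window=3):
        """DAE input: blank some tokens and lightly perturb the word order; the
        decoder is trained to reconstruct the original clean sentence."""
        noisy = [BLANK if random.random() < blank_ratio else t for t in tokens]
        keys = [i + random.uniform(0, shuffle_window) for i in range(len(noisy))]
        return [t for _, t in sorted(zip(keys, noisy))]

    # A single monolingual sentence can feed both auxiliary tasks.
    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(mlm_noise(sentence))
    print(dae_noise(sentence))
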
Methods
  • 4.1 Data

    The authors evaluate MTL on a multilingual setting with 10 languages to and from English (En), including French (Fr), Czech (Cs), German (De), Finnish (Fi), Latvian (Lv), Estonian (Et), Romanian (Ro), Hindi (Hi), Turkish (Tr) and Gujarati (Gu).

    Bitext data: The bitext training data comes from the WMT corpus.
  • Monolingual data: The monolingual data the authors use is mainly from NewsCrawl.
  • The authors apply a series of filtering rules to remove low-quality sentences, including duplicated sentences, sentences with too many punctuation marks or invalid characters, and sentences with too many or too few words (a minimal sketch of such filters follows this list).
  • The authors randomly select 5M filtered sentences for each language.
  • For low-resource languages without enough sentences from NewsCrawl, the authors leverage data from CCNet (Wenzek et al., 2019).
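A minimal sketch of such a filtering pass is given below. The thresholds and the function name are hypothetical (the summary does not report exact values); only the kinds of rules (deduplication, punctuation and invalid-character checks, length bounds) come from the description above.

    import re

    def filter_monolingual(lines, min_words=3, max_words=200,
                           max_punct_ratio=0.3, max_invalid_ratio=0.05):
        """Yield sentences that pass simple quality filters (illustrative thresholds)."""
        seen = set()
        for line in lines:
            sent = line.strip()
            words = sent.split()
            if not (min_words <= len(words) <= max_words):
                continue  # too few or too many words
            if sent in seen:
                continue  # duplicated sentence
            seen.add(sent)
            n_chars = max(len(sent), 1)
            n_punct = len(re.findall(r"[^\w\s]", sent))
            if n_punct / n_chars > max_punct_ratio:
                continue  # too many punctuation marks
            n_invalid = sum(ch == "\ufffd" or not ch.isprintable() for ch in sent)
            if n_invalid / n_chars > max_invalid_ratio:
                continue  # invalid characters
            yield sent

After filtering, the 5M sentences per language mentioned above can be drawn at random, e.g. with random.sample over the filtered list, or with reservoir sampling when the corpus does not fit in memory.
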
Results
  • 5.1 Main Results

    The authors compare the performance of bilingual models (Bilingual) with multilingual models trained on bitext only, trained on both bitext and back-translated data (+BT), and trained with the proposed multi-task learning (+MTL).
  • Translation results for the 10 languages translated to and from English are presented in Tables 1 and 2, respectively.
  • Bilingual vs Multilingual: The multilingual baselines perform better on lower-resource languages, but perform worse than individual bilingual models on high-resource languages like Fr, Cs and De.
  • This is consistent with previous observations (Arivazhagan et al., 2019) and holds across the three multilingual systems (i.e., X→En, En→X and X→X).
  • Multi-task learning: Models trained with multi-task learning (+MTL) significantly outperform the multilingual baselines for all language pairs in all three multilingual systems, demonstrating the effectiveness of the proposed framework.
Conclusion
  • The authors propose a multi-task learning framework that jointly trains the model with the translation task on bitext data, the masked language modeling task on the source-side monolingual data and the denoising auto-encoding task on the target-side monolingual data (a schematic form of the joint objective is given after this list).
  • The authors are interested in investigating the proposed approach in a scaled setting with more languages and a larger amount of monolingual data.
  • The authors would like to explore the most sample-efficient strategy to add a new language to a trained MNMT system.
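Written schematically, the joint training objective combines the three losses. The weighted-sum form below is one common way to express such multi-task training; the weights are illustrative and are not taken from the paper:

    \mathcal{L}(\theta) =
        \mathcal{L}_{\mathrm{MT}}(\theta;\, \mathcal{D}_{\mathrm{bitext}})
        + \lambda_{1}\, \mathcal{L}_{\mathrm{MLM}}(\theta;\, \mathcal{M}_{\mathrm{src}})
        + \lambda_{2}\, \mathcal{L}_{\mathrm{DAE}}(\theta;\, \mathcal{M}_{\mathrm{tgt}})

where \mathcal{D}_{\mathrm{bitext}} is the parallel data, \mathcal{M}_{\mathrm{src}} and \mathcal{M}_{\mathrm{tgt}} are the source-side and target-side monolingual data, \theta are the shared model parameters, and \lambda_{1}, \lambda_{2} balance the auxiliary tasks.
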
Summary
  • Introduction:

    Multilingual Neural Machine Translation (MNMT), which leverages a single NMT model to handle the translation of multiple languages, has drawn research attention in recent years (Dong et al., 2015; Firat et al., 2016a; Ha et al., 2016; Johnson et al., 2017; Arivazhagan et al., 2019).
  • Dual learning paradigms utilize monolingual data in both the source and target languages (He et al., 2016; Wang et al., 2019; Wu et al., 2019).
  • While these approaches can effectively improve NMT performance, they have two limitations.
  • Back translation requires a good baseline model with adequate bitext data to start from, which limits its efficiency in low-resource settings.
Tables
  • Table 1: BLEU scores of 10 languages → English translation with bilingual, X→En and X→X systems. The languages are arranged from high-resource (left) to low-resource (right)
  • Table 2: BLEU scores of English → 10 languages translation with bilingual, En→X and X→X systems. The languages are arranged from high-resource (left) to low-resource (right)
  • Table 3: Zero-shot translation performance on high-resource language pairs
  • Table 4: Zero-shot translation performance on low-resource language pairs
  • Table 5: Comparison of different multi-task learning objectives on De-En and Tr-En translation. BLEU scores are reported on the full individual validation sets
  • Table 6: BLEU scores of the dynamic noising strategy on the X→En translation system with the large-scale monolingual data setting, on validation sets
  • Table 7: Evaluation on the XNLI task. The XLM-Roberta results are our reproduction; the Massively Multilingual Translation Encoder (MMTE) results are reported from Siddhant et al. (2020)
  • Table 8: Evaluation on the XGLUE NER task. The XLM-Roberta results are our reproduction
  • Table 9: Statistics of the parallel resources from WMT: the 10 languages ranked by the size of the bitext corpus translating to/from English
Funding
  • We can see that introducing MLM or DAE can both effectively improve the performance of multilingual systems, and the combination of both yields the best performance
  • For the dynamic noising ratio, we set the masking ratio for MLM to increase from 10% to 20% and the blanking ratio for DAE to increase from 20% to 40% (see the sketch after this list)
  • We show that the proposed MTL approach can effectively improve the performance of MNMT on both high-resource and low-resource languages by a large margin, and can also significantly improve the translation quality for zero-shot language pairs without bitext training data
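A minimal sketch of that dynamic noising schedule is shown below, assuming a simple linear ramp over a fixed number of training steps; the step counts and the function name are assumptions.

    def noising_ratio(step, total_steps, start, end):
        """Linearly increase a noising ratio from `start` to `end` over training."""
        progress = min(max(step / float(total_steps), 0.0), 1.0)
        return start + (end - start) * progress

    # Ratios from the paper: MLM masking 10% -> 20%, DAE blanking 20% -> 40%.
    mlm_ratio = noising_ratio(step=50_000, total_steps=100_000, start=0.10, end=0.20)  # 0.15
    dae_ratio = noising_ratio(step=50_000, total_steps=100_000, start=0.20, end=0.40)  # 0.30
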
Study subjects and analysis
language pairs: 10
In this work, we propose a multi-task learning (MTL) framework that jointly trains the model with the translation task on bitext data and two denoising tasks on the monolingual data. We conduct extensive empirical studies on MNMT systems with 10 language pairs from WMT datasets. We show that the proposed approach can effectively improve the translation quality for both high-resource and low-resource languages by a large margin, achieving significantly better results than the individual bilingual models

language pairs: 10
To encourage the model to keep learning from the large-scale monolingual data, we adopt a dynamic noising ratio for the denoising objectives to gradually increase the difficulty level of the tasks. We evaluate the proposed approach on a large-scale multilingual setup with 10 language pairs from the WMT datasets. We study three English-centric multilingual systems, including many-to-English, English-to-many, and many-to-many
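Multilingual systems of this kind commonly mark the desired output language with a special token prepended to the source sentence (Johnson et al., 2017). The snippet below illustrates that convention for the three English-centric setups; the tag format and the function name are assumptions rather than the paper's exact preprocessing.

    def tag_source(src_sentence, tgt_lang):
        """Prepend a target-language token so one model can serve many directions."""
        return f"<2{tgt_lang}> {src_sentence}"

    # Many-to-English (X -> En): every target is English.
    print(tag_source("Wie geht es dir?", "en"))   # <2en> Wie geht es dir?
    # English-to-many (En -> X): the tag selects the output language.
    print(tag_source("How are you?", "de"))       # <2de> How are you?
    # Many-to-many (X -> X): any source can pair with any supported target tag.
    print(tag_source("Comment ça va ?", "cs"))    # <2cs> Comment ça va ?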

Reference
  • Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 1538–1548.
  • Graeme Blackwood, Miguel Ballesteros, and Todd Ward. 2018. Multilingual neural machine translation with task-specific attention. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3112–3122.
  • Rich Caruana. 1997. Multitask learning. Machine learning, pages 41–75.
  • Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167.
  • Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8440–8451. Association for Computational Linguistics.
  • Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7059–7069.
  • Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics.
  • Adria De Gispert and Jose B Marino. 2006. Catalan-English statistical machine translation without parallel corpus: Bridging through Spanish. In Proc. of 5th International Conference on Language Resources and Evaluation (LREC), pages 65–68. Citeseer.
  • Li Deng, Geoffrey Hinton, and Brian Kingsbury. 2013. New types of deep neural network learning for speech recognition and related applications: An overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.
  • Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1723–1732.
  • Jeffrey L Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition, pages 71–99.
  • Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 72–78.
  • Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875.
  • Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T Yarman Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277.
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, pages 1243–1252. JMLR. org.
  • Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor OK Li. 2018. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 344–354.
  • Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798.
  • Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.
  • Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, pages 339–351.
  • Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations.
  • Eliyahu Kiperwasser and Miguel Ballesteros. 2018. Scheduled multi-task learning: From syntax to translation. Transactions of the Association for Computational Linguistics, pages 225–240.
  • Toan Q Nguyen and David Chiang. 2017. Transfer learning across low-resource, related languages for neural machine translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, pages 296–301.
  • Jan Niehues and Eunah Cho. 2017. Exploiting linguistic resources for neural machine translation using multi-task learning. In Proceedings of the Second Conference on Machine Translation, pages 80–89.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
  • Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
  • Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.
  • Matt Post. 2018. A call for clarity in reporting bleu scores. In Conference on Machine Translation.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
  • Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039–5049.
  • Sukanta Sen, Kamal Kumar Gupta, Asif Ekbal, and Pushpak Bhattacharyya. 2019. Multilingual unsupervised NMT using shared encoder and language-specific decoders. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3083–3089.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Rangan Majumder, and Ming Zhou. 2020. Xglue: A new benchmark dataset for cross-lingual pre-training, understanding and generation. arXiv, abs/2004.01401.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 86–96.
  • Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Ari, Jason Riesa, Ankur Bapna, Orhan Firat, and Karthik Raman. 2020. Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8854–8861. AAAI Press.
  • Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210.
  • Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826.
  • Brian Thompson, Jeremy Gwinnup, Huda Khayrallah, Kevin Duh, and Philipp Koehn. 2019. Overcoming catastrophic forgetting during domain adaptation of neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2062–2068.
  • Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 484–491.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.
  • Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103.
  • Yiren Wang, Yingce Xia, Tianyu He, Fei Tian, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019. Multiagent dual learning. In International Conference on Learning Representations.
  • Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzman, Armand Joulin, and Edouard Grave. 2019. Ccnet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.
  • Lijun Wu, Yiren Wang, Yingce Xia, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2019. Exploiting monolingual data at scale for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4198–4207.
  • Poorya Zaremoodi and Gholamreza Haffari. 2018. Neural machine translation for bilingually scarce scenarios: a deep multi-task learning approach. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1535–1545.
  • Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575.