Cross-lingual Retrieval for Iterative Self-Supervised Training

NeurIPS 2020.


Abstract:

Recent studies have demonstrated the cross-lingual alignment ability of multilingual pretrained language models. In this work, we found that the cross-lingual alignment can be further improved by training seq2seq models on sentence pairs mined using their own encoder outputs. We utilized these findings to develop a new approach -- cross-lingual retrieval for iterative self-supervised training (CRISS), where mining and training processes are applied iteratively, improving cross-lingual alignment and translation ability at the same time.
Introduction
  • Pretraining has demonstrated success in various natural language processing (NLP) tasks.
  • CRISS is developed based on the finding that the encoder outputs of a multilingual denoising autoencoder can be used as language-agnostic representations to retrieve parallel sentence pairs, and that training the model on these retrieved sentence pairs further improves its sentence retrieval and translation capabilities in an iterative manner.
  • Using only unlabeled data from many different languages, CRISS iteratively mines for parallel sentences across languages, trains an improved multilingual model on these mined sentence pairs, mines again for better parallel sentences, and repeats.
  • The authors' main contributions are summarized in the Highlights below.
Highlights
  • Pretraining has demonstrated success in various natural language processing (NLP) tasks
  • Cross-lingual retrieval for iterative self-supervised training (CRISS) is developed based on the finding that the encoder outputs of a multilingual denoising autoencoder can be used as language-agnostic representations to retrieve parallel sentence pairs, and that training the model on these retrieved sentence pairs further improves its sentence retrieval and translation capabilities in an iterative manner (a minimal sketch of this loop follows this list)
  • We significantly outperform the previous state of the art on unsupervised machine translation and sentence retrieval
  • We introduced a new self-supervised training approach that iteratively combines mining and multilingual training procedures to achieve state-of-the-art performance in unsupervised machine translation and sentence retrieval
  • Future work should explore (1) a thorough analysis and theoretical understanding of how the language-agnostic representation arises from denoising pretraining, (2) whether the same approach can be extended to pretrain models for non-seq2seq applications, e.g. unsupervised structural discovery and alignment, and (3) whether and how the learned cross-lingual representation can be applied to other NLP and non-NLP tasks
  • Our work advances the state-of-the-art in unsupervised machine translation
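The iterative mine-and-train procedure described above can be sketched as a simple loop. This is a minimal sketch, not the authors' implementation: the encode, mine, and train callables are hypothetical placeholders for the corresponding stages of CRISS.

```python
from typing import Callable, Dict, List


def criss_style_loop(
    model,
    corpora: Dict[str, List[str]],   # monolingual sentences per language
    encode: Callable,                # (model, sentences) -> sentence embeddings
    mine: Callable,                  # (embeddings, corpora) -> mined sentence pairs
    train: Callable,                 # (model, pairs) -> updated seq2seq model
    iterations: int = 3,
):
    """Mine pseudo-parallel pairs with the current encoder, retrain, repeat."""
    for _ in range(iterations):
        # 1. Embed all monolingual sentences with the current encoder.
        embeddings = {lang: encode(model, sents) for lang, sents in corpora.items()}
        # 2. Retrieve cross-lingual pairs whose embeddings are nearest neighbours.
        pairs = mine(embeddings, corpora)
        # 3. Fine-tune the multilingual seq2seq model on the mined pairs; a better
        #    encoder then yields better retrieval in the next iteration.
        model = train(model, pairs)
    return model
```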
Results
  • The authors found that the cross-lingual alignment can be further improved by training seq2seq models on sentence pairs mined using their own encoder outputs
  • The authors utilized these findings to develop a new approach -- cross-lingual retrieval for iterative self-supervised training (CRISS), where mining and training processes are applied iteratively, improving cross-lingual alignment and translation ability at the same time (a sketch of the mining score follows this list).
  • The authors' approach even beats the state-of-the-art supervised approach [2] in Kazakh, improving accuracy from 17.39% to 77.9%
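The mining step referenced above scores candidate pairs over encoder outputs with a margin criterion in the spirit of the margin-based mining of [1]. Below is a minimal numpy sketch under that assumption; the function and argument names are illustrative and not taken from the paper's code.

```python
import numpy as np


def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin scores between two sets of sentence embeddings:
    cos(x, y) divided by the average cosine of x and y to their k nearest
    neighbours on the other side (criterion of [1])."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    cos = src @ tgt.T                                   # pairwise cosine similarities
    fwd = np.sort(cos, axis=1)[:, -k:].mean(axis=1)     # avg sim of each source to its k NNs
    bwd = np.sort(cos, axis=0)[-k:, :].mean(axis=0)     # avg sim of each target to its k NNs
    return cos / ((fwd[:, None] + bwd[None, :]) / 2.0)  # ratio margin

# A candidate pair (i, j) is typically kept when it is the mutual argmax of the
# score matrix and its score clears a chosen threshold.
```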
Conclusion
  • The authors introduced a new self-supervised training approach that iteratively combines mining and multilingual training procedures to achieve state-of-the-art performance in unsupervised machine translation and sentence retrieval.
  • For languages where labelled parallel data is hard to obtain, training methods that better utilize unlabeled data are key to unlocking better translation quality.
  • This technique can help remove language barriers around the world, especially for communities speaking low-resource languages.
Tables
  • Table 1: Unsupervised machine translation. CRISS outperforms others in 9 out of 10 directions
  • Table 2: Sentence retrieval accuracy on Tatoeba; XLM-R results are the previous SOTA (except zh, where mBERT is the SOTA). LASER is a supervised method listed for reference (a sketch of how this accuracy is computed follows this list)
  • Table 3: Supervised machine translation downstream task
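Sentence retrieval accuracy as reported in Table 2 is commonly computed as top-1 nearest-neighbour accuracy over cosine similarities between aligned source and target embeddings. A minimal sketch under that assumption, with illustrative names:

```python
import numpy as np


def retrieval_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target embedding is the gold
    translation, assuming row i of src_emb is aligned with row i of tgt_emb."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    predictions = (src @ tgt.T).argmax(axis=1)   # nearest target index per source
    return float((predictions == np.arange(len(src))).mean())
```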
Related work
  • Emergent Cross-Lingual Alignment: On understanding cross-lingual alignment in pretrained models, [38] and [24] present empirical evidence that a cross-lingual alignment structure emerges in encoders trained on multiple languages with a shared masked language modeling task. Analysis in [35] shows that a shared subword vocabulary has a negligible effect, while model depth matters more for cross-lingual transferability. In English language modeling, retrieval-based data augmentation has been explored by [13] and [9]. Our work combines this idea with the emergent cross-lingual alignment to retrieve sentences in another language instead of paraphrases in the same language.

    Multilingual Pretraining Methods: With large amounts of unlabeled data, various self-supervised pretraining approaches have been proposed to initialize models, or parts of models, for downstream tasks such as machine translation, classification, and inference [6, 23, 26, 40, 20, 27, 32, 7, 28, 17, 19]. Recently these approaches have been extended from single-language training to cross-lingual training [34, 15, 4, 19]. In the supervised machine learning literature, data augmentation [39, 14, 1, 31] has been applied to improve learning performance. To the best of our knowledge, little work has explored self-supervised data augmentation for pretraining. This work pretrains a multilingual model with a self-supervised data augmentation procedure, iteratively exploiting the emergent cross-lingual representation alignment discovered by the model itself (a minimal retrieval sketch follows).
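At mining scale, exhaustive pairwise scoring is replaced by batched nearest-neighbour search; FAISS [12] is the library cited in the references for this kind of search. A minimal usage sketch, assuming embeddings are already L2-normalized so that inner product equals cosine similarity (sizes and array names are illustrative):

```python
import faiss
import numpy as np

dim = 1024
rng = np.random.default_rng(0)

# Candidate (target-language) embeddings, L2-normalized so inner product = cosine.
tgt_emb = rng.standard_normal((100_000, dim)).astype("float32")
tgt_emb /= np.linalg.norm(tgt_emb, axis=1, keepdims=True)

index = faiss.IndexFlatIP(dim)   # exact inner-product (cosine) search
index.add(tgt_emb)

# Query (source-language) embeddings.
src_emb = rng.standard_normal((8, dim)).astype("float32")
src_emb /= np.linalg.norm(src_emb, axis=1, keepdims=True)

scores, neighbors = index.search(src_emb, 4)   # top-4 candidate translations per query
```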
Reference
  • Mikel Artetxe and Holger Schwenk. Margin-based parallel corpus mining with multilingual sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3197–3203, 2019.
  • Mikel Artetxe and Holger Schwenk. Massively multilingual sentence embeddings for zero-shot crosslingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610, 2019.
  • Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro Yoshino, and Christian Federmann. Overview of the IWSLT 2017 evaluation campaign. In International Workshop on Spoken Language Translation, pages 2–14, 2017.
  • Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.
  • Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067, 2019.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197, 2019.
  • Xavier Garcia, Pierre Foret, Thibault Sellam, and Ankur P Parikh. A multilingual view of unsupervised machine translation. arXiv preprint arXiv:2002.02955, 2020.
  • Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.
  • Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc’Aurelio Ranzato. The flores evaluation datasets for low-resource machine translation: Nepali–english and sinhala–english. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6100–6113, 2019.
  • Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080, 2020.
  • Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 2019.
  • Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.
  • Huda Khayrallah, Hainan Xu, and Philipp Koehn. The JHU parallel corpus filtering systems for WMT 2018. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 909–912, Belgium, Brussels, October 2018. Association for Computational Linguistics.
  • Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.
  • Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  • Zuchao Li, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Zhuosheng Zhang, and Hai Zhao. Data-dependent Gaussian prior objective for language generation. In International Conference on Learning Representations, 2019.
  • Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210, 2020.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, 2019.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
  • Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.
  • Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual bert? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, 2019.
  • Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, 2018.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
  • Shuo Ren, Yu Wu, Shujie Liu, Ming Zhou, and Shuai Ma. Explicit cross-lingual pre-training for unsupervised machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 770–779, 2019.
  • Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, and Armand Joulin. CCMatrix: Mining billions of high-quality parallel sentences on the web. arXiv preprint arXiv:1911.04944, 2019.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany, August 2016. Association for Computational Linguistics.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mass: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pages 5926–5936, 2019.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, 2017.
  • Takashi Wada and Tomoharu Iwata. Unsupervised cross-lingual word embedding by multilingual neural language models. CoRR, abs/1809.02306, 2018.
  • Zihan Wang, Stephen Mayhew, Dan Roth, et al. Cross-lingual ability of multilingual BERT: An empirical study. arXiv preprint arXiv:1912.07840, 2019.
  • Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzman, Armand Joulin, and Edouard Grave. Ccnet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359, 2019.
  • Jiawei Wu, Xin Wang, and William Yang Wang. Extract and edit: An alternative to back-translation for unsupervised neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1173–1183, 2019.
  • Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. Emerging cross-lingual structure in pretrained language models. arXiv preprint arXiv:1911.01464, 2019.
  • Hainan Xu and Philipp Koehn. Zipporah: a fast and scalable data cleaning system for noisy web-crawled parallel corpora. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2945–2950, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754–5764, 2019.