Unsupervised Cross-lingual Representation Learning at Scale

ACL, pp. 8440-8451, 2020.


Abstract:

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks.

Introduction
  • The goal of this paper is to improve cross-lingual language understanding (XLU), by carefully studying the effects of training unsupervised cross-lingual representations at a very large scale.
  • Multilingual masked language models (MLM) like mBERT (Devlin et al, 2018) and XLM (Lample and Conneau, 2019) have pushed the state of the art on cross-lingual understanding tasks by jointly pretraining large Transformer models (Vaswani et al, 2017) on many languages
  • These models allow for effective cross-lingual transfer, as seen in a number of benchmarks including cross-lingual natural language inference (Bowman et al, 2015; Williams et al, 2017; Conneau et al, 2018), question answering (Rajpurkar et al, 2016; Lewis et al, 2019), and named entity recognition (Pires et al, 2019; Wu and Dredze, 2019).
  • This remains an important limitation for future XLU systems that may aim to improve performance with more modest computational budgets; a minimal sketch of the zero-shot transfer setup used by the benchmarks above follows this list
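The benchmarks listed above are typically evaluated in the zero-shot cross-lingual transfer setting: the pretrained multilingual encoder is fine-tuned on English training data only and then applied directly to the other languages. The snippet below is a minimal sketch of that setup with the Hugging Face transformers library; the `xlm-roberta-base` checkpoint name, the 3-way NLI head and the German test example are illustrative assumptions rather than the authors' exact pipeline.

```python
# Minimal zero-shot cross-lingual NLI sketch (illustrative; not the authors' code).
# Assumes the publicly distributed `xlm-roberta-base` checkpoint and a 3-class NLI head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3  # entailment / neutral / contradiction
)

# Step 1: fine-tune on English premise/hypothesis pairs only (training loop omitted here).
# Step 2: evaluate directly on non-English examples, with no target-language training data.
premise = "Der Hund schläft auf dem Sofa."   # German input seen only at test time
hypothesis = "Ein Tier ruht sich aus."
batch = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)

with torch.no_grad():
    probs = model(**batch).logits.softmax(dim=-1)
print(probs)  # class probabilities (meaningful only after the fine-tuning step)
```

The translate-train-all variant mentioned in Table 1 uses the same model but fine-tunes on training sets from multiple languages instead of English alone.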
Highlights
  • The goal of this paper is to improve cross-lingual language understanding (XLU), by carefully studying the effects of training unsupervised cross-lingual representations at a very large scale
  • We present XLM-R, a Transformer-based multilingual masked language model pretrained on text in 100 languages, which obtains state-of-the-art performance on cross-lingual classification, sequence labeling and question answering
  • Inspired by RoBERTa, we show that multilingual BERT and XLM are undertuned, and that simple improvements in the learning procedure of unsupervised masked language models lead to much better performance
  • Performance on downstream tasks continues to improve even after validation perplexity has plateaued. Combining this observation with our implementation of the unsupervised XLM masked language modeling (MLM) objective, we were able to improve the performance of Lample and Conneau (2019) from 71.3% to more than 75% average accuracy on XNLI, which was on par with their supervised translation language modeling (TLM) objective; a minimal sketch of MLM-style token masking follows this list
  • We introduced XLM-R, our new state of the art multilingual masked language model trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages
  • We show that it provides strong gains over previous multilingual models like multilingual BERT and XLM on classification, sequence labeling and question answering
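For readers unfamiliar with the masked language modeling objective discussed above, the sketch below shows conventional BERT/XLM-style token masking: 15% of positions are selected for prediction, of which 80% are replaced by the mask token, 10% by a random token and 10% left unchanged. These are the standard values from the MLM literature, used here for illustration; the paper's exact preprocessing is not reproduced.

```python
# Sketch of conventional BERT/XLM-style masking for the MLM objective
# (15% selection with an 80/10/10 mask/random/keep split; illustrative defaults).
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Return (masked_inputs, labels); labels are -100 wherever no prediction is made."""
    input_ids = input_ids.clone()          # do not modify the caller's batch
    labels = input_ids.clone()

    # Select ~15% of positions to predict. In practice, special tokens and padding
    # would be excluded from this selection.
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100               # ignore index for the cross-entropy loss

    # 80% of selected positions are replaced by the mask token.
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[to_mask] = mask_token_id

    # 10% of selected positions get a random token (half of the remaining 20%).
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]

    # The final 10% keep their original token but are still predicted.
    return input_ids, labels
```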
Results
  • The authors perform a comprehensive analysis of multilingual masked language models.
  • Much of the work done on understanding the cross-lingual effectiveness of mBERT or XLM (Pires et al, 2019; Wu and Dredze, 2019; Lewis et al., 2019) has focused on analyzing the performance of fixed pretrained models on downstream tasks.
  • The authors present a comprehensive study of different factors that are important to pretraining large-scale multilingual models.
  • The authors highlight the trade-offs and limitations of these models as they scale to one hundred languages
Conclusion
  • The authors introduced XLM-R, the new state of the art multilingual masked language model trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages (a short usage sketch of the released checkpoints follows this list)
  • The authors show that it provides strong gains over previous multilingual models like mBERT and XLM on classification, sequence labeling and question answering.
  • The authors expose the surprising effectiveness of multilingual models over monolingual models, and show strong improvements on low-resource languages
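As a practical complement to the conclusion, the sketch below queries a released XLM-R checkpoint for masked-token prediction. It assumes the `xlm-roberta-base` model distributed through the Hugging Face transformers hub; this packaging is an observation about the public release, not something described in the paper itself.

```python
# Minimal fill-in-the-blank sketch with a released XLM-R checkpoint
# (assumes the `xlm-roberta-base` model from the Hugging Face hub).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base").eval()

text = "La capitale de la France est <mask>."   # XLM-R uses <mask> as its mask token
batch = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits

# Locate the masked position and read off the five most likely fillers.
mask_pos = (batch["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
top5 = logits[mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```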
Tables
  • Table1: Results on cross-lingual classification. We report the accuracy on each of the 15 XNLI languages and the average accuracy. We specify the dataset D used for pretraining, the number of models #M the approach requires and the number of languages #lg the model handles. Our XLM-R results are averaged over five different seeds. We show that using the translate-train-all approach which leverages training sets from multiple languages, XLM-R obtains a new state of the art on XNLI of 83.6% average accuracy. Results with † are from Huang et al. (2019)
  • Table2: Results on named entity recognition on CoNLL-2002 and CoNLL-2003 (F1 score). Results with † are from Wu and Dredze (2019). Note that mBERT and XLM-R do not use a linear-chain CRF, as opposed to Akbik et al. (2018) and Lample et al. (2016)
  • Table3: Results on MLQA question answering. We report the F1 and EM (exact match) scores for the zero-shot setting, where models are fine-tuned on the English SQuAD dataset and evaluated on the 7 languages of MLQA. Results with † are taken from the original MLQA paper, Lewis et al. (2019)
  • Table4: GLUE dev results. Results with † are from Liu et al. (2019). We compare the performance of XLM-R to BERT-Large, XLNet and RoBERTa on the English GLUE benchmark
  • Table5: Multilingual versus monolingual models (BERT-BASE). We compare the performance of monolingual models (BERT) versus multilingual models (XLM) on seven languages, using a BERT-BASE architecture. We use vocabulary sizes of 40k and 150k for the monolingual and multilingual models, respectively
  • Table6: Languages and statistics of the CC-100 corpus. We report the list of 100 languages and include the number of tokens (in millions) and the size of the data (in GiB) for each language. Note that we also include romanized variants of some non-Latin languages such as Bengali, Hindi, Tamil, Telugu and Urdu
  • Table7: Details on model sizes. We show the tokenization used by each Transformer model, the number of layers $L$, the number of hidden states of the model $H_m$, the dimension of the feed-forward layer $H_{ff}$, the number of attention heads $A$, the size of the vocabulary $V$ and the total number of parameters #params. For Transformer encoders, the number of parameters can be approximated by $4LH_m^2 + 2LH_mH_{ff} + VH_m$. GPT2 numbers are from Radford et al. (2019), mm-NMT models are from the work of Arivazhagan et al. (2019) on massively multilingual neural machine translation (mm-NMT), and T5 numbers are from Raffel et al. (2019). While XLM-R is among the largest models, partly due to its large embedding layer, it has a similar number of parameters to XLM-100 and remains significantly smaller than recently introduced Transformer models for multilingual MT and transfer learning. While this table gives more insight into the differences in capacity between the models, note that it does not highlight other critical differences between them. A short numerical check of the parameter approximation follows the table list
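As a sanity check on the parameter approximation quoted in the Table 7 caption, the short calculation below plugs in hyperparameters commonly reported for the public XLM-R Large release (L=24, Hm=1024, Hff=4096, V≈250k). These values are assumptions taken from the release rather than from the table itself, so treat the result as an order-of-magnitude check only.

```python
# Order-of-magnitude check of the Table 7 approximation
#   params ≈ 4·L·Hm² + 2·L·Hm·Hff + V·Hm
# using hyperparameters commonly reported for XLM-R Large (assumed, not quoted here):
# L=24, Hm=1024, Hff=4096, V≈250,000.
def approx_params(L, H_m, H_ff, V):
    attention_and_projections = 4 * L * H_m ** 2   # QKV and output projections per layer
    feed_forward = 2 * L * H_m * H_ff              # two linear maps in each FFN block
    embeddings = V * H_m                           # token embedding matrix
    return attention_and_projections + feed_forward + embeddings

print(f"{approx_params(24, 1024, 4096, 250_000) / 1e6:.0f}M parameters")
# ≈ 558M, consistent with the ~550M figure usually cited for XLM-R Large.
```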
Related work
  • From pretrained word embeddings (Mikolov et al, 2013b; Pennington et al, 2014) to pretrained contextualized representations (Peters et al, 2018; Schuster et al, 2019) and Transformer-based language models (Radford et al, 2018; Devlin et al, 2018), unsupervised representation learning has significantly improved the state of the art in natural language understanding. Parallel work on cross-lingual understanding (Mikolov et al, 2013a; Schuster et al, 2019; Lample and Conneau, 2019) extends these systems to more languages and to the cross-lingual setting in which a model is learned in one language and applied in other languages.

    Most recently, Devlin et al (2018) and Lample and Conneau (2019) introduced mBERT and XLM, masked language models trained on multiple languages without any cross-lingual supervision. Lample and Conneau (2019) propose translation language modeling (TLM) as a way to leverage parallel data and obtain a new state of the art on the cross-lingual natural language inference (XNLI) benchmark (Conneau et al, 2018). They further show strong improvements on unsupervised machine translation and pretraining for sequence generation. Wu et al (2019) show that monolingual BERT representations are similar across languages, explaining in part the natural emergence of multilinguality in bottleneck architectures. Separately, Pires et al (2019) demonstrated the effectiveness of multilingual models like mBERT on sequence labeling tasks. Huang et al (2019) showed gains over XLM using cross-lingual multi-task learning, and Singh et al (2019) demonstrated the efficiency of cross-lingual data augmentation for cross-lingual NLI. However, all of this work was at a relatively modest scale, in terms of the amount of training data, compared to our approach.

    The benefits of scaling language model pretraining by increasing the size of the model as well as the training data have been extensively studied in the literature. For the monolingual case, Jozefowicz et al (2016) show how large-scale LSTM models can obtain much stronger performance on language modeling benchmarks when trained on billions of tokens. GPT (Radford et al, 2018) also highlights the importance of scaling the amount of data, and RoBERTa (Liu et al, 2019) shows that training BERT longer on more data leads to a significant boost in performance. Inspired by RoBERTa, we show that mBERT and XLM are undertuned, and that simple improvements in the learning procedure of unsupervised MLM lead to much better performance. We train on cleaned CommonCrawl data (Wenzek et al, 2019), which increases the amount of data for low-resource languages by two orders of magnitude on average. Similar data has also been shown to be effective for learning high-quality word embeddings in multiple languages (Grave et al, 2018). A minimal illustration of the language-identification step behind such filtering follows below.
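The CommonCrawl cleaning of Wenzek et al (2019) relies, among other steps, on language identification to route documents into per-language corpora. The sketch below illustrates only that ingredient with the off-the-shelf fastText LID model (`lid.176.bin`, assumed to have been downloaded locally); it is not the CCNet pipeline itself, which additionally deduplicates and filters by language-model perplexity.

```python
# Illustrative language-identification filtering in the spirit of CCNet
# (not the actual pipeline). Assumes the off-the-shelf fastText model
# `lid.176.bin` is available in the working directory.
import fasttext

lid = fasttext.load_model("lid.176.bin")

def route_line(line, threshold=0.5):
    """Return (language, confidence) for a line, or None if the prediction is weak."""
    labels, probs = lid.predict(line.replace("\n", " "))
    lang = labels[0].replace("__label__", "")
    conf = float(probs[0])
    return (lang, conf) if conf >= threshold else None

print(route_line("Ceci est une phrase en français."))  # e.g. ('fr', 0.99)
print(route_line("Esto es una oración en español."))   # e.g. ('es', 0.99)
```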
Reference
  • Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In COLING, pages 1638–1649.
  • Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
  • Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. Xnli: Evaluating crosslingual sentence representations. In EMNLP. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL.
  • Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In LREC.
  • Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. 2019. Unicoder: A universal language encoder by pretraining with multiple cross-lingual tasks. ACL.
  • Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. TACL, 5:339–351.
  • Armand Joulin, Edouard Grave, and Piotr Bojanowski Tomas Mikolov. 2017. Bag of tricks for efficient text classification. EACL 2017, page 427.
  • Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
  • Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In ACL, pages 66–75.
  • Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. EMNLP.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL, pages 260–270, San Diego, California. Association for Computational Linguistics.
  • Guillaume Lample and Alexis Conneau. 2019. Crosslingual language model pretraining. NeurIPS.
  • Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. Mlqa: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543.
  • Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. NAACL.
  • Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual bert? In ACL.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openaiassets/research-covers/languageunsupervised/language understanding paper.pdf.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. ACL.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
  • Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. CoNLL.
  • Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zeroshot dependency parsing. NAACL.
  • Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Arivazhagan, Jason Riesa, Ankur Bapna, Orhan Firat, and Karthik Raman. 2019. Evaluating the crosslingual effectiveness of massively multilingual neural machine translation. AAAI.
  • Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. Proceedings of the 2nd Workshop on Evaluating VectorSpace Representations for NLP.
  • Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Emerging cross-lingual structure in pretrained language models. ACL.
  • Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In EMNLP.
  • Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.
  • Jasdeep Singh, Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2019. Xlda: Cross-lingual data augmentation for natural language inference and question answering. arXiv preprint arXiv:1905.11471.
  • Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642.
  • Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and TieYan Liu. 2019. Multilingual neural machine translation with knowledge distillation. ICLR.
  • Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: languageindependent named entity recognition. In CoNLL, pages 142–147. Association for Computational Linguistics.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.
  • Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzman, Armand Joulin, and Edouard Grave. 2019. Ccnet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.