We aim to identify architectural properties of BERT and linguistic properties of languages that are necessary for BERT to become multilingual.

Identifying Elements Essential for BERT’s Multilinguality

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4423–4437, 2020.


Abstract

It has been shown that multilingual BERT (mBERT) yields high quality multilingual representations and enables effective zero-shot transfer. This is surprising given that mBERT does not use any crosslingual signal during training. While recent literature has studied this phenomenon, the reasons for the multilinguality are still somewhat obscure.

Introduction
  • Multilingual models, i.e., models capable of processing more than one language with comparable performance, are central to natural language processing.
  • They are useful as fewer models need to be maintained to serve many languages, resource requirements are reduced, and low- and mid-resource languages can benefit from crosslingual transfer.
  • Multilingual models are useful in machine translation, zero-shot task transfer and typological research.
Highlights
  • Multilingual models, i.e., models capable of processing more than one language with comparable performance, are central to natural language processing
  • Multilingual models are useful in machine translation, zero-shot task transfer and typological research
  • We show that having identical structure across languages but inverted word order in one language destroys multilinguality
  • We evaluate two properties of our trained language models: the degree of multilinguality and – as a consistency check – the overall model fit
  • The main takeaways are: i) shared position embeddings, shared special tokens, replacing masked tokens with random tokens, and a limited number of parameters are necessary elements for multilinguality; ii) word order is relevant: BERT is not multilingual when one language has inverted word order; iii) the comparability of training corpora contributes to multilinguality
  • Perplexity is computed on 15% of randomly selected tokens that are replaced by “[MASK]” (a minimal sketch of this computation follows this list)
  • We consider a fully unsupervised setting without any crosslingual signals
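
The perplexity check mentioned above can be illustrated with a short script: mask 15% of tokens at random and exponentiate the mean cross-entropy on the masked positions. This is a minimal sketch using the Hugging Face transformers API, not the authors' evaluation code; the model name and the toy corpus are placeholders.

```python
# Minimal sketch of masked-LM perplexity: mask 15% of tokens at random and
# exponentiate the mean cross-entropy on the masked positions.
# Illustration only; model name and sentences are placeholders.
import math
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")
model.eval()

sentences = ["The quick brown fox jumps over the lazy dog."]  # placeholder corpus

total_loss, total_masked = 0.0, 0
with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        input_ids = enc["input_ids"].clone()
        labels = input_ids.clone()

        # Pick 15% of the non-special tokens and replace them with [MASK].
        special = torch.tensor(
            tokenizer.get_special_tokens_mask(
                input_ids[0].tolist(), already_has_special_tokens=True
            ),
            dtype=torch.bool,
        )
        candidates = (~special).nonzero(as_tuple=True)[0]
        n_mask = max(1, int(0.15 * len(candidates)))
        masked_pos = candidates[torch.randperm(len(candidates))[:n_mask]]
        input_ids[0, masked_pos] = tokenizer.mask_token_id

        # Compute cross-entropy only on the masked positions (-100 = ignore).
        keep = torch.zeros_like(labels, dtype=torch.bool)
        keep[0, masked_pos] = True
        labels[~keep] = -100

        out = model(input_ids=input_ids,
                    attention_mask=enc["attention_mask"],
                    labels=labels)
        total_loss += out.loss.item() * n_mask
        total_masked += n_mask

print("masked-LM perplexity:", math.exp(total_loss / total_masked))
```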
Results
  • Table 1 shows results.
  • Each model has an associated ID that is consistent with the code.
  • The original model (ID 0) shows a high degree of multilinguality.
  • Alignment is an easy task with shared position embeddings yielding F1 = 1.00.
  • Retrieval works better with contextualized representations on layer 8 (.97 vs .16), whereas word translation works better on layer 0 (.88 vs .79), as expected (a retrieval sketch follows this list).
  • The untrained BERT models perform poorly (IDs 18, 19), except for word alignment with shared position embeddings
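
The retrieval numbers above compare representations from different layers (layer 0 being the embedding output, layer 8 a deeper transformer layer). A rough sketch of such an evaluation is shown below: mean-pool the hidden states of a chosen layer and retrieve the nearest cross-lingual neighbour by cosine similarity. This is illustrative only; the model name and the toy sentence pairs are placeholders, not the paper's setup.

```python
# Rough sketch of nearest-neighbour sentence retrieval from a chosen hidden layer.
# Illustrative only; model name and toy sentence pairs are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)
model.eval()

def embed(sentences, layer=8):
    """Mean-pool the hidden states of `layer` over non-padding tokens."""
    enc = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer]   # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)        # (batch, seq, 1)
    return (hidden * mask).sum(1) / mask.sum(1)       # (batch, dim)

src = ["The cat sits on the mat.", "I like green tea."]          # language A
tgt = ["Die Katze sitzt auf der Matte.", "Ich mag grünen Tee."]  # language B

src_emb = torch.nn.functional.normalize(embed(src), dim=-1)
tgt_emb = torch.nn.functional.normalize(embed(tgt), dim=-1)
nearest = (src_emb @ tgt_emb.T).argmax(dim=-1)        # cosine retrieval
accuracy = (nearest == torch.arange(len(src))).float().mean()
print("retrieval accuracy:", accuracy.item())
```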
Conclusion
  • The authors investigated which architectural and linguistic properties are essential for BERT to yield crosslingual representations.
  • Word order is relevant: BERT is not multilingual when one language has inverted word order (see the sketch after this list).
  • The authors experimented with a simple modification to obtain stronger multilinguality in BERT models and demonstrated its effectiveness on XNLI.
  • The authors considered a fully unsupervised setting without any crosslingual signals.
  • In future work, the authors plan to incorporate crosslingual signals, as Vulić et al. (2019) argue that a fully unsupervised setting is hard to motivate.
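
The word-order finding refers to experiments with a synthetic "fake" copy of English. A hedged sketch of how such a corpus can be constructed is given below; the token-prefixing scheme used to simulate a disjoint vocabulary is an assumption for illustration, not the authors' exact construction.

```python
# Hedged sketch of building a synthetic "fake" language corpus: the fake language
# mirrors English but uses a disjoint vocabulary (simulated here by a token
# prefix, which is an assumption for illustration) and, in the inv-order variant,
# reversed word order.
def make_fake_corpus(sentences, invert_order=False, prefix="fake_"):
    """Return a parallel fake-language corpus for a list of tokenized sentences."""
    fake = []
    for tokens in sentences:
        new_tokens = [prefix + t for t in tokens]   # disjoint "vocabulary"
        if invert_order:
            new_tokens = new_tokens[::-1]           # inverted word order
        fake.append(new_tokens)
    return fake

english = [["the", "cat", "sat", "on", "the", "mat"]]
print(make_fake_corpus(english))                     # same order, shifted vocab
print(make_fake_corpus(english, invert_order=True))  # inv-order variant
```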
Summary
  • Introduction:

    Multilingual models, i.e., models capable of processing more than one language with comparable performance, are central to natural language processing.
  • They are useful as fewer models need to be maintained to serve many languages, resource requirements are reduced, and low- and mid-resource languages can benefit from crosslingual transfer.
  • Multilingual models are useful in machine translation, zero-shot task transfer and typological research.
  • Objectives:

    The authors aim to identify architectural properties of BERT and linguistic properties of languages that are necessary for BERT to become multilingual.
  • Results:

    Table 1 shows results.
  • Each model has an associated ID that is consistent with the code.
  • The original model (ID 0) shows a high degree of multilinguality.
  • Alignment is an easy task with shared position embeddings yielding F1 = 1.00.
  • Retrieval works better with contextualized representations on layer 8 (.97 vs .16) whereas word translation works better on layer 0 (.88 vs .79), as expected.
  • The untrained BERT models perform poorly (IDs 18, 19), except for word alignment with shared position embeddings
  • Conclusion:

    The authors investigated which architectural and linguistic properties are essential for BERT to yield crosslingual representations.
  • Word order is relevant: BERT is not multilingual when one language has inverted word order.
  • The authors experimented with a simple modification to obtain stronger multilinguality in BERT models and demonstrated its effectiveness on XNLI.
  • The authors considered a fully unsupervised setting without any crosslingual signals.
  • In future work, the authors plan to incorporate crosslingual signals, as Vulić et al. (2019) argue that a fully unsupervised setting is hard to motivate.
Tables
  • Table1: Multilinguality and model fit for our models. Mean and standard deviation (subscript) across 5 different random seeds are shown. ID is a unique identifier for the model setting. To put perplexities into perspective: the pretrained mBERT has a perplexity of roughly 46 on train and dev. knn-replace is explained in §4
  • Table2: Results showing the effect of having a parallel vs. non-parallel training corpus
  • Table3: Accuracy on XNLI test for different model settings. Shown are the mean and standard deviation (subscript) across three random seeds. All models have the same architecture as BERT-base, are pretrained on Wikipedia data and finetuned on English XNLI training data. mBERT was pretrained longer and on much more data and thus has higher performance. Best non-mBERT performance in bold
  • Table4: Kendall’s Tau word order metric and XNLI zero-shot accuracies (a sketch of this metric follows the table list)
  • Table5: Runtime on a single GPU
  • Table6: Number of parameters for our used models
  • Table7: Overview on datasets
  • Table8: Overview on third party systems used
  • Table9: Model and training parameters during pretraining
  • Table10: Even when training is continued for a long time, overparameterized models with architectural modifications do not become multilingual
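
Table 4 refers to a Kendall's Tau word order metric, in the spirit of the reordering metrics of Birch and Osborne (2011). The sketch below shows one rough way to compute such a score from a word alignment; the exact formulation used in the paper may differ, and the toy alignment is made up.

```python
# Rough sketch of a Kendall's tau word-order score computed from a word
# alignment: correlate source positions with their aligned target positions.
# Illustration of the general reordering metric, not necessarily the exact
# formulation behind Table 4; the toy alignment below is made up.
from scipy.stats import kendalltau

# alignment: list of (source_position, target_position) pairs
alignment = [(0, 0), (1, 2), (2, 1), (3, 3)]

src_pos = [s for s, _ in alignment]
tgt_pos = [t for _, t in alignment]

tau, _ = kendalltau(src_pos, tgt_pos)
print("Kendall's tau word-order score:", tau)  # 1.0 = same order, -1.0 = fully inverted
```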
Related work
  • There is a range of prior work analyzing the reasons for BERT’s multilinguality. Singh et al. (2019) show that BERT stores language representations in different subspaces and investigate how subword tokenization influences multilinguality. Artetxe et al. (2020) show that neither a shared vocabulary nor joint pretraining is essential for multilinguality. K et al. (2020) extensively study reasons for multilinguality (e.g., researching depth, number of parameters and attention heads). They conclude that depth is essential. They also investigate language properties and conclude that structural similarity across languages is important, without further defining this term. Last, Conneau et al. (2020b) find that a shared vocabulary is not required. They find that shared parameters in the top layers are required for multilinguality. Further, they show that different monolingual BERT models exhibit a similar structure and thus conclude that mBERT somehow aligns those isomorphic spaces. They investigate having separate embedding look-ups per language (including position embeddings and special tokens) and a variant of avoiding cross-language replacements. Their method “extra anchors” yields a higher degree of multilinguality. In contrast to this prior work, we investigate multilinguality in a clean laboratory setting, investigate the interaction of architectural aspects, and research new aspects such as overparameterization or inv-order.
Funding
  • We gratefully acknowledge funding through a Zentrum Digitalisierung.Bayern fellowship awarded to the first author
  • This work was supported by the European Research Council (# 740516)
Reference
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798, Melbourne, Australia. Association for Computational Linguistics.
  • Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.
  • Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
  • Alexandra Birch and Miles Osborne. 2011. Reordering metrics for MT. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1027–1035, Portland, Oregon, USA. Association for Computational Linguistics.
  • Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Multilingual alignment of contextual word representations. In International Conference on Learning Representations.
  • Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2020. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. arXiv preprint arXiv:2007.07834.
  • Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020a. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
  • Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.
  • Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
  • Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2020b. Emerging cross-lingual structure in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6022–6034, Online. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471, Gothenburg, Sweden. Association for Computational Linguistics.
  • Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual models for compositional distributed semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 58–68, Baltimore, Maryland. Association for Computational Linguistics.
  • John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
  • Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.
  • Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. 2019. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2485–2494, Hong Kong, China. Association for Computational Linguistics.
  • Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In International Conference on Learning Representations.
  • Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers. arXiv preprint arXiv:2005.00633.
  • Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2019. How language-neutral is multilingual BERT? arXiv preprint arXiv:1911.03310.
  • Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 63–70, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
  • Thomas Mayer and Michael Cysouw. 2014. Creating a massively parallel Bible corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 3158–3163, Reykjavik, Iceland. European Language Resources Association (ELRA).
  • Samuel Rönnqvist, Jenna Kanerva, Tapio Salakoski, and Filip Ginter. 2019. Is multilingual BERT fluent in language generation? In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pages 29–36, Turku, Finland. Linköping University Electronic Press.
  • Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65(1):569–630.
  • Masoud Jalili Sabet, Philipp Dufter, and Hinrich Schütze. 2020. SimAlign: High quality word alignments without parallel training data using static and contextualized embeddings. arXiv preprint arXiv:2004.08728.
  • Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
  • Jasdeep Singh, Bryan McCann, Richard Socher, and Caiming Xiong. 2019. BERT is not an interlingua and the bias of tokenization. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 47–55, Hong Kong, China. Association for Computational Linguistics.
  • Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
  • Phoebe Mulcaire, Jungo Kasai, and Noah A. Smith. 2019. Polyglot contextual representations improve crosslingual transfer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3912–3918, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. arXiv preprint arXiv:2005.00052.
  • Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.
  • Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
  • Ivan Vulić, Goran Glavaš, Roi Reichart, and Anna Korhonen. 2019. Do we really need fully unsupervised cross-lingual embeddings? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4407–4418, Hong Kong, China. Association for Computational Linguistics.
  • Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.
  • Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130, Online. Association for Computational Linguistics.
  • Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1656–1671, Online. Association for Computational Linguistics.
  • We use the training data to train static word embeddings for each language using the tool fastText. Subsequently we use VecMap (Artetxe et al., 2018) to map the embedding spaces from each language into the English embedding space, thus creating a multilingual static embedding space. We use VecMap without any supervision (a minimal sketch of this pipeline follows).
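
The note above describes the static-embedding pipeline: monolingual fastText vectors mapped into a shared space with unsupervised VecMap. Below is a minimal sketch of the fastText step with the fasttext Python package; the corpus paths are placeholders, and the VecMap invocation in the final comment is an assumed command-line call (check the VecMap README for the exact usage).

```python
# Minimal sketch of the static-embedding pipeline described in the note above:
# train monolingual fastText skip-gram vectors per language, export them in
# word2vec text format, then map them with VecMap in unsupervised mode.
# Corpus paths are placeholders.
import fasttext

def train_and_export(corpus_path, out_path, dim=300):
    """Train skip-gram fastText vectors and write them in word2vec text format."""
    model = fasttext.train_unsupervised(corpus_path, model="skipgram", dim=dim)
    words = model.get_words()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{len(words)} {dim}\n")
        for w in words:
            vec = " ".join(f"{x:.4f}" for x in model.get_word_vector(w))
            f.write(f"{w} {vec}\n")

train_and_export("train.en.txt", "en.vec")   # placeholder corpora
train_and_export("train.xx.txt", "xx.vec")

# Then map into a shared space with VecMap, e.g. (assumed invocation):
#   python3 map_embeddings.py --unsupervised xx.vec en.vec xx.mapped.vec en.mapped.vec
```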
Author
Philipp Dufter