A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

ACL, pp. 1703-1714, 2020.


Abstract

We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for several mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and dependency parsing tasks.
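
The "language classification, filtering and cleaning" step mentioned in the abstract is how OSCAR is carved out of Common Crawl. As a rough, hedged illustration only (not the authors' actual pipeline), the Python sketch below buckets raw text lines by language with fastText's off-the-shelf language-identification model; the model file name, confidence threshold and example sentences are assumptions made for the sake of the example.

    # Hypothetical sketch of language classification over raw web text, in the
    # spirit of the OSCAR pipeline (not the paper's actual code).
    # Assumes the pre-trained fastText LID model "lid.176.bin" has been downloaded
    # from https://fasttext.cc/docs/en/language-identification.html
    import fasttext

    lid_model = fasttext.load_model("lid.176.bin")  # 176-language identifier

    def classify_line(line, threshold=0.8):
        """Return an ISO language code if the classifier is confident, else None."""
        labels, probs = lid_model.predict(line.replace("\n", " "))
        lang = labels[0].replace("__label__", "")  # e.g. "fi", "ca", "bg"
        return lang if probs[0] >= threshold else None

    # Toy usage: route lines into per-language buckets.
    buckets = {}
    for line in ["Tämä on suomenkielinen lause.", "Aquesta és una frase en català."]:
        lang = classify_line(line)
        if lang is not None:
            buckets.setdefault(lang, []).append(line)
    print({lang: len(lines) for lang, lines in buckets.items()})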

Introduction
  • One of the key elements that has pushed the state of the art considerably in neural NLP in recent years has been the introduction and spread of transfer learning methods to the field.
  • The authors train OSCAR-based and Wikipedia-based ELMo contextualized word embeddings (Peters et al., 2018) for 5 languages: Bulgarian, Catalan, Danish, Finnish and Indonesian; a usage sketch follows this list.
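
A minimal sketch of how such pre-trained monolingual ELMo weights can be consumed downstream, assuming the AllenNLP 0.x ElmoEmbedder API; the options/weights file names are hypothetical placeholders for the released OSCAR or Wikipedia models, and this is not the authors' training code.

    # Minimal usage sketch: contextual word vectors from a trained ELMo checkpoint
    # via AllenNLP's ElmoEmbedder (AllenNLP 0.x API). File names are placeholders.
    from allennlp.commands.elmo import ElmoEmbedder

    embedder = ElmoEmbedder(
        options_file="elmo_fi_oscar_options.json",  # hypothetical path
        weight_file="elmo_fi_oscar_weights.hdf5",   # hypothetical path
    )

    tokens = ["Helsinki", "on", "Suomen", "pääkaupunki", "."]
    layers = embedder.embed_sentence(tokens)  # numpy array: (3 layers, 5 tokens, 1024)
    print(layers.shape)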
Highlights
  • One of the key elements that has pushed the state of the art considerably in neural NLP in recent years has been the introduction and spread of transfer learning methods to the field
  • On the other hand, contextualized word representations and language models have been developed using both feature-based architectures, the most notable examples being Embeddings from Language Models (ELMo) and Flair (Peters et al., 2018; Akbik et al., 2018), and transformer-based architectures that are commonly used in a fine-tuning setting, as is the case of GPT-1 and GPT-2 (Radford et al., 2018, 2019), BERT and its derivatives (Devlin et al., 2018; Liu et al., 2019; Lan et al., 2019), and more recently T5 (Raffel et al., 2019); the feature-based combination is restated in the equation after this list
  • We show that the models using the OSCAR-based ELMo embeddings consistently outperform the Wikipedia-based ones, suggesting that big high-coverage noisy corpora might be better than small high-quality narrow-coverage corpora for training contextualized language representations
  • In Section 4 we examine and describe in detail the model used for our contextualized word representations, as well as the parser and the tagger we chose to evaluate the impact of corpora in the embeddings’ performance in downstream tasks
  • The main goal of this paper is to show the impact of training data on contextualized word representations when applied in particular downstream tasks
  • We see that in every single case the UDPipe 2.0 + ELMoOSCAR result surpasses the UDPipe 2.0 + ELMoWikipedia one, suggesting that the size of the pre-training data plays an important role in downstream task results
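
For context on the "feature-based" use of ELMo referenced above: in Peters et al. (2018), the representations handed to a downstream model (here, UDPipe 2.0) are a task-weighted combination of the biLM's layers. Restating their standard formulation (general ELMo background, not text taken from this paper):

    % Task-specific combination of the L+1 biLM layers for token k (Peters et al., 2018):
    % s^{task} are softmax-normalized scalar weights, \gamma^{task} is a learned scale,
    % and h_{k,j}^{LM} is the hidden state of layer j for token k.
    \mathrm{ELMo}_k^{task} \;=\; \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}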
Results
  • The authors show that the models using the OSCAR-based ELMo embeddings consistently outperform the Wikipedia-based ones, suggesting that big high-coverage noisy corpora might be better than small high-quality narrow-coverage corpora for training contextualized language representations.
  • These embeddings establish a new state of the art for both POS tagging and dependency parsing in 6 different treebanks covering all 5 languages. (Footnotes from the original: https://commoncrawl.org; https://oscar-corpus.com; the OSCAR snapshot is from November 2018; both the Wikipedia- and the OSCAR-based embeddings for these 5 languages are available at https://oscar-corpus.com/#models.)
  • The authors train ELMo contextualized word embeddings for 5 languages: Bulgarian, Catalan, Danish, Finnish and Indonesian.
  • The authors train different versions of Embeddings from Language Models (ELMo) (Peters et al., 2018) on both the Wikipedia and OSCAR corpora, for each of the selected 5 languages.
  • The authors train ELMo models for Bulgarian, Catalan, Danish, Finnish and Indonesian using the OSCAR corpora on the one hand and the Wikipedia corpora on the other hand.
  • As previously mentioned, Finnish is morphologically richer than the other languages for which the authors trained ELMo. They hypothesize that the representation space of the ELMo embeddings might not be large enough to extract further features from the Finnish OSCAR corpus beyond the 5-epoch mark; testing this would require training a larger language model such as BERT, which is unfortunately beyond their computing infrastructure limits.
  • Considering the discussion above, the authors believe an interesting follow-up to the experiments would be training the ELMo models for more of the languages included in the OSCAR corpus.
  • The total CO2 emissions in kg for a single model can be computed from its measured power consumption and the carbon intensity of the electricity used (see the worked equation after this list and Table 7). In this paper, the authors have explored the use of the Common-Crawl-based OSCAR corpora to train ELMo contextualized embeddings for five typologically diverse mid-resource languages.
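
The exact formula is not reproduced in this summary. A hedged reconstruction following the methodology of Strubell et al. (2019), which the paper cites for its environmental-impact estimates, is sketched below; the PUE factor of 1.58 and the grid emission factor I_grid are assumptions of that methodology (Strubell et al. use the U.S. average), not necessarily the exact constants used by the authors.

    % Total power consumption p_t in kWh for t hours of training, with average
    % CPU power p_c, DRAM power p_r and g GPUs drawing p_g watts each,
    % scaled by an assumed datacenter PUE of 1.58 (Strubell et al., 2019):
    p_t = \frac{1.58\, t\, (p_c + p_r + g\, p_g)}{1000}
    % Emissions in kg of CO2-equivalent for a grid carbon intensity
    % I_{grid} (kg CO2e per kWh), matching the kWh and kg columns of Table 7:
    \mathrm{CO_2e} = I_{grid} \times p_t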
Conclusion
  • The authors have compared them with Wikipedia-based ELMo embeddings on two classical NLP tasks, POS tagging and parsing, using state-of-the-art neural architectures.
  • The authors' experiments show that Common-Crawl-based data such as the OSCAR corpus can be used to train high-quality contextualized embeddings with the relatively light ELMo architecture, as each model took less than 4 hours to train on a machine using a single NVIDIA Tesla V100 card.
Tables
  • Table 1: Size of Wikipedia corpora, measured in bytes, thousands of tokens, words and sentences
  • Table 2: Size of OSCAR subcorpora, measured in bytes, thousands of tokens, words and sentences
  • Table 3: Number of out-of-vocabulary words in random samples of 1M words for OSCAR and Wikipedia
  • Table 4: Size of treebanks, measured in thousands of tokens and sentences
  • Table 5: Scores from UDPipe 2.0 (from Kondratyuk and Straka, 2019), the previous state-of-the-art models UDPipe 2.0+mBERT (Straka et al., 2019) and UDify (Kondratyuk and Straka, 2019), and our ELMo-enhanced UDPipe 2.0 models. Test scores are given for UPOS, UAS and LAS in all five languages. Best scores are shown in bold, second best scores are underlined
  • Table 6: UPOS, UAS and LAS scores for the UDPipe 2.0 baseline reported by Kondratyuk and Straka (2019), plus the scores for checkpoints at 1, 3, 5 and 10 epochs for all the ELMoOSCAR and ELMoWikipedia models. All scores are test scores. Best ELMoOSCAR scores are shown in bold while best ELMoWikipedia scores are underlined
  • Table 7: Average power draw (watts), training times (in both hours and days), mean power consumption (kWh) and CO2 emissions (kg) for each ELMo model trained
  • Table 8: Number of training steps for each checkpoint, for the ELMoWikipedia and ELMoOSCAR of each language
Related Work
  • Since the introduction of word2vec (Mikolov et al., 2013), many attempts have been made to create multilingual language representations. For fixed word embeddings, the most remarkable works are those of Al-Rfou et al. (2013) and Bojanowski et al. (2017), who created word embeddings for a large number of languages using Wikipedia; later, Grave et al. (2018) trained the fastText word embeddings for 157 languages using Common Crawl and in fact showed that using crawled data significantly increases the performance of the embeddings, especially for mid- to low-resource languages (see the loading sketch at the end of this section).

    Regarding contextualized models, the most notable non-English contribution has been that of mBERT (Devlin et al., 2018), which is distributed as (i) a single multilingual model for 100 different languages trained on Wikipedia data, and (ii) a single multilingual model for both Simplified and Traditional Chinese. Four monolingual fully trained ELMo models have been distributed for Japanese, Portuguese, German and Basque (https://allennlp.org/elmo); 44 monolingual ELMo models (https://github.com/HIT-SCIR/ELMoForManyLangs) were also released by the HIT-SCIR team (Che et al., 2018) during the CoNLL 2018 Shared Task (Zeman et al., 2018), but their training sets were capped at 20 million words. A German BERT (Chan et al., 2019) as well as a French BERT model (called CamemBERT) (Martin et al., 2019) have also been released. In general, no particular effort in creating a set of high-quality monolingual contextualized representations has been made yet, or at least not on a scale that is comparable with what was done for fixed word embeddings.

    For dependency parsing and POS tagging, the most notable non-English-specific contribution is that of the CoNLL 2018 Shared Task (Zeman et al., 2018), where the 1st place (LAS ranking) was awarded to the HIT-SCIR team (Che et al., 2018), who used Dozat and Manning (2017)'s deep biaffine parser and its extension described in Dozat et al. (2017), coupled with deep contextualized ELMo embeddings (Peters et al., 2018), capping the training set at 20 million words. The 1st place in universal POS tagging was awarded to Smith et al. (2018), who used two separate instances of Bohnet et al. (2018)'s tagger.
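
As a small, hedged illustration of the fixed Common-Crawl-based word embeddings discussed above (Grave et al., 2018), the sketch below loads fastText's published vectors through the official Python bindings; the Finnish language code, file name and query word are illustrative choices, and this is not code from any of the cited papers.

    # Sketch: loading fastText's Common-Crawl-trained vectors for Finnish
    # (Grave et al., 2018) with the official Python bindings.
    import fasttext
    import fasttext.util

    fasttext.util.download_model("fi", if_exists="ignore")  # fetches cc.fi.300.bin
    ft = fasttext.load_model("cc.fi.300.bin")

    vec = ft.get_word_vector("pääkaupunki")               # 300-dimensional vector
    neighbours = ft.get_nearest_neighbors("pääkaupunki")  # list of (score, word)
    print(vec.shape, neighbours[:3])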
Funding
  • This work was partly funded by the French national ANR grant BASNUM (ANR-18-CE38-0003), as well as by the last author's chair in the PRAIRIE institute, funded by the French national ANR as part of the "Investissements d'avenir" programme under the reference ANR-19-P3IA-0001.
References
  • Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.
  • Bernd Bohnet, Ryan McDonald, Gonçalo Simões, Daniel Andor, Emily Pitler, and Joshua Maynez. 2018. Morphosyntactic tagging with a meta-BiLSTM model over context sensitive token encodings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2642–2652, Melbourne, Australia. Association for Computational Linguistics.
  • Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Matthias Buch-Kromann. 2003. The Danish Dependency Treebank and the DTAG treebank tool. In 2nd Workshop on Treebanks and Linguistic Theories (TLT), pages 217–220, Sweden.
  • Branden Chan, Timo Möller, Malte Pietsch, Tanay Soni, and Chin Man Yeung. 2019. German BERT. https://deepset.ai/german-bert.
  • Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 55–64, Brussels, Belgium. Association for Computational Linguistics.
  • Spencer Desrochers, Chad Paradis, and Vincent M. Weaver. 2016. A validation of DRAM RAPL power measurements. In Proceedings of the Second International Symposium on Memory Systems (MEMSYS '16), pages 455–470, New York, NY, USA. Association for Computing Machinery.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Multilingual BERT. https://github.com/google-research/bert/blob/master/
  • Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France.
  • Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30, Vancouver, Canada. Association for Computational Linguistics.
  • Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
  • Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.
  • Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. arXiv preprint arXiv:1904.02099.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894.
  • Rada Mihalcea. 2007. Using Wikipedia for automatic word sense disambiguation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 196–203, Rochester, New York. Association for Computational Linguistics.
  • Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems.
  • Joakim Nivre et al. 2018. Universal Dependencies 2.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University.
  • Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In Challenges in the Management of Large Corpora (CMLC-7) 2019, page 9.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Slav Petrov, Dipanjan Das, and Ryan T. McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 2089–2096, Istanbul, Turkey. European Language Resources Association (ELRA).
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI Blog.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1:8.
  • Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, Yan Shao, and Sara Stymne. 2018. 82 treebanks, 34 models: Universal dependency parsing with multi-treebank models. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium. Association for Computational Linguistics.
  • Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207, Brussels, Belgium. Association for Computational Linguistics.
  • Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.
  • Milan Straka, Jana Straková, and Jan Hajič. 2019. Evaluating contextualized embeddings on 54 languages in POS tagging, lemmatization and dependency parsing. arXiv preprint arXiv:1908.07448.
  • Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.
  • Mariona Taulé, M. Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco. European Language Resources Association.
  • Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
  • Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2019. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.
  • Fei Wu and Daniel S. Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 118–127, Uppsala, Sweden. Association for Computational Linguistics.
  • Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21, Brussels, Belgium. Association for Computational Linguistics.