
Improving Multilingual Models with Language Clustered Vocabularies

EMNLP 2020, pp. 4536–4546.


Abstract

State-of-the-art multilingual models depend on vocabularies that cover all of the languages the model will expect to see at inference time, but the standard methods for generating those vocabularies are not ideal for massively multilingual applications. In this work, we introduce a novel procedure for multilingual vocabulary generation that seeks to balance the trade-off between cross-lingual subword sharing and the robust representation of individual languages.

Introduction
  • Multilingual models such as mBERT (Devlin et al., 2019), XLM (Lample and Conneau, 2019), and XLM-R (Conneau et al., 2020) have built on the advances of deep contextualized language modeling by pretraining on texts from many languages at once.
  • The multilingual subword vocabularies used by the state-of-the-art models are generated by algorithms such as WordPiece (Schuster and Nakajima, 2012; Wu et al., 2016), SentencePiece (Kudo and Richardson, 2018), or Byte Pair Encoding (BPE) (Sennrich et al., 2016).
  • Given a desired vocabulary size, these algorithms select an inventory of subwords that compactly represents the training corpora, which means preferring subwords that occur frequently and, by extension for multilingual models, occur frequently across languages (see the sketch below).
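The snippet below is a minimal illustration of this step using the SentencePiece library: given a raw-text corpus and a target vocabulary size, it learns a subword inventory and segments text with it. The file names, vocabulary size, and model type are illustrative placeholders, not settings from the paper.

```python
import sentencepiece as spm

# Learn a fixed-size subword vocabulary from a raw-text corpus
# ("corpus.txt" and vocab_size=32000 are placeholders).
spm.SentencePieceTrainer.train(
    input="corpus.txt",           # one sentence per line
    model_prefix="multilingual",  # writes multilingual.model / multilingual.vocab
    vocab_size=32000,
    model_type="unigram",         # "bpe" is also supported
    character_coverage=0.9995,    # leave very rare characters out of the inventory
)

# Segment text with the learned vocabulary.
sp = spm.SentencePieceProcessor(model_file="multilingual.model")
print(sp.encode("multilingual vocabularies", out_type=str))
```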
Highlights
  • Multilingual models such as mBERT (Devlin et al., 2019), XLM (Lample and Conneau, 2019), and XLM-R (Conneau et al., 2020) have built on the advances of deep contextualized language modeling by pretraining on texts from many languages at once
  • Conneau et al. (2020) showed that increasing the vocabulary size can produce quality gains, but unlike similar monolingual models, the vocabulary embedding matrix in each of these multilingual models constitutes a significant fraction of its total parameters; for example, 47% of XLM-R's parameters are in its embedding matrix
  • We propose a novel approach to multilingual vocabulary generation that seeks to balance the trade-off between optimizing for cross-lingual subword sharing and the need for robust representation of individual languages
  • We directly examine the vocabularies produced by the standard recipe (JOINT) and our clustering-based method (CLUSTER) in order to assess the degree to which each approach is able to capture and balance the inductive biases introduced in §1: a multilingual vocabulary should encourage subword sharing across languages when appropriate, but each language should have the freedom to contribute the subwords that are most effective for its own representation
  • We describe a novel clustering-based multilingual vocabulary generation algorithm (sketched after this list)
  • Our method improves performance without any changes to the model, and since it does not depend on the model architecture, it can be applied to any model that uses a vocabulary
  • We showed that this empirically motivated clustering-based method consistently outperforms the standard vocabulary generation recipe used by most multilingual pretrained language modeling work without increasing the model size, compute, or data
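The sketch below illustrates the overall shape of such a clustering-based recipe: cluster the languages, train one vocabulary per cluster, and take the union. It is a minimal, hypothetical rendering, assuming each language is represented by a binary membership vector over the subwords of its own monolingual vocabulary; the paper's exact feature representation, per-cluster vocabulary sizes, and combination details may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_languages(per_lang_vocabs, k):
    """Group languages with k-means over binary subword-membership vectors.

    per_lang_vocabs: dict mapping language code -> set of subwords learned from
    that language's monolingual corpus (this feature choice is an assumption).
    """
    langs = sorted(per_lang_vocabs)
    all_subwords = sorted(set().union(*per_lang_vocabs.values()))
    index = {s: i for i, s in enumerate(all_subwords)}

    features = np.zeros((len(langs), len(all_subwords)), dtype=np.float32)
    for row, lang in enumerate(langs):
        for sub in per_lang_vocabs[lang]:
            features[row, index[sub]] = 1.0

    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    clusters = [[] for _ in range(k)]
    for lang, label in zip(langs, labels):
        clusters[label].append(lang)
    return clusters

def combine_cluster_vocabs(cluster_vocabs):
    """Union the per-cluster vocabularies (each a list of subwords) into one
    multilingual vocabulary, dropping duplicates while preserving order."""
    final, seen = [], set()
    for vocab in cluster_vocabs:
        for sub in vocab:
            if sub not in seen:
                seen.add(sub)
                final.append(sub)
    return final
```

With k = 1 this degenerates to the JOINT baseline (a single vocabulary for all languages), and with one cluster per language it degenerates to fully separate vocabularies, matching the two endpoints described for Table 1.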
Methods
  • The principal goal of this work is to investigate the effect of improved vocabulary composition on multilingual models.
  • The authors keep the number of languages constant, since per-language model capacity is known to affect the performance of multilingual models, as shown in Conneau et al. (2020).
  • The authors keep the number of parameters constant, including the vocabulary size, since the performance of Transformer-based (Vaswani et al., 2017) models is strongly correlated with the number of parameters (Lepikhin et al., 2020; Kaplan et al., 2020; Raffel et al., 2020; Conneau et al., 2020; Brown et al., 2020).
Results
  • The authors' method improves performance without any changes to the model, and since it does not depend on the model architecture, it can be applied to any model that uses a vocabulary.
Conclusion
  • The authors describe a novel clustering-based multilingual vocabulary generation algorithm. The authors showed that this empirically motivated clustering-based method consistently outperforms the standard vocabulary generation recipe used by most multilingual pretrained language modeling work without increasing the model size, compute or data.
Tables
  • Table1: TYDI QA results for our k-means-based vocabulary generation approach on different values of k. k = 1 puts all languages in a single cluster, and is thus equivalent to the baseline JOINT approach; k = 104 generates a separate vocabulary for each language
  • Table2: Cluster definitions of CLUSTER
  • Table3: Wasserstein-1 distance (×1000)
  • Table4: Percentage of each vocabulary’s subwords that contain CJK or Arabic script characters
  • Table5: Results on the TYDI QA primary tasks: minimal answer span (MINSPAN) and passage selection (SELECTP). The final column (Avg) is the macro average excluding English, following Clark et al. (2020)
  • Table6: XNLI accuracies
  • Table7: Results for zero-shot NER using cross-lingual transfer from English (following XTREME (Hu et al.)) for a sample of languages, grouped according to the clustering used by CLUSTER. All scores are labeled span F1, and Avg is the macro average across all 40 XTREME languages
  • Table8: Average description length and OOV rate. These are computed on the pretraining data with the same sampling strategies used for pretraining
  • Table9: Comparisons with a full-scale model using our approach. Scores are macro averages over languages. TYDI QA numbers are in (MINSPAN/SELECTP) format, and our results are on the test set via a submission to the official leaderboard. Baseline numbers for TYDI QA are from Clark et al. (2020) and the rest are from Hu et al.
  • Table10: XNLI accuracy on the development set. The best performing hyperparameters for both JOINT and CLUSTER were a learning rate of 2 × 10⁻⁵, a batch size of 32, and 3 training epochs
  • Table11: WikiAnn NER F1 scores on the development set. We used a learning rate of 4 × 10⁻⁵, a batch size of 32, and 2 training epochs for both CLUSTER and JOINT
  • Table12: Full-scale model’s result on the TYDI QA primary tasks (development set)
  • Table13: XNLI (development set) result for the full-scale model. The best performing model has a learning rate of 0.0001, 9 training epochs, and a batch size of 512
  • Table14: WikiAnn NER F1 scores of the full-scale model on the development set. The best performing model has a learning rate of 3 × 10⁻⁵, a batch size of 32, and was trained for 2 epochs
  • Table15: Number of examples in the training, development, and test splits for each evaluation dataset
  • Table16: WikiAnn NER results in Table 7 on all 40 languages. The average F1 scores are 61.7 and 64.5 for JOINT and CLUSTER, respectively
  • Table17: WikiAnn NER results in Table 9 on all 40 languages. The average F1 score is 73.6
  • Table18: List of languages used in the pre-training
Study subjects and analysis
In order to demonstrate the effectiveness of our approach across languages and downstream tasks, we evaluate our method on three distinct datasets:

• TYDI QA (Clark et al., 2020): question answering in 10 languages (results in Table 5)
• XNLI (Conneau et al., 2018): cross-lingual natural language inference (results in Table 6)
• WikiAnn NER (Pan et al., 2017): named entity recognition across the 40 XTREME languages (results in Table 7)

Reference
  • Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges. arXiv e-prints, page arXiv:1907.05019.
  • Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCand lish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv e-prints, page arXiv:2005.14165.
  • Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics.
  • Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440– 8451, Online. Association for Computational Linguistics.
  • Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Sharon J. Goldwater. 2006. Nonparametric Bayesian Models of Lexical Acquisition. Ph.D. thesis, Brown University.
  • Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. In International Conference on Machine Learning, ICML 2020.
  • Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-Lingual Ability of Multilingual BERT: An Empirical Study. In Proc. of the International Conference on Learning Representations.
  • Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv e-prints, page arXiv:2001.08361.
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
  • Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems (NeurIPS).
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.
  • Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv e-prints, page arXiv:2006.16668.
  • Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Crosslingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-totext transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019. Massively multilingual transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151–164, Florence, Italy. Association for Computational Linguistics.
  • Jorma Rissanen. 1989. Stochastic Complexity In Statistical Inquiry. World Scientific Publishing Co., Singapore.
  • Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715– 1725, Berlin, Germany. Association for Computational Linguistics.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv e-prints, page arXiv:1609.08144.
  • Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large batch optimization for deep learning: Training bert in 76 minutes. In International Conference on Learning Representations.
Training details
  • We did not use any hyperparameter search for pretraining. We use the LAMB optimizer (You et al., 2020) with a batch size of 4096. You et al. (2020) also recommend a learning rate of 0.0018 and a warmup proportion of 2.5% of the total number of steps. We use linear warm-up and linear learning rate decay down to 0 at the last step, and gradient clipping with a norm of 1.0.
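A minimal sketch of such a linear warm-up / linear decay schedule; the peak learning rate and warmup fraction are the values quoted above from You et al. (2020), and the step counts in the example are made up:

```python
def lamb_learning_rate(step, total_steps, peak_lr=0.0018, warmup_frac=0.025):
    """Linear warm-up to peak_lr, then linear decay to 0 at the last step."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Linear decay over the remaining steps, reaching 0 at total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

# Illustrative schedule values for a (made-up) 100k-step run.
for s in (0, 1_000, 2_500, 50_000, 100_000):
    print(s, round(lamb_learning_rate(s, 100_000), 6))
```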
  • The probability of sampling from language l is p_l = n_l / Σ_k n_k, where n_l is the number of sentences in language l’s corpus. Then we use an exponential smoothing value of 0.7 following Devlin et al. (2019), i.e., we exponentiate p_l and renormalize to compute the sampling probabilities of each language.
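A minimal sketch of this sampling computation; only the exponent 0.7 comes from the text, and the language codes and sentence counts below are made-up placeholders:

```python
import numpy as np

def smoothed_sampling_probs(sentence_counts, alpha=0.7):
    """Exponentially smoothed language sampling probabilities.

    sentence_counts: dict mapping language -> number of sentences n_l.
    Computes p_l = n_l / sum_k n_k, then q_l = p_l**alpha / sum_k p_k**alpha.
    """
    langs = list(sentence_counts)
    n = np.array([sentence_counts[l] for l in langs], dtype=np.float64)
    p = n / n.sum()     # raw corpus proportions
    q = p ** alpha      # dampen high-resource languages, boost low-resource ones
    q /= q.sum()        # renormalize to a probability distribution
    return dict(zip(langs, q))

# Made-up counts purely for illustration.
print(smoothed_sampling_probs({"en": 1_000_000, "th": 200_000, "sw": 50_000}))
```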
  • Except for the full-scale model, all models have 12 Transformer layers, a hidden size of 768, and 12 attention heads. For faster experimentation, we used an embedding size of 128, similar to Lan et al. (2020). The total number of parameters is 150M. The full-scale model has 24 Transformer layers, a hidden size of 1024, and 16 attention heads; we used an embedding size of 512, totaling 550M parameters. We chose this number of parameters to mimic XLM-R.
  • We ran experiments with two seed values and chose the best model based on the average of the two runs. For fine-tuning, we used the Adam optimizer (Kingma and Ba, 2015).
  • For WikiAnn NER, we used a learning rate of 4 × 10⁻⁵, a batch size of 32, and 2 training epochs. We found that performance is robust with respect to this set of hyperparameters, so we did not change this setting. Training was run on 4 TPUs and took about one hour. The evaluation metric is span-level F1 score. Our evaluation code was tested against the seqeval library (https://github.com/chakki-works/seqeval) and produces the same scores.
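For reference, a minimal example of computing span-level F1 with the seqeval library; the BIO tag sequences below are made-up placeholders:

```python
from seqeval.metrics import f1_score

# Gold and predicted BIO tag sequences for two (made-up) sentences.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O", "O"]]

# Span-level F1: a predicted entity counts as correct only if both its
# boundaries and its type match the gold span exactly.
print(f1_score(y_true, y_pred))
```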
  • For TYDI QA, we found that larger batch sizes improved training stability, so we used a batch size of 512. With the larger batch size, longer training was helpful, so we ran a grid search over learning rates of [3 × 10⁻⁵, 4 × 10⁻⁵, 5 × 10⁻⁵] and training epochs of [7, 8, 9]. We chose the best model based on the macro-averaged F1 score over 10 languages, excluding English, following Clark et al. (2020). For the hyperparameters that are specific to TYDI QA, we used the same settings as the baseline model from Clark et al. (2020): 45 maximum passages, an include-unknowns rate of 0.1, a sequence length of 512, and a window stride of 128. The evaluation metric is F1 score, which we computed with the official evaluation script from https://github.com/google-research…
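A minimal sketch of the grid search described above; fine_tune_and_eval is a hypothetical stand-in for the actual fine-tuning and evaluation pipeline and is not part of the paper's code.

```python
import itertools

def grid_search(fine_tune_and_eval):
    """Pick the (learning rate, epochs) pair with the best macro-averaged F1.

    fine_tune_and_eval is a hypothetical callable that fine-tunes the model with
    the given hyperparameters and returns the macro-averaged F1 over the
    10 non-English TYDI QA languages.
    """
    learning_rates = [3e-5, 4e-5, 5e-5]
    epochs = [7, 8, 9]

    best_score, best_config = float("-inf"), None
    for lr, n_epochs in itertools.product(learning_rates, epochs):
        score = fine_tune_and_eval(learning_rate=lr, train_epochs=n_epochs,
                                   batch_size=512)
        if score > best_score:
            best_score, best_config = score, (lr, n_epochs)
    return best_config, best_score
```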
Authors
Dan Garrette
Kiat Chuan Tan
Jason Riesa