X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models

EMNLP 2020, pp. 5943–5959.

We examine the intersection of multilinguality and the factual knowledge captured in language models (LMs) by creating X-FACTR, a multilingual, multi-token benchmark, and by performing experiments that compare and contrast across languages and LMs.

Abstract:

Language models (LMs) have proven surprisingly successful at capturing factual knowledge by completing cloze-style fill-in-the-blank questions such as “Punta Cana is located in _.” However, while knowledge is both written and queried in many languages, studies on LMs’ factual representation ability have almost invariably been performed on English.

Introduction
  • Language models (LMs; Church, 1988; Kneser and Ney, 1995; Bengio et al., 2003) learn to model the probability distribution of text, and in doing so capture information about various aspects of the syntax or semantics of the language at hand.
  • Recent works have presented intriguing results demonstrating that modern large-scale LMs capture a significant amount of factual knowledge (Petroni et al., 2019; Jiang et al., 2020; Poerner et al., 2019).
  • This knowledge is generally probed by having the LM fill in the blanks of cloze-style prompts such as “Punta Cana is located in _.” (see the probing sketch below).
  • [Figure: per-language values (axis label “fact”): el 0.2, war 0.2, mr 0.1, mg 0.09, bn 0.09, tl 0.07, sw 0.06, pa 0.04, ceb 0.03, yo 0.03, ilo 0.02]
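The following is a minimal sketch of such a single-token cloze probe, using the HuggingFace fill-mask pipeline with M-BERT (`bert-base-multilingual-cased`); it illustrates the probing setup, not the paper's evaluation harness.

```python
# Minimal single-token cloze probe with a masked LM (illustrative sketch).
# The prompt expresses the fact (Punta Cana, located-in, ?) as a cloze question.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
for pred in fill("Punta Cana is located in [MASK].", top_k=5):
    print(f"{pred['token_str']:>15s}  {pred['score']:.3f}")
```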
Highlights
  • Language models (LMs; Church, 1988; Kneser and Ney, 1995; Bengio et al., 2003) learn to model the probability distribution of text, and in doing so capture information about various aspects of the syntax or semantics of the language at hand
  • We study the intersection of multilinguality and the factual knowledge included in LMs
  • We perform experiments on X-FACTR (§5), comparing and contrasting across languages and LMs, to answer the following research questions: (1) How and why does performance vary across different languages and models? (2) Can multilingual pre-training increase the amount of factual knowledge in LMs over monolingual pre-training? (3) How much does the knowledge captured in different languages overlap? We find that factual knowledge retrieval with M-LMs is easier in high-resource languages than in low-resource ones, but that overall performance is relatively low, indicating that this is a challenging task
  • Considering that we have more raw than code-switched sentences in the dataset, this seems to indicate that English entities are easier to predict than their prompt-language counterparts, which might be because facts expressed in English are better learned in the pre-trained model due to training-data abundance
  • We examine the intersection of multilinguality and the factual knowledge included in LMs by creating X-FACTR, a multilingual and multi-token benchmark, and performing experiments comparing and contrasting across languages and LMs
  • While previous works (Petroni et al., 2019; Jiang et al., 2020; Poerner et al., 2019) have limited examination to single-token entities (e.g., “France”), we expand our setting to include multi-token entities (e.g., “United States”), which comprise more than 75% of the facts in our underlying database (Wikidata; §3.2)
  • The results demonstrate the difficulty of this task, and that knowledge contained in LMs varies across languages
Methods
  • The authors propose to use code-switching to create data to fine-tune pretrained LMs, replacing entity mentions in one language (e.g., English/Greek) with their counterparts in another language (e.g., Greek/English).
  • Through this bi-directional code-switching, entity mentions serve as pivots, enabling knowledge that was originally learned in one language to be shared with others.
  • The authors fine-tune M-BERT using the masked LM objective on this data, with 15% of non-mention words and 50% of mention words masked out (a data-creation sketch follows below).
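Below is a minimal, hypothetical sketch of this data-creation recipe, assuming the sentence is already tokenized and the entity mention located; the helpers `code_switch` and `mask_for_mlm` and the Greek example are illustrative stand-ins, not the paper's released code.

```python
import random

def code_switch(tokens, mention_span, target_label):
    """Replace the mention at tokens[i:j] with its label in the other language."""
    i, j = mention_span
    return tokens[:i] + target_label.split() + tokens[j:]

def mask_for_mlm(tokens, mention_idx, mask="[MASK]", p_mention=0.50, p_other=0.15):
    """Mask 50% of mention tokens and 15% of all other tokens, as described above."""
    return [mask if random.random() < (p_mention if k in mention_idx else p_other)
            else tok
            for k, tok in enumerate(tokens)]

# English sentence with the Greek label for "Athens" swapped in as the mention.
sent = "The capital of Greece is Athens .".split()
switched = code_switch(sent, (5, 6), "Αθήνα")  # -> ... is Αθήνα .
print(" ".join(mask_for_mlm(switched, mention_idx={5})))
```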
Results
  • The authors run both the independent and confidence-based decoding methods with 3 M-LMs and, where available, 8 monolingual LMs, across 23 languages, with results shown in Fig. 3.
  • Even in the most favorable settings, the performance of state-of-the-art M-LMs at retrieving factual knowledge in the X-FACTR benchmark is relatively low: less than 15% on high-resource languages (e.g., English and Spanish) and less than 5% for some low-resource languages (e.g., Marathi and Yoruba).
  • This may initially come as a surprise, given the favorable performance reported in previous papers (Petroni et al., 2019; Jiang et al., 2020), which achieved accuracies over 30% on English.
  • Considering that the authors have more raw than code-switched sentences in the dataset, this seems to indicate that English entities are easier to predict than their prompt-language counterparts, which might be because facts expressed in English are better learned in the pre-trained model due to training-data abundance.
Conclusion
  • The authors examine the intersection of multilinguality and the factual knowledge included in LMs by creating X-FACTR, a multilingual and multi-token benchmark, and performing experiments comparing and contrasting across languages and LMs.
  • Future directions include other pre-training or fine-tuning methods to improve retrieval performance, and methods that encourage the LM to predict entities of the correct types.
Objectives
  • The authors aim to handle sentences containing multiple mask tokens, conditioned on the surrounding actual words (a minimal decoding sketch follows below).
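A minimal sketch of this multi-token setting, assuming the number of mask tokens M is given (the paper studies how to choose and decode it): the blank is expanded into M [MASK] tokens and every masked position is predicted in one forward pass, conditioned on the surrounding real words. The model and prompt here are illustrative.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

name = "bert-base-multilingual-cased"
tok = BertTokenizer.from_pretrained(name)
model = BertForMaskedLM.from_pretrained(name).eval()

M = 2  # number of mask tokens allotted to the object entity
text = "Punta Cana is located in " + " ".join(["[MASK]"] * M) + " ."
enc = tok(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits

# Independent prediction: take the argmax at each masked position simultaneously.
mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
pred_ids = logits[0, mask_pos].argmax(dim=-1)
print(tok.decode(pred_ids))
```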
Tables
  • Table1: X-FACTR benchmark statistics (in thousands). More details in the Appendix (Tab. 5 and Fig. 6)
  • Table2: Error cases of M-BERT in English and ratio of different error types in English, Spanish, and Greek (%). Error cases in Spanish and Greek can be found in Tab. 9 in the Appendix
  • Table3: Accuracy of different decoding methods using M-BERT on English and Chinese (%)
  • Table4: Accuracy of M-BERT after fine-tuning on raw and code-switched text (%)
  • Table5: Detailed X-FACTR Benchmark statistics. Languages are ranked by the total number of facts
  • Table6: Error analysis on the prompts after instantiating with actual examples. We note that the error categories are not mutually exclusive. *: The Russian inflection percentage includes gender and number errors, unlike the other languages; the Russian annotator also marked all erroneous sentences as “awkward”, skewing the results
  • Table7: Prediction results of M-BERT where the bestperforming decoding method makes correct predictions while the independent prediction method does not
  • Table8: Shortcut name of each multilingual/monolingual LM in HuggingFace’s Transformers library, and their training corpora. ◦ The OSCAR corpus is extracted from the CommonCrawl corpus. TwNC is a multifaceted Dutch News Corpus. † SoNaR-500 is a multi-genre Dutch reference corpus. OPUS is a translated text corpus from the web. Europarl is a corpus of parallel text
  • Table9: Error cases of M-BERT in Spanish and Greek (%)
  • Table10: Accuracy on different languages using different LMs (%). We use M = 5 mask tokens for en, fr, nl, es, vi (on the left) and M = 10 mask tokens for the other languages on the right. Best results for each language-part combination are in bold. “-” denotes missing/unsupported models
Related work
  • Factual Knowledge Retrieval from LMs Several works have focused on probing factual knowledge solely from pre-trained LMs, without access to external knowledge. They do so either by using prompts and letting the LM fill in the blanks, which treats the LM as a static knowledge source (Petroni et al., 2019; Jiang et al., 2020; Poerner et al., 2019; Bouraoui et al., 2020), or by fine-tuning the LM on a set of question-answer pairs to directly generate answers, which dynamically adapts the LM to this particular task (Roberts et al., 2020). The impressive results demonstrated by these works indicate that large-scale LMs contain a significant amount of knowledge, in some cases even outperforming competitive question answering systems that rely on external resources (Roberts et al., 2020). Petroni et al. (2020) further show that LMs can generate even more factual knowledge when augmented with retrieved sentences. Our work builds on these efforts by expanding to multilingual and multi-token evaluation, and also demonstrates the significant challenges posed by this setting.

    Multilingual Benchmarks Many multilingual benchmarks have been created to evaluate the performance of multilingual systems on different natural language processing tasks, including question answering (Artetxe et al., 2020; Lewis et al., 2019; Clark et al., 2020), natural language understanding (Conneau et al., 2018; Yang et al., 2019a; Zweigenbaum et al., 2018; Artetxe and Schwenk, 2019), syntactic prediction (Nivre et al., 2018; Pan et al., 2017), and comprehensive benchmarks covering multiple tasks (Hu et al., 2020; Liang et al., 2020). We focus on multilingual factual knowledge retrieval from LMs, which to our knowledge has not been covered by any previous work.
Funding
  • This work was supported by a gift from Bosch Research
Study subjects and analysis
subject-object pairs with probability proportional to their frequency: 1000
Since T-REx aligns facts from Wikidata with sentences in abstract sections from DBpedia, we can estimate the commonality of each fact based on how often it is grounded to a sentence in these abstracts. For each of the 46 relations in T-REx, we sample 1000 subject-object pairs with probability proportional to their frequency. Frequency-proportional sampling keeps the distribution of facts in the benchmark close to real usage and covers facts of different popularity (a sampling sketch follows below).
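A small illustrative sketch of such frequency-proportional sampling for one relation; `pair_counts`, mapping each (subject, object) pair to how often T-REx grounds it in an abstract, is a hypothetical stand-in for the real alignment counts.

```python
import random

def sample_pairs(pair_counts, k=1000, seed=0):
    """Sample up to k distinct pairs, with probability proportional to frequency."""
    rng = random.Random(seed)
    pairs, weights = zip(*pair_counts.items())
    chosen = set()
    # Rejection-style loop: draw proportionally to frequency, keep distinct pairs.
    while len(chosen) < min(k, len(pairs)):
        chosen.add(rng.choices(pairs, weights=weights, k=1)[0])
    return list(chosen)

pair_counts = {("Punta Cana", "Dominican Republic"): 3, ("Paris", "France"): 120}
print(sample_pairs(pair_counts, k=2))
```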

error cases: 400
Even with access to an oracle for the number of target tokens, though, the performance is still lower than 20%. To understand the types of errors made by the LMs, we sample over 400 error cases in English, Spanish, and Greek, and classify them. The error type distributions along with English examples are outlined in Tab. 2

References
  • Judit Ács. 2019. Exploring BERT’s vocabulary. Accessed May 2020.
  • Antonios Anastasopoulos and Graham Neubig. 2019. Pushing the limits of low-resource morphological inflection. In Proceedings of EMNLP-IJCNLP 2019, pages 984–996, Hong Kong, China. Association for Computational Linguistics.
  • Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of ACL 2020.
  • Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics.
  • Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.
  • Zied Bouraoui, Jose Camacho-Collados, and Steven Schockaert. 2020. Inducing relational knowledge from BERT. In Proceedings of AAAI 2020, New York, USA.
  • José Cañete, Gabriel Chaperon, Rodrigo Fuentes, and Jorge Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020.
  • Kyunghyun Cho. 2019. BERT has a mouth and must speak, but it is not an MRF. Accessed May 2020.
  • Kenneth Ward Church. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing, pages 136–143, Austin, Texas, USA. Association for Computational Linguistics.
  • Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics.
  • Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.
  • Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of EMNLP 2018, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
  • Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 7057–7067, Vancouver, BC, Canada.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Hady ElSahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon S. Hare, Frédérique Laforest, and Elena Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In Proceedings of LREC 2018, Miyazaki, Japan. European Language Resources Association (ELRA).
  • Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-Predict: Parallel decoding of conditional masked language models. In Proceedings of EMNLP-IJCNLP 2019, pages 6112–6121, Hong Kong, China. Association for Computational Linguistics.
  • Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. CoRR, abs/2003.11080.
  • Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics (TACL).
  • James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
  • Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 181–184. IEEE.
  • Yuri Kuratov and Mikhail Arkhipov. 2019. Adaptation of deep bidirectional multilingual transformers for Russian language. CoRR, abs/1905.07213.
  • Carolin Lawrence, Bhushan Kotnis, and Mathias Niepert. 2019. Attending to future tokens for bidirectional sequence generation. In Proceedings of EMNLP-IJCNLP 2019, pages 1–10, Hong Kong, China. Association for Computational Linguistics.
  • Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.
  • Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Bruce Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Rangan Majumder, and Ming Zhou. 2020. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. CoRR, abs/2004.01401.
  • Bill Y. Lin, Frank F. Xu, Kenny Q. Zhu, and Seung-won Hwang. 2018. Mining cross-cultural differences and similarities in social media. In Proceedings of ACL 2018, pages 709–719, Melbourne, Australia. Association for Computational Linguistics.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: A tasty French language model. In Proceedings of ACL 2020.
  • Joakim Nivre, Mitchell Abrams, Željko Agić, Lars Ahrenberg, Lene Antonsen, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, et al. 2018. Universal Dependencies 2.2.
  • Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of ACL 2017, pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
  • Fabio Petroni, Patrick S. H. Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. How context affects language models’ factual predictions. CoRR, abs/2005.04611.
  • Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP-IJCNLP 2019, pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
  • Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2019. E-BERT: Efficient-yet-effective entity embeddings for BERT. CoRR, abs/1911.03681.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
  • Michael Ringgaard, Rahul Gupta, and Fernando C. N. Pereira. 2017. SLING: A framework for frame semantic parsing. CoRR, abs/1710.07032.
  • Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? CoRR, abs/2002.08910.
  • Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. Masked language model scoring. In Proceedings of ACL 2020.
  • Stefan Schweter. 2020. BERTurk - BERT models for Turkish.
  • Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):517–529.
  • Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT model. CoRR, abs/1912.09582.
  • Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. CoRR, abs/1902.04094.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
  • Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Emerging cross-lingual structure in pretrained language models. In Proceedings of ACL 2020.
  • Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019a. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of EMNLP-IJCNLP 2019, pages 3687–3692, Hong Kong, China. Association for Computational Linguistics.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019b. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 5754–5764, Vancouver, BC, Canada.
  • Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2018. Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the 11th Workshop on Building and Using Comparable Corpora, pages 39–42.
Appendix
  • Despite these two outliers, we consider the rest of our prompts to be of high quality. Even if small inflection or grammatical gender assignment mistakes occur (e.g. in Greek), this should not render the prompt unintelligible to native speakers – the burden is on the model to be robust to such slight variations, just as humans are. We point out that the prompts can be awkward or incorrect for some senses captured by the relation, an issue unrelated to our gender heuristics or automatic inflection. This issue, though, is also present in the LAMA English prompts (Petroni et al., 2019; Jiang et al., 2020) and is the result of the original Wikidata annotation.
  • We outline here the exact formulation of our multi-token decoding algorithms. Given a sentence with multiple mask tokens (e.g., Eq. 2), we can either generate outputs in parallel independently, or one at a time conditioned on the previously generated tokens. These methods are similar to the prediction problems that BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019b) solve in their respective pre-training stages. We define $c \in \mathbb{R}^n$ as the confidence of each prediction, with details varying by prediction method.
  • In the refinement stage, we choose from all predicted tokens the one with the lowest confidence (i.e., the lowest probability) and re-predict it (Ghazvininejad et al., 2019): $\hat{y}_k = \operatorname{argmax}_{y_k} p(y_k \mid s_{i:j \setminus k})$, $c_k = p(\hat{y}_k \mid s_{i:j \setminus k})$, $\hat{k} = \operatorname{argmin}_k c_k$, where $s_{i:j \setminus k}$ denotes the sentence with position $k$ masked out. A minimal sketch of this loop follows below.
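A minimal sketch of this refinement loop in the style of Mask-Predict, run with M-BERT through HuggingFace Transformers; it illustrates the equations above under simplifying assumptions (a fixed number of refinement steps, greedy re-prediction), not the paper's released implementation.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

name = "bert-base-multilingual-cased"
tok = BertTokenizer.from_pretrained(name)
model = BertForMaskedLM.from_pretrained(name).eval()

def refine(ids, mask_pos, steps=3):
    ids = ids.clone()
    conf = torch.zeros(len(mask_pos))
    # Initial independent fill: argmax token and its probability at every slot.
    with torch.no_grad():
        probs = model(input_ids=ids).logits.softmax(-1)
    for n, p in enumerate(mask_pos):
        conf[n], ids[0, p] = probs[0, p].max(dim=-1)
    # Refinement: re-mask the least-confident position and re-predict it,
    # conditioned on all other (now filled-in) tokens.
    for _ in range(steps):
        n = int(conf.argmin())
        ids[0, mask_pos[n]] = tok.mask_token_id
        with torch.no_grad():
            probs = model(input_ids=ids).logits.softmax(-1)
        conf[n], ids[0, mask_pos[n]] = probs[0, mask_pos[n]].max(dim=-1)
    return ids

enc = tok("Punta Cana is located in [MASK] [MASK] .", return_tensors="pt")
mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
print(tok.decode(refine(enc["input_ids"], mask_pos)[0]))
```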