We introduce knowledge-intensive language tasks, a benchmark for assessing models that need to condition on specific knowledge in a defined snapshot of Wikipedia to solve tasks spanning five domains
KILT: a Benchmark for Knowledge Intensive Language Tasks
NAACL-HLT, pp.2523-2544, (2021)
Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, ...
- There has been substantial progress on natural language processing tasks where the inputs are short textual contexts such as sentences, paragraphs, or perhaps a handful of documents
- We contribute: 1. a publicly-available benchmark of knowledge-intensive tasks aligned to a single Wikipedia snapshot, to spur the development of general-purpose models and enable their comparison; 2. an open-source library to facilitate the development of new architectures for knowledge-intensive tasks; 3. a provenance indication for all instances in knowledge-intensive language tasks (KILT), made more comprehensive with an annotation campaign, which makes it possible to jointly assess output accuracy and the ability to provide supporting evidence in the knowledge source; 4. a comparison of the performance of various modeling approaches, showing promising results for general baselines across all tasks
- We introduce KILT, a benchmark for assessing models that need to condition on specific knowledge in a defined snapshot of Wikipedia to solve tasks spanning five domains
- The goal is to catalyze and facilitate research towards general and explainable models equipped with task-agnostic representations of knowledge
- Our experiments show promising results for a general solution combining dense retrieval and seq2seq generation, although there is large room for improvement
- We plan to explore multi-task learning to exploit synergies between KILT tasks and datasets in the future, and to develop general approaches for representing large-scale textual knowledge sources that are useful for multiple downstream tasks
1 Introduction:There has been substantial progress on natural language processing tasks where the inputs are short textual contexts such as sentences, paragraphs, or perhaps a handful of documents.
- While IR focuses on finding relevant material, the tasks the authors consider focus on more fine-grained behavior, such as producing specific answers to queries
- For such knowledge-intensive tasks, general infrastructure and architectures across tasks have yet to emerge, and fundamental research questions remain open.
- To facilitate research on models that must access specific information in a knowledge source, the authors introduce KILT, a benchmark and library for Knowledge Intensive Language Tasks.
- KILT enables researchers to develop general-purpose models and evaluate them across multiple domains, testing hypotheses around task-agnostic memory and knowledge representations without indexing different large-scale textual corpora or writing new IO routines.
- The authors evaluate whether systems can provide evidence for their predictions
- With this aim, the authors augment every instance in KILT with provenance information in the form of textual spans in specific Wikipedia pages to corroborate the output.
- The authors contribute: 1. a publicly-available benchmark of knowledge-intensive tasks aligned to a single Wikipedia snapshot, to spur the development of general-purpose models and enable their comparison; 2. an open-source library to facilitate the development of new architectures for knowledge-intensive tasks; 3. a provenance indication for all instances in KILT, made more comprehensive with an annotation campaign, which makes it possible to jointly assess output accuracy and the ability to provide supporting evidence in the knowledge source; 4. a comparison of the performance of various modeling approaches, showing promising results for general baselines across all tasks
2 Knowledge Source:A main feature of the KILT benchmark is the use of a unified knowledge source that contains all information necessary for all tasks.
- Tasks provide provenance, defined as a set of textual spans in Wikipedia that contain evidence for producing an output given a specific input.
- These provenance spans range from single entities, short answers, sentences, paragraphs, to whole articles.
- The authors consider five tasks that use Wikipedia as a knowledge source for KILT: fact checking, open domain question answering, slot filling, entity linking, and dialogue.
- The diversity of these tasks challenges models to represent knowledge flexibly.
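The provenance idea above can be sketched as a small data structure. This is a minimal illustration, not the released KILT schema: the field names (`page_title`, `paragraph_id`, `start_character`, `end_character`) and the example question are assumptions made up for this sketch, and the helper `provenance_pages` is hypothetical.

```python
import json

# One instance as a single JSON line. All field names and values here are
# illustrative assumptions, not a confirmed description of the KILT files.
EXAMPLE_LINE = json.dumps({
    "id": "example-0",
    "input": "Who wrote Hamlet?",
    "output": [
        {
            "answer": "William Shakespeare",
            # Provenance: textual spans in Wikipedia that support the answer.
            "provenance": [
                {"page_title": "Hamlet", "paragraph_id": 1,
                 "start_character": 0, "end_character": 42},
            ],
        }
    ],
})


def provenance_pages(instance):
    """Return the set of Wikipedia page titles cited by any output."""
    pages = set()
    for output in instance.get("output", []):
        for span in output.get("provenance", []):
            pages.add(span["page_title"])
    return pages


instance = json.loads(EXAMPLE_LINE)
pages = provenance_pages(instance)
```

Keeping provenance attached to each output, rather than to the instance as a whole, is one way to allow several equally valid answers that are each supported by different evidence.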
3_1 Fact Checking:Fact checking verifies a claim against a collection of evidence.
- It requires deep knowledge about the claim and reasoning over multiple documents.
- The authors model multiple equally-valid provenance sets per label.
- 30% of claims have more than one equally-valid provenance and 16% require the combination of multiple evidence spans.
- For KILT, the authors merge the two versions of FEVER into a single resource and consider only supported and refuted claims.
- The authors exclude all claims classified as not having enough information since these instances have no evidence to assess the claim and cannot be mapped to the KILT knowledge source.
- The authors design KILT as an in-KB resource where each instance can be answered and corroborated by information in the knowledge source
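The label filtering described above amounts to a one-line filter. The sketch below assumes each claim is a dictionary with a `label` field and that the label strings are `SUPPORTS`, `REFUTES` and `NOT ENOUGH INFO`; both the layout and the helper name `keep_in_kb_claims` are assumptions for illustration.

```python
# Labels that can be corroborated by evidence in the knowledge source.
# The exact label strings are an assumption for this sketch.
KEPT_LABELS = {"SUPPORTS", "REFUTES"}


def keep_in_kb_claims(claims):
    """Drop claims that have no evidence to map to the knowledge source."""
    return [claim for claim in claims if claim["label"] in KEPT_LABELS]


claims = [
    {"claim": "claim A", "label": "SUPPORTS"},
    {"claim": "claim B", "label": "NOT ENOUGH INFO"},
    {"claim": "claim C", "label": "REFUTES"},
]
kept = keep_in_kb_claims(claims)
```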
3_2 Entity Linking:Entity Linking (EL) assigns a unique Wikipedia page to entities mentioned in text.
- The output is the title of the Wikipedia page for the entity mention plus provenance pointing to the entire page.
- The authors match Wikipedia pages specified in various datasets to the KILT knowledge source.
- The authors release such data in KILT format (9M train instances), following the splits of Wu et al (2019).
- Following Hoffart et al (2011b) the authors consider testa as dev and testb as test.
- WNED-WIKI (Guo and Barbosa, 2018) is a dataset automatically created by sampling documents from the 2013/06/06 Wikipedia dump, and balancing the difficulty of linking each mention.
- The authors randomly split the dataset into dev and test.
- WNED-CWEB (Guo and Barbosa, 2018) is a dataset created with the same strategy as WNED-WIKI, but sampling from the ClueWeb 2012 corpora annotated with the FACC1 system. The authors randomly split it into dev and test
3_3 Slot Filling:The goal of Slot Filling (SF) is to collect information on certain relations of entities from large collections of natural language texts.
- To consider an open-domain version of this dataset and align the input/output with the KILT interface, the authors reformatted this dataset as follows: (i) exclude negative pairs, since the authors consider the whole knowledge source as text and all questions can therefore be answered; (ii) group template questions by the subject-relation pair, and create a single datapoint for each; (iii) randomly split the set of relations, in line with the original dataset, into three disjoint sets train, dev (12 relations) and test (24 relations), so that systems are tested on relations never seen during training; (iv) use the subject entity as the query against Wikipedia titles for the first step of the mapping strategy; and (v) include all template questions in a meta field.
- The authors randomly select 5k facts for both dev and test set
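Steps (i), (ii), (iii) and (v) of the reformatting above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the input is assumed to be tuples of `(subject, relation, question, answer)` with `answer` set to `None` for negative pairs, and the function names, the output layout and the seed are hypothetical.

```python
import random


def group_template_questions(rows):
    """Build one datapoint per (subject, relation) pair.

    Negative pairs (answer is None) are skipped, and every template
    question is kept in a meta field.
    """
    grouped = {}
    for subject, relation, question, answer in rows:
        if answer is None:  # step (i): exclude negative pairs
            continue
        point = grouped.setdefault(
            (subject, relation),  # step (ii): group by subject-relation
            {"subject": subject, "relation": relation,
             "answers": [], "meta": {"template_questions": []}},
        )
        if answer not in point["answers"]:
            point["answers"].append(answer)
        point["meta"]["template_questions"].append(question)  # step (v)
    return list(grouped.values())


def split_relations(relations, n_dev=12, n_test=24, seed=0):
    """Step (iii): split relations into disjoint train, dev and test sets."""
    ordered = sorted(set(relations))
    if len(ordered) <= n_dev + n_test:
        raise ValueError("not enough relations to leave a train split")
    random.Random(seed).shuffle(ordered)
    dev = set(ordered[:n_dev])
    test = set(ordered[n_dev:n_dev + n_test])
    train = set(ordered[n_dev + n_test:])
    return train, dev, test
```

Splitting by relation rather than by datapoint is what guarantees that no test relation is seen during training.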
3_4 Open Domain Question Answering:Open domain Question Answering (Chen et al, 2017) is the task of producing the correct answer for a question, without a predefined location for the answer.
- Each question comes with an accompanying Wikipedia page with an annotated long answer and a short answer
- The authors consider both long and short answer spans as provenance.
- For each question-answer pair, a set of supporting sentences is provided, and the authors consider these as provenance.
- As the original work first collected question-answer pairs from the subreddit Explain Like I'm Five, the documents are not guaranteed to contain evidence.
- The authors collect annotations using Amazon Mechanical Turk, asking evaluators to select which supporting documents from Wikipedia can be used to answer the question.
- The authors treat these as gold provenance annotations for evaluation
3_5 Dialogue:Chitchat dialogue is the task of developing an engaging chatbot that can discuss a wide array of topics with a user, which often relies on topical, factual knowledge.
- The authors consider the conversation history as input and the utterance as output.
- Wizard of Wikipedia (Dinan et al, 2019) is a large dataset of conversations grounded with knowledge retrieved from Wikipedia.
- One speaker in the conversation must ground their utterances in a specific knowledge sentence, chosen from a Wikipedia page.
- The chosen sentence forms the provenance for KILT.
- The authors discard cases where the dataset does not contain provenance.
- The authors consider a full open-domain setting where no topic is provided for the conversation and the model must search over all of Wikipedia for knowledge at each dialogue turn.
- The authors use the unseen split for dev and test set
4 Evaluation Metrics:Various tasks in the KILT Benchmark need to be evaluated differently, which can make task-wide comparison challenging.
- For datasets that require more than one page of evidence (e.g., FEVER and HotpotQA), the authors use the lowest ranked page in each provenance set to determine its position and remove the other pages in the set from the rank
- For both metrics, the authors report the mean over all test datapoints.
- The authors only award Accuracy, EM, ROUGE-L, and F1 points to KILT-AC, KILT-EM, KILT-RL and KILT-F1 respectively, if the R-precision is 1
- This is equivalent to awarding points if the system finds a complete set of provenance Wikipedia pages for at least one ground truth output given the input.
- The authors choose this metric to emphasize that systems must be able to explain their output with proper evidence, not just answer
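The gating rule above can be sketched in a few lines. The sketch uses the standard information-retrieval definition of R-precision (the fraction of the R gold pages that appear among the top R retrieved pages) and takes the best value over the alternative provenance sets; applying the definition this way, and the function names, are assumptions rather than the benchmark's reference implementation.

```python
def r_precision(ranked_pages, provenance_sets):
    """Best R-precision over the alternative gold provenance sets.

    ranked_pages: retrieved page identifiers, best first, without repeats.
    provenance_sets: list of collections of gold page identifiers.
    """
    best = 0.0
    for gold in provenance_sets:
        gold = set(gold)
        if not gold:
            continue
        top_r = ranked_pages[:len(gold)]  # top R pages, R = size of gold set
        hits = len(gold.intersection(top_r))
        best = max(best, hits / len(gold))
    return best


def kilt_score(downstream_score, ranked_pages, provenance_sets):
    """Award the downstream score only if R-precision equals 1."""
    if r_precision(ranked_pages, provenance_sets) == 1.0:
        return downstream_score
    return 0.0
```

With a single gold page the function reduces to Precision@1, while a two-page provenance set requires both pages in the top two positions before any downstream points are awarded.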
5 Baselines:The KILT tasks provide a dual challenge of retrieving information and conditioning upon that to create an output.
- Approaches to the KILT Benchmark should be able to generalize to many different tasks, as developing model architectures that can represent knowledge generally is a valuable direction.
- The authors use the public model pre-trained on FEVER, and consider not enough information predictions as false.
- For Open Domain QA and Slot Filling, the authors use DPR combined with the pre-trained BERT-based extractive reading comprehension model of Karpukhin et al (2020).
- The authors treat all KILT tasks as generative, relying on the knowledge accumulated by the model while pretraining, with no retrieval (similar to Roberts et al (2020)).
- For the BART+DPR baseline, the authors follow Petroni et al (2020) to retrieve and prepend the top-3 passages from DPR for each input sample and use context-enhanced training data to fine-tune a BART model.
- The authors will continue adding baselines and pre-trained models to the library, as well as logic to interchange and experiment with different modular components
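The context-enhanced input used by the BART+DPR baseline can be illustrated with a short helper. Only the idea of prepending the top-3 retrieved passages comes from the text above; the separator string, the function name and the assumption that passages arrive as a ranked list of strings are choices made for this sketch.

```python
def build_context_enhanced_input(query, ranked_passages, k=3, sep=" [SEP] "):
    """Prepend the top-k retrieved passages to the original input.

    ranked_passages is assumed to be ordered best first; the separator
    is a placeholder and not necessarily the one used by the authors.
    """
    top_passages = list(ranked_passages[:k])
    return sep.join(top_passages + [query])


enhanced = build_context_enhanced_input(
    "query text",
    ["passage one", "passage two", "passage three", "passage four"],
)
```

The same function would be applied to training and evaluation inputs, so the generator is fine-tuned on the same kind of context-enhanced strings it sees at test time.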
6 Results:The BART+DPR baseline, which incorporates an explicit retrieval step in addition to the generative pretraining, works well.
- It outperforms some of the task-specific solutions, and gets close to others.
- By formulating Entity Linking within KILT, the authors can evaluate the ability of seq2seq models at this task.
- They perform surprisingly well, even without any explicit access to knowledge (i.e., BART and T5).
- The authors do not report KILT scores for BART and T5, since answers are generated solely from the input with no explicit retrieval and there is no straightforward way to access provenance for each prediction.
- The generally low absolute numbers leave large room for improvement for systems able to provide the correct output and successfully justify their decision
7 Discussion:There are custom solutions that can simplify the slot filling task.
- Subject entities can be used for lookups by title in Wikipedia to retrieve knowledge, and structured human-curated resources could be used to get all answers right.
- The authors are interested in testing if a general model can extract attributes about specific entities from a large body of text.
- The provenance to justify each system prediction can come from anywhere, including a different system, and this is difficult to detect.
- The authors' provenance might not be exhaustive—given the redundancy of information in Wikipedia there could be other pages with the knowledge needed to solve a KILT instance.
- The authors conduct an annotation campaign to mitigate the problem
8 Related Work:Several natural language benchmarks have been introduced to track and support NLP progress, including natural language understanding (Wang et al., 2018, 2019).
- The authors focus on multi-domain tasks that need to seek knowledge in a large body of documents to produce an output.
- Although several tasks and resources define large-scale external knowledge sources, including the TAC-KBP challenges (McNamee and Dang, 2009; Ji et al, 2010; Surdeanu, 2013; Surdeanu and Ji, 2014), ARC (Clark et al, 2018), TriviaQA-web (Joshi et al, 2017), QuasarT (Dhingra et al, 2017), WebQuestions (Berant et al, 2013) and ComplexWebQuestions (Talmor and Berant, 2018), in KILT the authors exclusively consider publicly available Wikipedia-based datasets in order to merge and unify the knowledge source
9 Conclusion:The authors introduce KILT, a benchmark for assessing models that need to condition on specific knowledge in a defined snapshot of Wikipedia to solve tasks spanning five domains.
- The goal is to catalyze and facilitate research towards general and explainable models equipped with task-agnostic representations of knowledge.
- The authors' experiments show promising results for a general solution combining dense retrieval and seq2seq generation, although there is large room for improvement.
- The authors find that the quality of the provenance returned by current models is generally low.
- The authors plan to explore multi-task learning to exploit synergies between KILT tasks and datasets in the future, and to develop general approaches for representing large-scale textual knowledge sources that are useful for multiple downstream tasks
- Table1: Datasets and tasks considered in KILT
- Table2: Downstream performance on the test data. Baselines are grouped by task-specific (ts) and general with implicit (im) or explicit (ex) knowledge access. Task-specific solutions cannot be generally applied to all datasets in KILT, hence there are empty cells in the top part of the table. We report the typical metric to assess performance for each dataset, specified in the first row
- Table3: Page-level R-Precision on test data. For DPR, we additionally report the performance after the BERT-based classifier (for FE) or reader (for NQ,HP,TR) re-ranked relevant pages (i.e., DPR + BERT). R-Precision is equivalent to Precision@1 for all datasets except FEV and HoPo that require multi-hop
- Table4: KILT scores on the test data. We do not report KILT scores for baselines with implicit knowledge access since no provenance information is returned by them. We report the KILT version of downstream metrics, specified in the first row (to save space we abbreviate KILT-RL and KILT-F1). KILT scores are computed by awarding points only if provenance pages are found (i.e., R-Precision = 1)
- Table5: Baselines considered and total number of their trainable parameters. Non trainable (nt) parameters and index (idx) sizes are also reported
- Table6: Datasets statistics. APS refers to the average number of textual spans in each provenance set—for most of the datasets a single span is sufficient to provide enough evidence while FEV and HoPo might require more (hence they require multi-hop reasoning). APN indicates the average number of equally valid provenance sets for each instance while APP the average number of Wikipedia pages overall in the provenance (note that multiple spans might refer to the same Wikipedia page). Finally AAN reports the average number of equally valid gold answers per instance. We additionally report the size of the train, dev and test split for each dataset
- Table8: AIDA CoNLL-YAGO
- Table9: WNED-WIKI
- Table10: WNED-CWEB
- Table12: Zero Shot RE
- Table13: Natural Questions model
- Table14: HotpotQA
- Table15: TriviaQA
- Table16: ELI5 model
- Table17: Wizard of Wikipedia
- Several natural language benchmarks have been introduced to track and support NLP progress, including natural language understanding (Wang et al., 2018, 2019), multitask question answering (McCann et al, 2018), reading comprehension (Dua et al, 2019), question understanding (Wolfson et al, 2020), and dialogue (Shuster et al, 2019). We focus on multi-domain tasks that need to seek knowledge in a large body of documents to produce an output. Although there exist several tasks and resources that define large-scale external knowledge sources, including the TAC-KBP challenges (McNamee and Dang, 2009; Ji et al, 2010; Surdeanu, 2013; Surdeanu and Ji, 2014), ARC (Clark et al, 2018), TriviaQA-web (Joshi et al, 2017), QuasarT (Dhingra et al, 2017), WebQuestions (Berant et al, 2013) and ComplexWebQuestions (Talmor and Berant, 2018), in KILT we exclusively consider publicly available Wikipedia-based datasets in order to merge and unify the knowledge source.
- We remove from the dev and test sets all outputs for which the BLEU score is lower than a threshold for at least one provenance span (we use 0.5 as threshold) — this is meant to ensure high quality mappings in the evaluation sets — discarding on average 18% of test and dev data (for all tasks except entity linking)
- 30% of claims have more than one equally-valid provenance and 16% require the combination of multiple evidence spans
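The BLEU-based filtering of the dev and test sets mentioned above can be sketched as a simple predicate over provenance spans. The sketch assumes each span already carries a precomputed `bleu_score` field and that outputs without provenance are kept; both points, and the function name, are assumptions for illustration. Only the 0.5 threshold comes from the text.

```python
BLEU_THRESHOLD = 0.5  # threshold stated in the text


def filter_outputs(instance, threshold=BLEU_THRESHOLD):
    """Drop every output with at least one provenance span below threshold.

    An output with no provenance spans is kept (an assumption made here).
    """
    kept = [
        output for output in instance["output"]
        if all(span["bleu_score"] >= threshold
               for span in output.get("provenance", []))
    ]
    return {**instance, "output": kept}


instance = {
    "id": "example-0",
    "output": [
        {"answer": "a", "provenance": [{"bleu_score": 0.9}, {"bleu_score": 0.7}]},
        {"answer": "b", "provenance": [{"bleu_score": 0.9}, {"bleu_score": 0.2}]},
    ],
}
filtered = filter_outputs(instance)
```

Because a single poorly mapped span is enough to remove an output, the filter favors mapping quality over coverage in the evaluation sets.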
Study subjects and analysis
KILT aims to lower the entry barrier for such research by formulating several knowledge-intensive NLP tasks with respect to a common interface and the same unified knowledge source: a single Wikipedia snapshot. The KILT benchmark consists of eleven datasets spanning five distinct tasks, and includes the test set for all datasets considered. An important aim of KILT is to cover many different ways of seeking knowledge
popular EL datasets: 3
To map the provenance (whole Wikipedia page), we simply match Wikipedia pages specified in various datasets to the KILT knowledge source. We consider three popular EL datasets in KILT, two of which do not contain a train set but should be assessed in a zero-shot fashion. Note that, in addition to the AY2 train set, the whole knowledge source can be used as training data by exploiting hyperlinks
input-output pairs: 25
KILT datasets' interface: each dataset is represented as a JSON Line file. Entity linking BART predictions: schematic of 25 input-output pairs, condensed; in each one a single entity is tagged. BLEU score distribution in train data per provenance. For TriviaQA, we try to map all object aliases for the answer. FEVER has the oldest Wikipedia snapshot. We discard on average 17.9% of dev and 17.65% of test data
- Thorne et al. (2018a); Hoffart et al. (2011b); Guo and Barbosa (2018); Guo and Barbosa (2018); Elsahar et al. (2018); Levy et al. (2017); Kwiatkowski et al. (2019); Yang et al. (2018); Joshi et al. (2017); Fan et al. (2019b); Dinan et al. (2019)
- Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. Flair: An easy-to-use framework for state-of-the-art nlp. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59.
- Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1533–1544.
- Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. ACL.
- Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Bhuwan Dhingra, Kathryn Mazaitis, and William W Cohen. 2017. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904.
- Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of wikipedia: Knowledge-powered conversational agents. Proceedings of the International Conference on Learning Representations (ICLR).
- Dheeru Dua, Ananth Gottumukkala, Alon Talmor, Sameer Singh, and Matt Gardner. 2019. Orb: An open reading benchmark for comprehensive evaluation of machine reading comprehension. arXiv preprint arXiv:1912.12598.
- Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Elena Simperl, and Frederique Laforest. 2018. T-rex: A large scale alignment of natural language with knowledge base triples. LREC.
- Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. 2019a. Using local knowledge graph construction to scale Seq2Seq models to multi-document inputs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4186–4196, Hong Kong, China. Association for Computational Linguistics.
- Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019b. ELI5: long form question answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 3558–3567. Association for Computational Linguistics.
- Paolo Ferragina and Ugo Scaiella. 2011. Fast and accurate annotation of short texts with wikipedia pages. IEEE software, 29(1):70–75.
- Zhaochen Guo and Denilson Barbosa. 2018. Robust named entity disambiguation with random walks. Semantic Web, 9(4):459–479.
- Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Realm: Retrieval-augmented language model pretraining.
- Johannes Hoffart, Fabian M Suchanek, Klaus Berberich, Edwin Lewis-Kelham, Gerard De Melo, and Gerhard Weikum. 2011a. Yago2: exploring and querying world knowledge in time, space, context, and many languages. In Proceedings of the 20th international conference companion on World wide web, pages 229–232.
- Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011b. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.
- Heng Ji, Ralph Grishman, Hoa Trang Dang, Kira Griffitt, and Joe Ellis. 2010. Overview of the tac 2010 knowledge base population track. In Third text analysis conference (TAC 2010), volume 3, pages 3–3.
- Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
- Vladimir Karpukhin, Barlas Oguz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
- Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
- Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. arXiv preprint arXiv:1906.00300.
- Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. CoNLL.
- Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, and Luke Zettlemoyer. 2020a. Pre-training via paraphrasing. arXiv preprint arXiv:2006.15020.
- Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. ArXiv, abs/1910.13461.
- Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020b. Retrieval-augmented generation for knowledge-intensive nlp tasks.
- Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Christopher D Manning, Hinrich Schütze, and Prabhakar Raghavan. 2008. Introduction to information retrieval. Cambridge university press.
- Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.
- Paul McNamee and Hoa Trang Dang. 2009. Overview of the tac 2009 knowledge base population track. In Text Analysis Conference (TAC), volume 17, pages 111–113. National Institute of Standards and Technology (NIST) Gaithersburg, Maryland.
- Alexander H Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, and Jason Weston. 2017. Parlai: A dialog research software platform. arXiv preprint arXiv:1705.06476.
- Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6859–6866.
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
- Fabio Petroni, Luciano del Corro, and Rainer Gemulla. 2015. Core: Context-aware open relation extraction with factorization machines. In EMNLP. Assoc. for Computational Linguistics.
- Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2020. How context affects language models’ factual predictions. AKBC.
- Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? EMNLP.
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019a. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019b. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+
- Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 74–84.
- Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910.
- Erik F Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
- Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, and Jason Weston. 2019. The dialogue dodecathlon: Open-domain knowledge and image grounded conversational agents. arXiv preprint arXiv:1911.03768.
- Mihai Surdeanu. 2013. Overview of the tac2013 knowledge base population evaluation: English slot filling and temporal slot filling. In TAC.
- Mihai Surdeanu and Heng Ji. 2014. Overview of the english slot filling track at the tac2014 knowledge base population evaluation. In Proc. Text Analysis Conference (TAC2014).
- Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651.
- James Thorne and Andreas Vlachos. 2020. Avoiding catastrophic forgetting in mitigating model biases in sentence-pair classification with elastic weight consolidation. arXiv preprint arXiv:2004.14366.
- James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018a. FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.
- James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018b. The fact extraction and verification (fever) shared task. EMNLP.
- James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2019. The FEVER2.0 shared task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), pages 1–6.
- Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.
- Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3261–3275.
- Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
- Thomas Wolf, L Debut, V Sanh, J Chaumond, C Delangue, A Moi, P Cistac, T Rault, R Louf, M Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
- Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics, 8:183–198.
- Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2019. Zero-shot entity linking with dense entity retrieval. arXiv preprint arXiv:1911.03814.
- Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2019. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. arXiv preprint arXiv:1912.09637.
- Deshraj Yadav, Rishabh Jain, Harsh Agrawal, Prithvijit Chattopadhyay, Taranjeet Singh, Akash Jain, Shiv Baran Singh, Stefan Lee, and Dhruv Batra. 2019. Evalai: Towards better evaluation systems for ai agents.