Constrained Fact Verification for FEVER

Adithya Pratapa
Sai Muralidhar Jayanthi
Kavya Nerella

EMNLP 2020.

Keywords:
verification model, Fact Extraction and VERification, claim verification, natural language inference, closed world

Abstract:

Fact-verification systems are well explored in the NLP literature, with growing attention owing to shared tasks like FEVER. Though the task requires reasoning over extracted evidence to verify a claim's factuality, there is little work on understanding the reasoning process. In this work, we propose a new methodology for fact-verification [...]

Introduction
  • A rapid increase in the spread of misinformation on the Internet has necessitated automated solutions to determine the validity of a given piece of information
  • To this end, the Fact Extraction and VERification (FEVER) shared task (Thorne et al., 2018a) introduced a dataset for evidence-based fact verification.
  • Several recent works (Liu et al., 2020; Soleimani et al., 2020; Zhao et al., 2020) leverage representations from large pre-trained language models (LMs) like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) to achieve state-of-the-art results on FEVER.
  • Lee et al. (2020) developed a fact verification system based solely on large pre-trained LMs and demonstrated its superior zero-shot performance on FEVER compared to a random baseline.
Highlights
  • A rapid increase in the spread of misinformation on the Internet has necessitated automated solutions to determine the validity of a given piece of information
  • We propose various pre-training strategies for large pre-trained language models (LMs) to induce a closed-world setting during fact verification in Fact Extraction and VERification (FEVER)
  • We found that models trained with the Original curriculum performed better than those trained with our proposed curricula (CWA, Skip–fact), except on symmetric FEVER, where Transformer-XH with Skip–fact does slightly better
  • We identify a critical issue with existing claim verification systems, especially the recent models that utilize large pre-trained LMs
  • We propose to perform fact verification under a closed-world setting and present our results on the task of FEVER
  • While it is hard to evaluate the reliance on implicit pre-trained knowledge, our initial results indicate that such reliance is helpful for FEVER
Methods
  • The authors present their methodology for training constrained fact-verification models for the FEVER shared task.
  • Many state-of-the-art FEVER models use the standard BERT encoder (Devlin et al., 2019) to encode a concatenation of claim and evidence (a minimal sketch of this encoding follows this list).
  • An example evidence sentence (sentence 2 of the Wikipedia page Comedian): "A popular saying, variously quoted but generally attributed to Ed Wynn, is, 'A comic says funny things; a comedian says things funny', which draws a distinction between how much of the comedy can be attributed to verbal content and how much to acting and persona."
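
As a rough illustration of the encoding step above, here is a minimal sketch of claim-evidence concatenation with a BERT classifier, assuming the HuggingFace transformers library. The model name, example strings, and three-way FEVER label set are illustrative; the paper's exact BERT–concat architecture may differ.

```python
# Sketch: encode [CLS] claim [SEP] evidence [SEP] with BERT and classify
# into the three FEVER labels (SUPPORTS, REFUTES, NOT ENOUGH INFO).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # SUPPORTS / REFUTES / NOT ENOUGH INFO
)

claim = "Ed Wynn was a comedian."
# Evidence sentences (Wiki-title plus sentence) concatenated into one string.
evidence = "(Comedian) A comic says funny things; a comedian says things funny."

# The tokenizer builds the sentence-pair input with segment ids for BERT.
inputs = tokenizer(claim, evidence, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()  # index into the three-way label set
```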
Results
  • Given a question and a context comprising a set of simple facts and rules in natural language, models are expected to reason only over the provided context, thereby emulating the ability to perform closed-world reasoning. Clark et al. (2020) propose a synthetic training dataset (henceforth the RuleTaker dataset) to fine-tune pre-trained models like RoBERTa; a hypothetical instance illustrating this format follows this list.
  • They observe high performance (≥95% accuracy) on the synthetic test set, motivating the authors to adapt a similar training methodology for FEVER
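
To make the closed-world format concrete, below is a hypothetical RuleTaker-style instance (after Clark et al., 2020). The field names and example sentences are illustrative, not the official dataset schema.

```python
# Hypothetical RuleTaker-style instance: the model must answer from the
# stated facts and rules alone. Under the closed-world assumption (CWA),
# anything not derivable from the context is treated as false.
instance = {
    "context": [
        "Anne is kind.",                            # fact
        "Bob is rough.",                            # fact
        "If someone is kind then they are nice.",   # rule
    ],
    "questions": [
        {"text": "Anne is nice.", "label": True},   # derivable via the rule
        {"text": "Bob is nice.", "label": False},   # not derivable -> false under CWA
    ],
}
```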
Conclusion
  • The authors identify a critical issue with existing claim verification systems, especially the recent models that utilize large pre-trained LMs.
  • While it is hard to evaluate the reliance on implicit pre-trained knowledge, their initial results indicate that such reliance is helpful for FEVER
Tables
  • Table 1: Example from the anonymized FEVER dataset. Each evidence constitutes the Wiki-title and a corresponding sentence. The two named entities (ent0, ent1) are highlighted. A rough sketch of this anonymization follows the table list.
  • Table 2: Example from the compiled RuleTaker-CWA and RuleTaker-Skip–fact datasets
  • Table 3: Distribution of the compiled RuleTaker datasets
  • Table 4: An example from the FEVER dataset. Wikipedia page titles for the evidence sentences are given in parentheses. Even though the original dataset contains both evidence sentences within a single evidence set, we can label the given claim using just the first evidence sentence. Such cases would result in erroneous labels when creating the Skip–fact version of FEVER
  • Table 5: RuleTaker results on individual in-domain test sets. Note that these are separate test sets
  • Table 6: Label accuracy on the Standard (Std.), Symmetric (Symm.), and Anonymized (Anon.) dev sets. The best result in each row (evaluation set) is highlighted
  • Table 7: Performance of the BERT–concat model trained on the anonymized FEVER train set, reported as accuracy on the anonymized dev set
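
The anonymization in Table 1 replaces named entities shared between claim and evidence with placeholder tokens (ent0, ent1). Below is a rough sketch of one way to do this, assuming spaCy for named entity recognition; the paper's actual anonymization procedure may differ in detail.

```python
# Sketch: replace each distinct named entity with a shared placeholder so
# the same entity maps to the same entN token in both claim and evidence.
import spacy

nlp = spacy.load("en_core_web_sm")

def anonymize(claim: str, evidence: str):
    mapping = {}  # entity surface form -> placeholder token

    def replace(text: str) -> str:
        doc = nlp(text)
        out = text
        # Replace from the end so earlier character offsets stay valid.
        for ent in reversed(doc.ents):
            placeholder = mapping.setdefault(ent.text, f"ent{len(mapping)}")
            out = out[: ent.start_char] + placeholder + out[ent.end_char :]
        return out

    return replace(claim), replace(evidence)

print(anonymize("Ed Wynn was a comedian.", "Ed Wynn said funny things."))
# e.g. -> ('ent0 was a comedian.', 'ent0 said funny things.')
```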
Study subjects and analysis
  • Datasets (3): For each of the three models (BERT–concat, Transformer-XH, and KGAT), we show results on the three training curricula (Original, CWA, and Skip–fact) in Table 6. We evaluate all trained models on three datasets: the official dev set of the FEVER task (Std.), symmetric FEVER v0.2 (Schuster et al., 2019) (Symm.), and our proposed anonymized version of Std. (Anon.). We use the BertAdam optimizer with a learning rate of 3e-5, train for ten epochs, and choose the best checkpoint based on dev label accuracy (a minimal sketch of this setup follows this list).
  • Adversarial claim-evidence pairs (3): Symmetric FEVER (Schuster et al., 2019) constructs three adversarial claim-evidence pairs based on each original pair from the FEVER dev set. On most evaluation sets, models trained with the Original curriculum performed better than those trained with our proposed curricula (CWA, Skip–fact), except on symmetric FEVER, where Transformer-XH with Skip–fact does slightly better.
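
Below is a minimal sketch of the reported fine-tuning setup (learning rate 3e-5, ten epochs, best checkpoint by dev label accuracy). The paper reports the BertAdam optimizer; torch's AdamW is used here as a stand-in, and a HuggingFace-style model interface (batches with a "labels" key, outputs with .loss and .logits) is assumed.

```python
import copy
import torch
from torch.optim import AdamW

def finetune(model, train_loader, dev_loader, epochs=10, lr=3e-5):
    """Fine-tune and keep the checkpoint with the best dev label accuracy."""
    optimizer = AdamW(model.parameters(), lr=lr)
    best_acc, best_state = 0.0, None
    for _ in range(epochs):
        model.train()
        for batch in train_loader:
            loss = model(**batch).loss  # assumes a HuggingFace-style model
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Checkpoint selection: compute label accuracy on the dev set.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for batch in dev_loader:
                preds = model(**batch).logits.argmax(dim=-1)
                correct += (preds == batch["labels"]).sum().item()
                total += batch["labels"].numel()
        acc = correct / total
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_acc
```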

References
  • Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over language. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-PRICAI-20. International Joint Conferences on Artificial Intelligence Organization.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.
  • Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. 2018. UKP-Athene: Multi-sentence textual entailment for claim verification. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 103–108, Brussels, Belgium. Association for Computational Linguistics.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1693–1701, Cambridge, MA, USA. MIT Press.
  • Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.
  • Nayeon Lee, Belinda Li, Sinong Wang, Wen-tau Yih, Hao Ma, and Madian Khabsa. 2020. Language models as fact checkers? In Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER), pages 36–41, Online. Association for Computational Linguistics.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2020. Fine-grained fact verification with kernel graph attention network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7342–7351, Online. Association for Computational Linguistics.
  • Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.
  • Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana. Association for Computational Linguistics.
  • Tal Schuster, Darsh Shah, Yun Jie Serene Yeo, Daniel Roberto Filizzola Ortiz, Enrico Santus, and Regina Barzilay. 2019. Towards debiasing fact verification models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3419–3425, Hong Kong, China. Association for Computational Linguistics.
  • Amir Soleimani, Christof Monz, and Marcel Worring. 2020. BERT for evidence retrieval and claim verification. In Advances in Information Retrieval, pages 359–366, Cham. Springer International Publishing.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018a. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2019.