Explainable Automated Fact Checking for Public Health Claims

Neema Kotonya
Francesca Toni

EMNLP 2020.

Other Links: arxiv.org
Keywords:
fact checking, local coherence, strong global coherence, Alzheimer disease, case study

Abstract:

Fact-checking is the task of verifying the veracity of claims by assessing their assertions against credible evidence. The vast majority of fact-checking studies focus exclusively on political claims. Very little research explores fact-checking for other topics, specifically subject matters for which expertise is required. We present the first study of explainable fact-checking for claims which require specific expertise. For our case study we choose the setting of public health.

Introduction
  • A great amount of progress has been made in the area of automated fact-checking
  • This includes more accurate machine learning models for veracity prediction and datasets of both naturally occurring (Wang, 2017; Augenstein et al., 2019; Hanselowski et al., 2019) and human-crafted (Thorne et al., 2018) fact-checking claims, against which the models can be evaluated.
  • Unlike political and general misinformation, specific expertise is required in order to fact-check claims in the public health domain.
  • Like political misinformation, the public health variety is potentially very dangerous, because it can put people in imminent danger and risk lives
Highlights
  • A great amount of progress has been made in the area of automated fact-checking
  • Unlike political and general misinformation, specific expertise is required in order to fact-check claims in this domain
  • For our case study we examine the public health context
  • We introduce a framework for generating explanations and veracity prediction specific to public health fact-checking
  • Whereas Sokol and Flach (2019) discuss coherence in general, we provide concrete definitions and use them to evaluate our methods for explaining veracity predictions for public health claims (a minimal coherence-check sketch follows this list)
  • We explored fact-checking for claims for which specific expertise is required to produce a veracity prediction and explanations (i.e., judgments)
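One concrete way to operationalize such a coherence check is with an off-the-shelf natural language inference model. The sketch below is illustrative only: the checkpoint roberta-large-mnli and the pairwise non-contradiction test are our assumptions, not the paper's formal definitions of its coherence properties.

```python
# Illustrative local-coherence check: no two sentences of an explanation should
# contradict each other. Uses an off-the-shelf NLI model (roberta-large-mnli);
# the paper defines its coherence properties formally and in more detail.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def locally_coherent(sentences):
    """Return True if no pair of explanation sentences is mutually contradictory."""
    for i, premise in enumerate(sentences):
        for hypothesis in sentences[i + 1:]:
            result = nli([{"text": premise, "text_pair": hypothesis}])[0]
            if result["label"] == "CONTRADICTION":
                return False
    return True

print(locally_coherent([
    "Nothing in the Affordable Care Act sets an upper age limit on Medicare coverage.",
    "The Affordable Care Act caps Medicare coverage at age 76.",
]))  # expected: False
```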
Methods
  • The authors describe in detail the methods they employed for devising automated fact-checking models.
  • The authors trained two fact-checking models: a classifier for veracity prediction, and a second summarization model for generating fact-checking explanations
  • The former returns the probability of an input claim text belonging to one of four classes: true, false, unproven, mixture (a sketch of this interface follows this list).
  • Full details of the chosen hyperparameters and the computing infrastructure employed can be found in Appendix A.2.
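A minimal sketch of such a four-class veracity predictor, assuming a Hugging Face transformers fine-tuning setup; the checkpoint name and all settings here are illustrative assumptions, not the authors' configuration (which is given in Appendix A.2 of the paper):

```python
# Sketch of a four-class veracity classifier (true/false/unproven/mixture).
# Assumes the Hugging Face `transformers` library; `bert-base-uncased` is a
# placeholder checkpoint, not the paper's exact model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["true", "false", "unproven", "mixture"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

def predict_veracity(claim, evidence):
    """Return a probability for each veracity class, given claim and evidence text."""
    inputs = tokenizer(claim, evidence, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    return {label: probs[i].item() for i, label in enumerate(LABELS)}
```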
Results
  • The authors conducted experiments to evaluate the performance of both the veracity predictor and the explanation generator (a metric-computation sketch follows the example below).
  • Example claim: Under Obamacare, patients 76 and older must be admitted to the hospital by their primary care physicians in order to be covered by Medicare.
  • Explanation: Obamacare does not require that patients 76 and older be admitted to the hospital by their primary care physicians in order to be covered by Medicare.
  • What’s true: nothing in the Affordable Care Act requires that a primary care physician admit patients 76 or older to a hospital in order for their hospital care to be treated under Medicare.
  • What’s false: none of the provisions or rules puts an upper age limit on Medicare coverage.
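The veracity-prediction metrics reported in Table 4 (precision, recall, macro F1 and accuracy) are standard; a minimal sketch using scikit-learn, with made-up gold and predicted labels:

```python
# Computing precision, recall, macro F1 and accuracy with scikit-learn.
# `y_true` and `y_pred` below are illustrative labels, not the paper's data.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["false", "true", "mixture", "unproven", "false"]
y_pred = ["false", "true", "false", "unproven", "false"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
accuracy = accuracy_score(y_true, y_pred)
print(f"Pr. {precision:.3f}  Rc. {recall:.3f}  F1 {f1:.3f}  Acc. {accuracy:.3f}")
```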
Conclusion
  • Conclusion and Future Work

    In this paper, the authors explored fact-checking for claims for which specific expertise is required to produce a veracity prediction and explanations (i.e., judgments).
Objectives
  • The system for veracity prediction the authors aim to produce must fulfil two requirements: (1) it should provide a human-understandable explanation for the fact-checking prediction, and (2) that judgement should be understandable for people who do not have expertise in the subject domain.
Tables
  • Table1: Example of claims and explanations for PUBHEALTH dataset entries. Terms from the public health glossary which appear in the claims and explanations are in bold
  • Table2: Summary of the distribution of true (tru.), false (fal.), mixture (mix.) and unproven (unp.) veracity labels in PUBHEALTH, across the original sources from which data originated
  • Table3: Comparison of readability of claims presented in large fact-checking datasets (i.e., those with > 10K claims). We compute the mean and standard deviation of Flesch-Kincaid and Dale-Chall scores of claims for LIAR (Wang, 2017), FEVER (Thorne et al., 2018), MultiFC (Augenstein et al., 2019), FAKENEWSNET (Shu et al., 2019b), and also our own fact-checking dataset. The sample sizes used for evaluation are: LIAR: 12,791; MultiFC: 34,842; FAKENEWSNET: 23,196; FEVER: 145,449; and 11,832 for our dataset (a sketch for computing these readability scores follows this list)
  • Table4: Veracity prediction results for the two baselines and four BERT-based models on the test set. Model performance is assessed against precision (Pr.), recall (Rc.), macro F1, and accuracy (Acc.) metrics
  • Table5: ROUGE-1 (R1), ROUGE-2 (R2) and ROUGE-L (RL) F1 scores for explanations generated via our two explanation models (a ROUGE-scoring sketch follows this list)
  • Table6:
  • Table7: Format of explanations scraped from fact-checking (f), news (n), and news review (r) websites
  • Table8: Examples of tag metadata for entries in the PUBHEALTH dataset
  • Table9: These are the four standardized labels we defined for veracity prediction (left) and lists (right) of the original fact-checking labels provided by the fact-checking and news review websites we scraped, mapped to our four standardized labels
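The Table 3 readability statistics can be reproduced with any standard implementation of the Flesch-Kincaid (Kincaid et al., 1975) and Dale-Chall (Chall and Dale, 1995) formulas; a minimal sketch assuming the textstat package (the paper does not name its implementation):

```python
# Mean and standard deviation of Flesch-Kincaid grade and Dale-Chall score
# over a list of claims, as in Table 3. Assumes the `textstat` package; the
# two claims below are illustrative stand-ins for a dataset's claim texts.
import statistics
import textstat

claims = [
    "Expired boxes of cake and pancake mix are dangerously toxic.",
    "A new study links a popular sweetener to an increased risk of stroke.",
]

fk = [textstat.flesch_kincaid_grade(c) for c in claims]
dc = [textstat.dale_chall_readability_score(c) for c in claims]

print(f"Flesch-Kincaid: {statistics.mean(fk):.2f} +/- {statistics.stdev(fk):.2f}")
print(f"Dale-Chall:     {statistics.mean(dc):.2f} +/- {statistics.stdev(dc):.2f}")
```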
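Likewise, the Table 5 explanation scores use ROUGE F1 (Lin, 2004); a sketch assuming Google's rouge-score package, with illustrative reference and generated strings:

```python
# ROUGE-1 / ROUGE-2 / ROUGE-L F1 between a gold and a generated explanation.
# Assumes the `rouge-score` package; the two strings are illustrative only.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "Nothing in the Affordable Care Act sets an upper age limit on Medicare coverage."
generated = "The Affordable Care Act does not put an age limit on Medicare coverage."

for name, score in scorer.score(reference, generated).items():
    print(f"{name}: F1 = {score.fmeasure:.3f}")
```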
Related work
  • A number of recent works in automated fact-checking look at various formulations of fact-checking and its analogous tasks (Ferreira and Vlachos, 2016; Hassan et al., 2017; Zlatkova et al., 2019). In this paper, we choose to focus on two specific aspects of concern to us, which have not been thoroughly explored in the literature: domain-specific, expertise-based claim verification, and explainability for automated fact-checking predictions.

    2.1 Language Representations for Health

Fewer language resources exist for medical and scientific applications of NLP than for other NLP application settings, e.g., social media analysis, NLP for law, and computational journalism and fact-checking. We consider resources for the medical and scientific domains below.

There are a number of open-source pre-trained language models for NLP applications in the scientific and biomedical domains. The most recent of these pre-trained models are based on the BERT language model (Devlin et al., 2019). One example is BIOBERT, which is fine-tuned for the biomedical setting (Lee et al., 2020). BIOBERT is trained on abstracts from PubMed and full article texts from PubMed Central. BIOBERT achieves higher accuracy than BERT on named entity recognition, relation extraction and question answering in the biomedical domain.
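As a pointer for experimentation, BIOBERT checkpoints can be loaded through the Hugging Face transformers library; the checkpoint name dmis-lab/biobert-v1.1 below is our assumption of a publicly hosted release, not something specified in this paper:

```python
# Loading a BioBERT checkpoint (Lee et al., 2020) for feature extraction.
# `dmis-lab/biobert-v1.1` is an assumed Hugging Face Hub model name.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")

text = "Statins lower the risk of cardiovascular disease."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    sentence_vector = model(**inputs).last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
```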
Funding
  • The first author is supported by a doctoral training grant from the UK Engineering and Physical Sciences Research Council (EPSRC)
References
  • Naser Ahmadi, Joohyung Lee, Paolo Papotti, and Mohammed Saeed. 2019. Explainable fact checking with probabilistic answer set programming. arXiv preprint arXiv:1906.09198.
  • Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, and Isabelle Augenstein. 2020. Generating fact checking explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7352–7364, Online. Association for Computational Linguistics.
  • Isabelle Augenstein, Christina Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Christian Hansen, and Jakob Grue Simonsen. 2019. MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4685–4697, Hong Kong, China. Association for Computational Linguistics.
  • Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  • Jeanne Sternlicht Chall and Edgar Dale. 1995. Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1163–1168, San Diego, California. Association for Computational Linguistics.
  • Mohamed H. Gad-Elrab, Daria Stepanova, Jacopo Urbani, and Gerhard Weikum. 2019. ExFaKT: A framework for explaining facts over knowledge graphs and text. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 87–95. ACM.
  • Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.
  • Peter Grabitz, Yuri Lazebnik, Joshua Nicholson, and Sean Rife. 2017. Science with no fiction: measuring the veracity of scientific reports by citation analysis. BioRxiv, page 172940.
  • Lucas Graves. 2018. Boundaries not drawn: Mapping the institutional roots of the global fact-checking movement. Journalism Studies, 19(5):613–631.
  • Andreas Hanselowski, Christian Stab, Claudia Schulz, Zile Li, and Iryna Gurevych. 2019. A richly annotated corpus for different tasks in automated fact-checking. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 493–503, Hong Kong, China. Association for Computational Linguistics.
  • Naeemul Hassan, Fatma Arslan, Chengkai Li, and Mark Tremayne. 2017. Toward automated fact-checking: Detecting check-worthy factual claims by ClaimBuster. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1803–1812.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
  • J. Peter Kincaid, Robert P. Fishburne Jr, Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for Navy enlisted personnel. Technical report, Naval Technical Training Command Millington TN Research Branch.
  • Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.
  • Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6859–6866.
  • Ankur Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, Texas. Association for Computational Linguistics.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2018. DeClarE: Debunking fake news and false claims using evidence-aware deep learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 22–32, Brussels, Belgium. Association for Computational Linguistics.
  • Justus J. Randolph. 2005. Free-marginal multirater kappa (multirater κfree): An alternative to Fleiss' fixed-marginal multirater kappa. Online submission.
  • Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144.
  • Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. 2019a. dEFEND: Explainable fake news detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 395–405, New York, NY, USA. ACM.
  • Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2019b. FakeNewsNet: A data repository with news content, social context and spatialtemporal information for studying fake news on social media.
  • Kacper Sokol and Peter Flach. 2019. Desiderata for interpretability: Explaining decision tree predictions with counterfactuals. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 10035–10036.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
  • Bernard Turnock. 2012. Public Health: What It Is and How It Works. Jones & Bartlett Publishers, Gaithersburg, Md.
  • Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22, Baltimore, MD, USA. Association for Computational Linguistics.
  • William Yang Wang. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 422–426, Vancouver, Canada. Association for Computational Linguistics.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
  • Wanjun Zhong, Jingjing Xu, Duyu Tang, Zenan Xu, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. 2019. Reasoning over semantic-level graph for fact checking.
  • Dimitrina Zlatkova, Preslav Nakov, and Ivan Koychev. 2019. Fact-checking meets fauxtography: Verifying claims about images. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2099–2108, Hong Kong, China. Association for Computational Linguistics.