Fact or Fiction: Verifying Scientific Claims

David Wadden, Lucy Lu Wang, Madeleine van Zuylen

EMNLP 2020.

Keywords:
scientific claim verification, Major vault protein, research literature, claim verification, Clostridium difficile

Abstract:

We introduce scientific claim verification, a new task to select abstracts from the research literature containing evidence that SUPPORTS or REFUTES a given scientific claim, and to identify rationales justifying each decision. To study this task, we construct SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales.

Introduction
  • Due to rapid growth in the scientific literature, it is difficult for researchers – and the general public even more so – to stay up to date on the latest findings.
  • Verifying scientific claims is challenging and requires domain-specific background knowledge – for instance, in order to identify the evidence supporting Claim 1 in Table 1, the system must determine that a reduction in coronavirus viral load indicates a favorable clinical response, even though this fact is never mentioned.
  • Compound claims like “Aerosolized coronavirus droplets can travel at least 6 feet and can remain in the air for 3 hours” should be split into two atomic claims, one per assertion (a minimal sketch of this decomposition follows).
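As a minimal illustration of this decomposition, the compound claim above yields two independently verifiable atomic claims, each of which receives its own label against the corpus. The dictionary fields below are hypothetical and do not reproduce the released SciFact schema:

```python
# Hypothetical representation of atomic claims derived from one compound claim.
# Field names are illustrative only; labels follow the paper's SUPPORTS / REFUTES
# scheme (with a "no evidence" outcome when the corpus is silent).

compound_claim = (
    "Aerosolized coronavirus droplets can travel at least 6 feet "
    "and can remain in the air for 3 hours"
)

atomic_claims = [
    {"claim": "Aerosolized coronavirus droplets can travel at least 6 feet",
     "label": None, "evidence_abstracts": []},
    {"claim": "Aerosolized coronavirus droplets can remain in the air for 3 hours",
     "label": None, "evidence_abstracts": []},
]

for atomic in atomic_claims:
    print(atomic["claim"])
```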
Highlights
  • Due to rapid growth in the scientific literature, it is difficult for researchers – and the general public even more so – to stay up to date on the latest findings
  • We introduce the task of scientific claim verification to evaluate the veracity of scientific claims against a scientific corpus
  • We demonstrate the efficacy of our system in a real-world case study verifying claims about COVID-19 against the research literature.
  • As illustrated in Figure 1, scientific claim verification is the task of identifying evidence from the research literature that SUPPORTS or REFUTES a given scientific claim
  • Scientific claim verification presents a number of promising avenues for research on models capable of incorporating background information, reasoning about scientific processes, and assessing the strength and provenance of various evidence sources
  • More severe COVID-19 infection is associated with higher mean troponin (SMD 0.53, 95% CI 0.30 to 0.75, p < 0.001)
  • This last challenge will be especially crucial for future work that seeks to verify scientific claims against sources other than the research literature – for instance, social media and the news
Methods
  • Source journals: Nucleic Acids Research, PLOS Biology, PLOS Medicine, Science, Science Translational Medicine, and The Lancet; abstracts from all remaining journals are grouped under “Other”.
  • Appendix C (“Dataset collection and corpus statistics”), §C.1 (“Source journals”): Table 8 shows the number of cited abstracts from each of the selected journals.
  • The “Other” category includes “co-cited” (§3.1) abstracts that came from journals not among the pre-defined set.
Results
  • For LABELPREDICTION, the best performance is achieved by training first on the large FEVER dataset and then fine-tuning on the smaller in-domain SCIFACT training set (a sketch of this two-stage recipe follows this list).
  • To understand the benefits of FEVER pretraining, the authors examined the claim/evidence pairs where the FEVER + SCIFACT-trained model made correct predictions but the SCIFACT-trained model did not.
  • For RATIONALESELECTION, training on SCIFACT alone produces the best results.
  • The authors examined the rationales that the SCIFACT-trained model identified but the FEVER-trained model missed, and found that they generally contain science-specific vocabulary.
  • Evidence: “... feeding rapamycin to adult Drosophila produces life span extension ...”
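A minimal sketch of this two-stage recipe for the LABELPREDICTION component, written against the Huggingface Transformers API. The `load_pairs` helper, the file paths, and the label-id convention are hypothetical placeholders; this is not the authors' released training code:

```python
# Sketch: train a claim+evidence label predictor on FEVER first, then continue
# fine-tuning on SciFact. load_pairs() is a hypothetical helper returning a list
# of (claim, evidence, label_id) triples; label ids (e.g. 0=SUPPORTS, 1=NOINFO,
# 2=REFUTES) are an assumed convention.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large",
                                                           num_labels=3)

def train_one_epoch(pairs, lr=1e-5, batch_size=8):
    """One epoch of sequence-pair classification on (claim, evidence, label) triples."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for claims, evidence, labels in DataLoader(pairs, batch_size=batch_size,
                                               shuffle=True):
        batch = tokenizer(list(claims), list(evidence), truncation=True,
                          padding=True, return_tensors="pt")
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage 1: large out-of-domain dataset.  Stage 2: small in-domain dataset.
train_one_epoch(load_pairs("fever_train.jsonl"))    # hypothetical loader / path
train_one_epoch(load_pairs("scifact_train.jsonl"))  # hypothetical loader / path
```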
Conclusion
  • Conclusion and future work: claim verification allows researchers and the public to trace the sources of scientific claims and measure their veracity.
  • Scientific claim verification presents a number of promising avenues for research on models capable of incorporating background information, reasoning about scientific processes, and assessing the strength and provenance of various evidence sources.
  • This last challenge will be especially crucial for future work that seeks to verify scientific claims against sources other than the research literature – for instance, social media and the news.
  • The authors hope that the resources presented in this paper encourage future research on these important challenges, and help facilitate progress toward the broader goal of scientific document understanding
Tables
  • Table1: Evidence identified by our system as supporting and refuting two claims concerning COVID-19
  • Table2: Statistics on claim labels, and the number of evidence abstracts and rationales per claim
  • Table3: Comparison of different training datasets, encoders, and model inputs for RATIONALESELECTION and LABELPREDICTION, evaluated on the SCIFACT dev set. The claim-only model cannot select rationales
  • Table4: Test set performance on SCIFACT, according to the metrics from §4. For the “Oracle abstract” rows, the system is provided with gold evidence abstracts. “Oracle rationale” rows indicate that the gold rationales are provided as input. “Zero-shot” indicates zero-shot performance of a verification system trained on FEVER. Additionally, standard deviations are reported as subscripts for all F1 scores. See Appendix B for standard deviations on all reported metrics
  • Table5: Reasoning types required to verify SCIFACT claims which are classified incorrectly by our modeling baseline. Words crucial for correct verification are highlighted
  • Table6: Test set results as in Table 4, reporting mean and standard deviation over 10,000 bootstrap samples. Standard deviations are reported as subscripts. Some means reported here are slightly different from Table 4 due to sampling variability
  • Table7: Dev set results as in Table 4, reporting mean and standard deviation over 10,000 bootstrap samples
  • Table8: Number of cited documents by journal. Some co-cited articles (§3.1) come from journals outside our curated set; these are indicated by “Other”
Funding
  • This research was supported by the ONR MURI N00014-18-1-2670, ONR N00014-18-1-2826, DARPA N66001-19-2-4031, NSF (IIS 1616112), Allen Distinguished Investigator Award, and the Sloan fellowship
Study subjects and analysis
papers: 5
The majority of citances used for SCIFACT cite only the seed article (no co-cited articles), as we found in initial annotation experiments that these citances tended to yield specific, easy-to-verify claims. To expand the corpus, we identify five papers cited in the same paper as each source citance but in a different paragraph, and add these to the corpus as distractor abstracts. The resulting dataset is comparable in size to other expert-annotated scientific NLP datasets, for instance question answering (e.g. PubMedQA (Jin et al., 2019) has 1,000 questions) and information extraction (e.g. SciERC (Luan et al., 2018) has 500 annotated abstracts).
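A small sketch of this expansion step, assuming a hypothetical citation record that lists, for each citing paper, the cited abstract IDs and the paragraph each citation appears in (this is not the authors' actual collection pipeline):

```python
# Sketch: sample up to five "distractor" abstracts per citance from papers cited
# elsewhere in the same citing paper. CitingPaper is a hypothetical record type.
import random
from dataclasses import dataclass

@dataclass
class CitingPaper:
    citations: list  # (cited_abstract_id, paragraph_id) for every citation

def sample_distractors(paper: CitingPaper, citance_paragraph: int, k: int = 5):
    """Pick up to k abstracts cited in a different paragraph than the citance."""
    candidates = [aid for aid, para in paper.citations if para != citance_paragraph]
    return random.sample(candidates, min(k, len(candidates)))

paper = CitingPaper(citations=[("abs1", 2), ("abs2", 5), ("abs3", 5), ("abs4", 9)])
print(sample_distractors(paper, citance_paragraph=5))  # e.g. ['abs1', 'abs4']
```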

claim-abstract pairs: 232
SCIFACT claims are verified against abstracts rather than full articles since (1) abstracts can be annotated more scalably, (2) evidence is found in the abstract in more than 60% of cases, and (3) previous attempts at full-document annotation suffered from low annotator agreement (§7). Quality: we assign 232 claim-abstract pairs for independent re-annotation. The label agreement is 0.75 Cohen’s κ, comparable with the 0.68 Fleiss’ κ reported in Thorne et al. (2018) and the 0.70 Cohen’s κ reported in Hanselowski et al. (2019).
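The label-agreement figure is a standard Cohen's κ over the doubly annotated pairs; a minimal sketch with toy labels (not the actual 232 re-annotated SciFact pairs):

```python
# Sketch: Cohen's kappa between two annotators over claim-abstract labels.
# The label lists are toy data standing in for the 232 re-annotated pairs.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["SUPPORTS", "REFUTES", "NOINFO", "SUPPORTS", "SUPPORTS", "NOINFO"]
annotator_b = ["SUPPORTS", "REFUTES", "SUPPORTS", "SUPPORTS", "NOINFO", "NOINFO"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```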

documents: 3
Although SCIBERT performs slightly better on rationale selection, using RoBERTa-large for both RATIONALESELECTION and LABELPREDICTION gave the best full-pipeline performance on the dev set, so we use RoBERTa-large for both components. For the ABSTRACTRETRIEVAL module, the best dev set full-pipeline performance was achieved by retrieving the top k = 3 documents. Model comparisons: we report the performance of three model variants.
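For the ABSTRACTRETRIEVAL step, a simple TF-IDF similarity search over the corpus is one way to retrieve the top k = 3 abstracts per claim. The sketch below is a generic implementation over a toy corpus and is not claimed to match the paper's exact retrieval module:

```python
# Sketch: retrieve the top-k abstracts for a claim by TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = {  # abstract_id -> abstract text (toy corpus for illustration)
    "a1": "Feeding rapamycin to adult Drosophila produces life span extension.",
    "a2": "Coronavirus viral load declines after antiviral treatment.",
    "a3": "Severe COVID-19 infection is associated with elevated troponin.",
}

def retrieve_abstracts(claim: str, corpus: dict, k: int = 3):
    doc_ids, texts = list(corpus), list(corpus.values())
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    doc_vectors = vectorizer.fit_transform(texts)  # fit on the abstract corpus
    claim_vector = vectorizer.transform([claim])   # embed the claim
    scores = cosine_similarity(claim_vector, doc_vectors).ravel()
    return sorted(zip(doc_ids, scores), key=lambda pair: -pair[1])[:k]

print(retrieve_abstracts("Rapamycin extends life span in Drosophila", corpus))
```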

papers: 20
Unlike SCIFACT, these citations are not re-written into atomic claims and are therefore more difficult to verify. Expert annotators achieved very low (21.7%) inter-annotator agreement on the BioMedSumm dataset (Cohen et al., 2014), which contains 314 citations referencing 20 papers. Biomedical question answering datasets include BioASQ (Tsatsaronis et al., 2015) and PubMedQA (Jin et al., 2019), which contain 855 and 1,000 “yes/no” questions respectively (Gu et al., 2020).

Reference
  • Ramy Baly, Mitra Mohtarami, James Glass, Lluís Màrquez, Alessandro Moschitti, and Preslav Nakov. 2018. Integrating stance detection and fact checking in a unified corpus. In NAACL.
  • Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Lluís Màrquez i Villodre, Pepa Atanasova, Wajdi Zaghouani, Spas Kyuchukov, Giovanni Da San Martino, and Preslav Nakov. 2018. Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 2: Factuality. In CLEF.
  • Elaine Beller, Justin Clark, Guy Tsafnat, Clive Elliott Adams, Heinz Diehl, Hans Lund, Mourad Ouzzani, Kristina Thayer, James Thomas, Tari Turner, J. S. Xia, Karen A. Robinson, and Paul P. Glasziou. 2018. Making progress with the automation of systematic reviews: principles of the International Collaboration for the Automation of Systematic Reviews (ICASR). Systematic Reviews, 7.
  • Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In EMNLP.
  • Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. ArXiv, abs/2004.05150.
  • Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In EMNLP.
  • Sihao Chen, Daniel Khashabi, Wenpeng Yin, Chris Callison-Burch, and Dan Roth. 2019. Seeing things from a different angle: Discovering diverse perspectives about claims. In NAACL.
  • Arman Cohan, Luca Soldaini, and Nazli Goharian. 2015. Matching citation text and cited spans in biomedical literature: a search-oriented approach. In NAACL.
  • Kevin Bretonnel Cohen, Hoa Trang Dang, Anita de Waard, Prabha Yadav, and Lucy Vanderwende. 2014. TAC 2014 biomedical summarization track. https://tac.nist.gov/2014/BiomedSumm/.
  • Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours. In SemEval.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020a. ERASER: A benchmark to evaluate rationalized NLP models. In ACL.
  • Jay DeYoung, Eric Lehman, Ben Nye, Iain James Marshall, and Byron C. Wallace. 2020b. Evidence inference 2.0: More data, better models. In BioNLP@ACL.
  • Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker’s guide to testing statistical significance in natural language processing. In ACL.
  • Bradley Efron and Robert Tibshirani. 1993. An Introduction to the Bootstrap.
  • William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance classification. In NAACL.
  • Yu Gu, Robert Tinn, Hao Cheng, M. Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2020. Domain-specific language model pretraining for biomedical natural language processing. ArXiv, abs/2007.15779.
  • Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In ACL.
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.
  • Andreas Hanselowski, Christian Stab, Claudia Schulz, Zile Li, and Iryna Gurevych. 2019. A richly annotated corpus for different tasks in automated fact-checking. In CoNLL.
  • Kokil Jaidka, Muthu Kumar Chandrasekaran, Devanshu Jain, and Min-Yen Kan. 2017. The CL-SciSumm shared task 2017: Results and key insights. In BIRNDL@JCDL.
  • Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. In EMNLP.
  • Eric Lehman, Jay DeYoung, Regina Barzilay, and Byron C. Wallace. 2019. Inferring which medical treatments work from reports of clinical trials. In NAACL.
  • Tao Lei, Regina Barzilay, and Tommi S. Jaakkola. 2016. Rationalizing neural predictions. In ACL.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.
  • Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel S. Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. In ACL.
  • Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In EMNLP.
  • Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In ACL Workshop on Language Technologies and Computational Social Science.
  • Iain James Marshall, Joel Kuiper, Edward Banner, and Byron C. Wallace. 2017. Automating biomedical evidence synthesis: RobotReviewer. In ACL.
  • Iain James Marshall and Byron C. Wallace. 2019. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Systematic Reviews, 8.
  • Preslav I. Nakov, Ariel S. Schwartz, and Marti Hearst. 2004. Citances: Citation sentences for semantic analysis of bioscience text. In SIGIR Workshop on Search and Discovery in Bioinformatics.
  • Benjamin Nye, Junyi Jessy Li, Roma Patel, Yinfei Yang, Iain James Marshall, Ani Nenkova, and Byron C. Wallace. 2018. A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. In ACL.
  • Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Christopher Wilhelm, Boya Xie, Douglas M. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 Open Research Dataset. ArXiv, abs/2004.10706.
  • William Yang Wang. 2017. “Liar, liar pants on fire”: A new benchmark dataset for fake news detection. In ACL.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum. 2017. Where the truth lies: Explaining the credibility of emerging claims on the web and social media. In WWW.
  • Tal Schuster, Darsh J. Shah, Yun Jie Serene Yeo, Daniel Filizzola, Enrico Santus, and Regina Barzilay. 2019. Towards debiasing fact verification models. In EMNLP.
  • Amir Soleimani, Christof Monz, and Marcel Worring. 2019. BERT for evidence retrieval and claim verification. In European Conference on Information Retrieval.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In NAACL.
  • Guy Tsafnat, Paul P. Glasziou, Miew Keen Choong, Adam G. Dunn, Filippo Galgani, and Enrico W. Coiera. 2014. Systematic review automation technologies. Systematic Reviews, 3:74.
  • George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R. Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artières, Axel-Cyrille Ngonga Ngomo, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, and Georgios Paliouras. 2015. An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. In BMC Bioinformatics.
Implementation and uncertainty estimation
  • All models are implemented using the Huggingface Transformers package (Wolf et al., 2019).
  • We assess the uncertainty in the main results (Table 4) using a simple bootstrap approach (Dror et al., 2018; Berg-Kirkpatrick et al., 2012; Efron and Tibshirani, 1993). Given our test set of n_test = 300 claims, we generate n_boot = 10,000 bootstrap-resampled test sets by resampling (uniformly, with replacement) n_test claims from the test set. For each resampled test set, we compute the metrics in Table 4. Table 6 reports the mean and standard deviation of these metrics over the bootstrap samples; Table 7 reports dev set metrics. Our conclusion that training on SCIFACT improves performance is robust to the uncertainties presented in these tables.
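A minimal sketch of this bootstrap procedure for a single metric, using toy per-claim gold and predicted labels in place of the system's actual outputs:

```python
# Sketch: bootstrap mean / standard deviation of a test-set metric by resampling
# claims with replacement. Toy labels stand in for the per-claim system outputs.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1(gold, pred, n_boot=10_000, seed=0):
    gold, pred = np.asarray(gold), np.asarray(pred)
    rng = np.random.default_rng(seed)
    n = len(gold)
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample n_test claims with replacement
        samples.append(f1_score(gold[idx], pred[idx], average="macro"))
    return float(np.mean(samples)), float(np.std(samples))

rng = np.random.default_rng(1)
gold = rng.integers(0, 3, size=300)                  # 300 test claims, 3 labels
pred = np.where(rng.random(300) < 0.8, gold, rng.integers(0, 3, size=300))
mean, std = bootstrap_f1(gold, pred)
print(f"macro-F1 = {mean:.3f} +/- {std:.3f}")
```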