Generating Fact Checking Briefs

Angela Fan, Aleksandra Piktus, Guillaume Wenzek, Marzieh Saeidi

EMNLP 2020.

Other links: arxiv.org
Keywords: question answering, professional fact, neural question generation, Natural Questions, Dense Passage Retriever (9+ more)

Abstract

Fact checking at scale is difficult -- while the number of active fact checking websites is growing, it remains too small for the needs of the contemporary media ecosystem. However, despite good intentions, contributions from volunteers are often error-prone, and thus in practice restricted to claim detection. We investigate how to increase the accuracy and efficiency of fact checking by providing information about the claim before performing the check, in the form of natural language briefs. [...]

Introduction
  • Volunteers, on the other hand, are not considered accurate enough; with access to a search engine, Roitero et al. (2020) report crowdsourced fact check accuracies of around 58%.
  • This result corroborates earlier reports by fact checking websites which attempted to engage volunteers, but reported success only for claim detection, which is considered a much simpler task (Konstantinovskiy et al., 2018).
  • This is problematic, both from the perspective of using crowdsourced fact checking to combat misinformation and from the perspective of helping individuals fact check themselves.
Highlights
  • We introduce QABRIEFER, a novel model that performs structured generation via claim-conditioned question generation and open domain question answering
  • We introduce the notion of briefs to provide relevant information to fact checkers—as if briefing them before fact checking— and explore three possible forms: Passage Briefs, Entity Briefs, and Question Answering Briefs
  • Dense Passage Retriever (DPR) is trained on Wikipedia, and we found the best performance within this domain
  • In adapting BART for question generation based on claims, we explore three options: generating all questions based only on the claim, generating all questions based on the claim and the source of the claim, and generating questions one at a time (a minimal sketch of the last variant follows this list)
  • QABriefs improve accuracy by 10% compared to using only a search bar while reducing the time a fact check takes
  • We find that first fine-tuning on a large question answering dataset, Natural Questions (NQ), and further fine-tuning on QABRIEFDATASET provides the best results
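The one-question-at-a-time option highlighted above can be made concrete with a short sketch. This is not the authors' implementation: it is a minimal illustration built on the Hugging Face transformers BART interface, and the checkpoint name, input formatting, and the "[EOQ]" end-of-questions marker are assumptions for illustration only.

```python
# Minimal sketch of claim-conditioned question generation, one question at a time.
# Assumes a BART checkpoint fine-tuned for this task (the name below is a
# placeholder) and a textual "[EOQ]" end-of-questions marker.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")  # placeholder checkpoint

def generate_questions(claim: str, source: str, max_questions: int = 5) -> list:
    questions = []
    for _ in range(max_questions):
        # Condition on the claim, its source, and all previously generated questions.
        context = f"claim: {claim} source: {source} previous: {' '.join(questions)}"
        inputs = tokenizer(context, return_tensors="pt", truncation=True, max_length=1024)
        output = model.generate(**inputs, num_beams=4, max_length=64)
        question = tokenizer.decode(output[0], skip_special_tokens=True).strip()
        if not question or question == "[EOQ]":  # model signals no more questions are needed
            break
        questions.append(question)
    return questions

print(generate_questions("Example claim to be checked.", "speech by a public figure"))
```

Generating one question at a time lets each new question condition on the ones already produced, so the model can avoid repetition and decide when to stop.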
Methods
  • The authors' main question is whether briefs can increase the accuracy and efficiency of fact checking.
  • To write questions one at a time, the model conditions on the previous questions as well as the claim and source, and must predict either the subsequent question or an end-of-questions token.
  • Models take as input the question and the evidence document that annotators indicated contains the answer, and produce an answer.
  • As QABRIEFDATASET does not have enough data to train a question answering model from scratch, the authors use BART fine-tuned on Natural Questions; a sketch of this fine-tuning setup follows this list.
  • As the dataset contains extractive and abstractive answers as well as questions where the
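A compact sketch of the two-stage fine-tuning schedule mentioned in the highlights (Natural Questions first, then QABRIEFDATASET), again using the Hugging Face transformers interface rather than the authors' original setup; the data variables, input formatting, and hyperparameters are placeholders.

```python
# Sketch: fine-tune a seq2seq model (BART) on (input, target) text pairs, first on
# Natural Questions and then on QABriefDataset. Inputs are assumed to be formatted
# as "question: ... context: <evidence document>"; targets are the answers.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def fine_tune(model, pairs, epochs=3, lr=3e-5):
    """pairs: iterable of (input_text, target_text) tuples."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for src, tgt in pairs:
            enc = tokenizer(src, return_tensors="pt", truncation=True, max_length=1024)
            labels = tokenizer(tgt, return_tensors="pt", truncation=True, max_length=128).input_ids
            loss = model(**enc, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

# nq_pairs and qabrief_pairs are placeholders for the preprocessed datasets.
# model = fine_tune(model, nq_pairs)       # stage 1: Natural Questions
# model = fine_tune(model, qabrief_pairs)  # stage 2: QABriefDataset
```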
Results
  • The authors show in human evaluations that fact checking efficiency and accuracy are improved with briefs.
  • Volunteer Evaluators Crowdsourced evaluation is scalable, but crowdworkers may be less motivated to spend a large amount of time fact checking.
  • The authors conduct a smaller scale study using graduate student volunteer evaluators, recruited by asking for those interested in the challenge of fact checking real claims themselves.
  • The authors do not evaluate Passage Briefs or Entity Briefs, as they found volunteer fact checking to be less scalable than crowdsourcing.
Conclusion
  • While the experiments show a generally positive impact of briefs for human fact checking, it is important to put them into a broader perspective.

    Briefs for Professional Fact Checkers: Crowdworkers and professional fact checkers perform different tasks under very different circumstances.
  • Professionals often investigate alternative interpretations and produce an explanation of their process in an article.
  • They often have years of experience and must check a variety of claims.
  • As the QABrief dataset was created from professional fact checking articles that describe how a claim was checked, decomposing a claim into multiple components lets the authors encourage a more structured fact checking process. The authors propose the concept of fact checking briefs, to be read before performing a fact check.
  • The authors show in extensive empirical studies with crowdworkers and volunteers that QABriefs can improve the accuracy and efficiency of fact checking.
Annotation Objectives
  • Instructions for Writing Questions: The authors' goal is to understand how a claim is fact checked.
  • Instructions for Validating Questions: The authors' goal is to understand the steps necessary to fact check a claim.
  • Instructions for Question Clarity: The authors' goal is to make sure each question is readable and could be used in a Google search to find an answer.
  • Instructions for Finding Answers: The authors' goal is to find answers to each of these questions.
Tables
  • Table 1: Statistics of QABRIEFDATASET
  • Table 2: Question Generation Models
  • Table 3: Question Answering Models
Related Work
  • Previous work in NLP has focused on claim veracity. It has been treated as a classification problem (Wang, 2017), often using stance detection (Riedel et al., 2017). The FEVER Challenge (Thorne et al., 2018) proposed providing provenance for a decision along with classification, and various approaches have been developed that combine information retrieval with stance detection or question answering (Li et al., 2018; Lee et al., 2018). Question generation and answering have been considered in the context of FEVER (Jobanputra, 2019), but the focus was on eliciting the right answer from a question answering system rather than improving the accuracy and efficiency of human fact checkers.

    However, FEVER is based on modified Wikipedia sentences, not real world claims, which are arguably more difficult. To address this, Hanselowski et al. (2019) considered the claims fact checked by the website Snopes, but used the reports accompanying them as evidence instead of finding the evidence directly. Popat et al. (2018) and Augenstein et al. (2019) used search engines, but without ensuring that the results provide evidence supporting or refuting the claim rather than merely being related to it, or that they were not themselves fact checking reports. Finally, Kochkina et al. (2018) used responses on social media for rumour verification, but did not address evidence finding.
Key Statements
  • We investigate how to increase the accuracy and efficiency of fact checking by providing information about the claim before performing the check, in the form of natural language briefs
  • We show that fact checking with briefs — in particular QABriefs — increases the accuracy of crowdworkers by 10% while slightly decreasing the time taken
  • For volunteer (unpaid) fact checkers, QABriefs slightly increase accuracy and reduce the time required by around 20%
  • Volunteers, on the other hand, are not considered accurate enough; with access to a search engine, Roitero et al. (2020) report crowdsourced fact check accuracies of around 58%
  • In this work, we propose briefs to increase the accuracy and efficiency of fact checking ( Figure 1)
  • In experiments with crowdworkers, QABriefs improve accuracy by 10% compared to using only a search bar while reducing the time a fact check takes
  • For volunteer fact checkers, accuracy is improved by 4% and the process is 20% faster compared to using a search bar
  • Our main question is whether briefs can increase the accuracy and efficiency of fact checking
  • To evaluate the quality of question answering, we use the F1 score of Rajpurkar et al. (2016); a sketch of this token-overlap metric follows this list
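The F1 score referenced above is the token-overlap F1 from SQuAD (Rajpurkar et al., 2016). A self-contained sketch of that metric is below; the normalisation steps follow the common SQuAD recipe and may differ in minor details from the authors' exact evaluation script.

```python
# SQuAD-style token-overlap F1 between a predicted and a gold answer:
# lowercase, strip punctuation and articles, then compare bags of tokens.
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def f1_score(prediction, gold):
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the 44th president, Barack Obama", "Barack Obama"))  # ~0.67
```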
Data and Analysis
Question and answer pairs per QABrief: roughly 3.
To learn how to produce QABriefs and create training data, we use crowdsourcing to gather such briefs based on existing fact checks. We create QABRIEFDATASET, a collection of about 10,000 QABriefs with roughly 3 question and answer pairs each. We introduce QABRIEFER, a novel model that performs structured generation via claim-conditioned question generation and open domain question answering; a schematic of this pipeline follows.
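To make the structured-generation idea concrete, here is a schematic of a QABriefer-style pipeline: decompose the claim into questions, retrieve evidence for each question, answer it, and assemble the brief. The toy components below are stand-ins rather than the released models: real question generation and answering would use the fine-tuned BART models, and real retrieval would score dense DPR embeddings by dot product rather than word overlap.

```python
# Schematic QABriefer-style pipeline: decompose a claim into questions, retrieve
# evidence for each question, answer it, and assemble the resulting QABrief.
# The toy components below are stand-ins for the real models.
from dataclasses import dataclass

@dataclass
class QABrief:
    claim: str
    qa_pairs: list  # list of (question, answer) tuples

# Toy passage index; the real system retrieves from a large corpus with dense embeddings.
PASSAGES = [
    "Example passage one about the claim topic.",
    "Example passage two with background information.",
]

def generate_questions(claim, source):
    # Stand-in for claim-conditioned question generation with BART.
    return [f"What evidence supports or refutes the claim: {claim}?"]

def retrieve_evidence(question, top_k=1):
    # Stand-in for DPR: here passages are scored by simple word overlap
    # instead of a dot product between dense question/passage embeddings.
    q_tokens = set(question.lower().split())
    scored = sorted(PASSAGES, key=lambda p: -len(q_tokens & set(p.lower().split())))
    return scored[:top_k]

def answer_question(question, passages):
    # Stand-in for abstractive QA with a fine-tuned seq2seq model.
    return f"(answer derived from: {passages[0]})"

def make_qabrief(claim, source):
    qa_pairs = []
    for q in generate_questions(claim, source):
        qa_pairs.append((q, answer_question(q, retrieve_evidence(q))))
    return QABrief(claim=claim, qa_pairs=qa_pairs)

print(make_qabrief("Example claim to be checked.", "example source"))
```

The resulting brief, the claim plus its question and answer pairs, is what a human fact checker would read before issuing a verdict.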

References
  • Michelle A Amazeen. 2016. Checking the factcheckers in 2008: Predicting political ad scrutiny and assessing consistency. Journal of Political Marketing, 15(4):433–464.
  • Isabelle Augenstein, Christina Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Christian Hansen, and Jakob Grue Simonsen. 2019. MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4685–4697, Hong Kong, China. Association for Computational Linguistics.
  • Bence Bago, David G Rand, and Gordon Pennycook. 2020. Fake news, fast and slow: Deliberation reduces belief in false (but not true) news headlines. Journal of experimental psychology: general.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.
  • Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342– 1352.
  • Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. 2017. Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 866–874.
  • Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. Eli5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567.
  • D. Graves. 2018. Understanding the promise and limits of automated fact-checking.
  • Andreas Hanselowski, Christian Stab, Claudia Schulz, Zile Li, and Iryna Gurevych. 2019. A richly annotated corpus for different tasks in automated fact-checking. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 493–503, Hong Kong, China. Association for Computational Linguistics.
  • Naeemul Hassan, Bill Adair, James Hamilton, Chengkai Li, Mark Tremayne, Jun Yang, and Cong Yu. 2015. The quest to automate fact-checking. Proceedings of the 2015 Computation + Journalism Symposium.
  • Seth J Hill. 2017. Learning together slowly: Bayesian learning about political facts. The Journal of Politics, 79(4):1403–1418.
  • Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969.
  • Mayank Jobanputra. 2019. Unsupervised question answering for fact-checking. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), pages 52–56, Hong Kong, China. Association for Computational Linguistics.
  • Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
  • Alireza Karduni. 2019. Human-misinformation interaction: Understanding the interdisciplinary approach needed to computationally combat false information. arXiv preprint arXiv:1903.07136.
  • Alireza Karduni, Isaac Cho, Ryan Wesslen, Sashank Santhanam, Svitlana Volkova, Dustin L Arendt, Samira Shaikh, and Wenwen Dou. 2019. Vulnerable to misinformation? verifi! In Proceedings of the 24th International Conference on Intelligent User Interfaces, pages 312–323.
  • Vladimir Karpukhin, Barlas Oguz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. CoRR, abs/2004.04906.
  • Elena Kochkina, Maria Liakata, and Arkaitz Zubiaga. 2018. All-in-one: Multi-task learning for rumour verification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3402–3413, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Lev Konstantinovskiy, Oliver Price, Mevan Babakar, and Arkaitz Zubiaga. 2018. Towards automated factchecking: Developing an annotation schema and benchmark for consistent automated claim detection.
  • Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
  • Nayeon Lee, Chien-Sheng Wu, and Pascale Fung. 2018. Improving large-scale fact-checking using decomposable attention models and lexical tagging. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1133– 1138, Brussels, Belgium. Association for Computational Linguistics.
  • Mike Lewis and Angela Fan. 2018. Generative question answering: Learning to answer the whole question.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Kuttler, Mike Lewis, Wen-tau Yih, Tim Rocktaschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks.
  • Sizhen Li, Shuai Zhao, Bo Cheng, and Hao Yang. 2018. An end-to-end multi-task learning model for fact checking. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 138–144, Brussels, Belgium. Association for Computational Linguistics.
  • Chloe Lim. 2018. Checking how fact-checkers check. Research & Politics, 5(3):2053168018786848.
  • Emma Lurie. 2019. The challenges of algorithmically assigning fact-checks: A sociotechnical examination of google’s reviewed claims.
  • Morgan Marietta, David C Barker, and Todd Bowser. 2015. Fact-checking polarized politics: Does the fact-check industry provide consistent guidance on disputed realities? In The Forum, volume 13, pages 577–596.
  • Sebastiao Miranda, David Nogueira, Afonso Mendes, Andreas Vlachos, Andrew Secker, Rebecca Garrett, Jeff Mitchel, and Zita Marinho. 2019. Automated fact checking in the news room. In The World Wide Web Conference, pages 3579–3583.
  • Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: a human-generated machine reading comprehension dataset.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.
  • Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2018. DeClarE: Debunking fake news and false claims using evidence-aware deep learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 22–32, Brussels, Belgium. Association for Computational Linguistics.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Sudha Rao and Hal Daume III. 2018. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2737–2746, Melbourne, Australia. Association for Computational Linguistics.
  • Benjamin Riedel, Isabelle Augenstein, Georgios P Spithourakis, and Sebastian Riedel. 2017. A simple but tough-to-beat baseline for the fake news challenge stance detection task. arXiv preprint arXiv:1707.03264.
  • Kevin Roitero, Michael Soprano, Shaoyang Fan, Damiano Spina, Stefano Mizzaro, and Gianluca Demartini. 2020. Can the crowd identify misinformation objectively? the effects of judgment scale and assessor’s background. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
  • Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Duyu Tang, Nan Duan, Tao Qin, Zhao Yan, and Ming Zhou. 2017. Question answering and question generation as dual tasks. arXiv preprint arXiv:1706.02027.
  • Gilmore, Nick B Adams, Emmanuel Vincent, Jennifer Lee, Martin Robbins, et al. 2018. A structured response to misinformation: Defining and annotating credibility indicators in news articles. In Companion Proceedings of the The Web Conference 2018, pages 603–612.
  • Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3901–3910.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
  • Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktaschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 673–683.
  • Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22, Baltimore, MD, USA. Association for Computational Linguistics.
  • William Yang Wang. 2017. “liar, liar pants on fire”: A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 422–426, Vancouver, Canada. Association for Computational Linguistics.
  • Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2019. Zero-shot entity linking with dense entity retrieval. arXiv preprint arXiv:1911.03814.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.