Hierarchical Evidence Set Modeling for Automated Fact Extraction and Verification

Shyam Subramanian
Kyumin Lee

EMNLP 2020.

Keywords:
fact extraction, social media, language inference, accurate claim, label accuracy

Abstract:

Automated fact extraction and verification is a challenging task that involves finding relevant evidence sentences from a reliable corpus to verify the truthfulness of a claim. Existing models either (i) concatenate all the evidence sentences, leading to the inclusion of redundant and noisy information; or (ii) process each claim-eviden...

Introduction
  • A study by Gabielkov et al. (2016) revealed that 60% of people on social media share news after reading just the title, without reading the actual content of the article.
  • Concatenating all the evidence sentences leads to the inclusion of redundant and noisy information; processing each evidence sentence separately, on the other hand, delays the combination of relevant sentences that belong to the same evidence set for claims that require aggregating information from multiple sentences.
  • Processing sentences separately also makes claim verification harder, since information is summarized without its complete context.
  • In HESM, each evidence set verifies the claim individually, and the set-level results are aggregated for the final verification.
Highlights
  • A study by Gabielkov et al. (2016) revealed that 60% of people on social media share news after reading just the title, without reading the actual content of the article
  • Our work focuses on the automated fact extraction and verification task, which requires retrieving the evidence related to a claim as well as verifying the claim based on that evidence
  • We propose Hierarchical Evidence Set Modeling, which consists of a document retriever, a multi-hop evidence retriever, and claim verification (a minimal sketch of this pipeline follows the list)
  • The claim verification experiment is conducted on the test set, since each baseline's officially evaluated results are reported on the FEVER leaderboard
  • Hierarchical Evidence Set Modeling (HESM) operates at the evidence-set level initially and combines information from all the evidence sets using hierarchical aggregation to verify the claim
  • Our experiments confirm that our hierarchical evidence set modeling outperforms 7 state-of-the-art baselines, producing more accurate claim verification
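To make the three components named above concrete, the following is a minimal, hedged sketch of an HESM-style pipeline. The word-overlap scoring and the confidence-weighted vote are toy stand-ins for the learned components, and all function names and signatures are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an HESM-style three-stage pipeline (document retrieval,
# multi-hop evidence retrieval, claim verification with final aggregation).
# Word-overlap scoring and the confidence-weighted vote are toy stand-ins
# for the learned components; nothing here is the authors' implementation.
from typing import Dict, List, Tuple


def overlap(a: str, b: str) -> float:
    """Toy relevance score: fraction of claim tokens that appear in the candidate text."""
    a_tokens, b_tokens = set(a.lower().split()), set(b.lower().split())
    return len(a_tokens & b_tokens) / max(len(a_tokens), 1)


def retrieve_documents(claim: str, corpus: Dict[str, List[str]], k: int = 10) -> List[str]:
    """Stage 1: pick the k pages whose text best overlaps the claim."""
    ranked = sorted(corpus, key=lambda page: overlap(claim, " ".join(corpus[page])), reverse=True)
    return ranked[:k]


def retrieve_evidence_sets(claim: str, pages: List[str], corpus: Dict[str, List[str]],
                           per_set: int = 3) -> List[List[str]]:
    """Stage 2: form one candidate evidence set per page from its best-matching sentences."""
    sets = []
    for page in pages:
        ranked = sorted(corpus[page], key=lambda s: overlap(claim, s), reverse=True)
        sets.append(ranked[:per_set])
    return sets


def verify_evidence_set(claim: str, evidence_set: List[str]) -> Tuple[str, float]:
    """Stage 3a: toy set-level verdict; a real model classifies the claim against the set."""
    score = overlap(claim, " ".join(evidence_set))
    return ("SUPPORTS", score) if score > 0.5 else ("NOT ENOUGH INFO", 1.0 - score)


def verify_claim(claim: str, corpus: Dict[str, List[str]]) -> str:
    """Stage 3b: aggregate set-level verdicts into a final label (confidence-weighted vote)."""
    pages = retrieve_documents(claim, corpus)
    verdicts = [verify_evidence_set(claim, es) for es in retrieve_evidence_sets(claim, pages, corpus)]
    return max(verdicts, key=lambda v: v[1])[0] if verdicts else "NOT ENOUGH INFO"
```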
Methods
  • 5.1 Experiment Setting

    The authors describe the dataset, evaluation metrics, baselines, and implementation details used in the experiments (a hedged sketch of the standard evaluation metrics appears after this list).

    Dataset.
  • The dataset consists of training, development, and test sets, as shown in Table 1.
  • The training and development sets, along with their ground-truth evidence and labels, are publicly available.
  • The ground truth evidence and labels of the test set are not publicly available.
  • Once a model's extracted evidence sets/sentences and predicted labels for the test set are submitted to the online evaluation system, its performance is measured and displayed there.
  • The authors train on the training set and tune the hyper-parameters on the development set.
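Since the items above mention evaluation metrics, here is a minimal sketch of the two standard FEVER metrics, label accuracy and FEVER score, following their commonly cited definitions; the record field names used below are assumptions for illustration, not the authors' code.

```python
# Sketch of the two standard FEVER metrics: label accuracy and FEVER score.
# Commonly cited definition: a claim counts toward the FEVER score only if the
# predicted label is correct AND, for SUPPORTS/REFUTES claims, at least one
# complete gold evidence set is covered by the predicted evidence.
# Field names of the example records are assumptions for illustration.

def label_accuracy(examples):
    correct = sum(1 for ex in examples if ex["predicted_label"] == ex["gold_label"])
    return correct / len(examples)


def fever_score(examples):
    strict = 0
    for ex in examples:
        if ex["predicted_label"] != ex["gold_label"]:
            continue
        if ex["gold_label"] == "NOT ENOUGH INFO":
            strict += 1  # no evidence requirement for NEI claims
            continue
        predicted = set(map(tuple, ex["predicted_evidence"]))  # (page, sentence_id) pairs
        # gold_evidence_sets: list of gold sets, each a list of (page, sentence_id) pairs
        if any(set(map(tuple, gold)) <= predicted for gold in ex["gold_evidence_sets"]):
            strict += 1
    return strict / len(examples)
```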
Results
  • Experimental Results and Analysis

    Experiments are conducted to evaluate the performance of evidence retrieval, claim verification, and the aggregation approaches.
  • Table 3 compares HESM (with BERT Base, ALBERT Base, and ALBERT Large encoders) against UKP Athene (Hanselowski et al., 2018b), UCL MRG (Yoneda et al., 2018), UNC NLP (Nie et al., 2019), BERT Pair and BERT Concat (Zhou et al., 2019), BERT Base and BERT Large (Soleimani et al., 2020), BERT Large (Stammbach and Neumann, 2019), GEAR (Zhou et al., 2019), and KGAT with BERT Base, BERT Large, and RoBERTa Large encoders (Liu et al., 2020).
  • As shown in Table 2, the authors compare the evidence retrieval performance of their model with two baselines, UNC NLP (Nie et al., 2019) and a BERT-based model (Stammbach and Neumann, 2019).
  • Since most other previous works use either an ESIM-based or a BERT-based model for evidence retrieval, these two representative baselines are used for comparison.
  • Multi-hop evidence retrieval approaches (the authors' and Stammbach and Neumann (2019)) performed better than UNC NLP, which conducts only a single retrieval iteration; a minimal sketch of the multi-hop idea follows this list.
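The multi-hop retrieval comparison above can be illustrated with a short sketch of a two-iteration sentence retriever. This is a hedged illustration of the general idea only: the scoring function, thresholds, and query-expansion step below are placeholders, not the authors' exact procedure.

```python
# Hedged sketch of two-iteration (multi-hop) evidence retrieval, as contrasted
# with single-iteration retrieval above. `score_pair` stands in for a trained
# sentence-pair classifier; thresholds and the query expansion are illustrative.
from typing import Callable, List


def multi_hop_retrieve(
    claim: str,
    candidate_sentences: List[str],
    score_pair: Callable[[str, str], float],
    hops: int = 2,
    threshold: float = 0.5,
    top_k: int = 3,
) -> List[str]:
    evidence: List[str] = []
    query = claim
    for _ in range(hops):
        scored = [(s, score_pair(query, s)) for s in candidate_sentences if s not in evidence]
        hits = [s for s, p in sorted(scored, key=lambda x: -x[1])[:top_k] if p >= threshold]
        if not hits:
            break
        evidence.extend(hits)
        # Next hop: condition the query on evidence found so far, so sentences that
        # only make sense together with earlier evidence can still be retrieved.
        query = claim + " " + " ".join(evidence)
    return evidence
```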
Conclusion
  • The authors have proposed the HESM framework for automated fact extraction and verification.
  • HESM operates at the evidence-set level initially and combines information from all the evidence sets using hierarchical aggregation to verify the claim.
  • The authors' experiments confirm that hierarchical evidence set modeling outperforms 7 state-of-the-art baselines, producing more accurate claim verification.
  • The authors' analysis of contextual and non-contextual aggregations illustrates that the two aggregations perform different roles and positively contribute to different aspects of fact verification.
Tables
  • Table 1: Statistics of the FEVER dataset
  • Table 2: Evidence retrieval performance of the baselines and our model on the development set
  • Table 3: Performance of the baselines and our model on the test set
  • Table 4: Claim verification with different aggregation methods on the development set
  • Table 5: Ablation analysis on the development set
  • Table 6: Attention analysis for contextual and non-contextual aggregation. One of the two aggregations performs better in identifying evidence that does not have enough information to support or refute the claim (i.e., claims labeled NOT ENOUGH INFO); thus, each aggregation complements the other in claim verification
Funding
  • This work was supported in part by NSF grant CNS-1755536, AWS Cloud Credits for Research, and Google Cloud
Study subjects and analysis
We evaluate our framework HESM on the FEVER dataset, a large-scale fact verification dataset (Thorne et al., 2018a). The dataset consists of 185,445 claims with human-annotated evidence sentences drawn from 5,416,537 documents. Each claim is labeled as SUPPORTS, REFUTES, or NOT ENOUGH INFO.
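For concreteness, a single FEVER claim record has roughly the following shape; the field names reflect the published JSONL format as best recalled here and should be treated as an assumption rather than a specification.

```python
# Rough shape of a single FEVER claim record (JSONL). Field names follow the
# published dataset as recalled; treat them as an assumption for illustration.
example_record = {
    "id": 137334,                      # hypothetical claim id
    "claim": "Example claim text.",
    "label": "SUPPORTS",               # one of SUPPORTS / REFUTES / NOT ENOUGH INFO
    "evidence": [
        # each inner list is one evidence set; entries are
        # [annotation_id, evidence_id, wikipedia_page, sentence_index]
        [[100001, 200001, "Example_Page", 3]],
    ],
}

LABELS = ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")
```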

Reference
  • Hannah Bast, Bjorn Buchhold, and Elmar Haussmann. 2017. Overview of the triple scoring task at the WSDM Cup 2017. ArXiv, abs/1712.08081.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020. Tabfact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations.
  • Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
  • William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Maksym Gabielkov, Arthi Ramachandran, Augustin Chaintreau, and Arnaud Legout. 2016. Social Clicks: What and Who Gets Read on Twitter? In ACM SIGMETRICS / IFIP Performance 2016.
  • Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M. Meyer, and Iryna Gurevych. 2018a. A retrospective analysis of the fake news challenge stance-detection task. In Proceedings of the 27th International Conference on Computational Linguistics.
  • Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. 2018b. UKP-athene: Multisentence textual entailment for claim verification. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). Association for Computational Linguistics.
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR) 2015.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations.
  • Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2020. Fine-grained fact verification with kernel graph attention network. In ACL.
  • Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
  • Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations.
  • Ndapandula Nakashole and Tom M. Mitchell. 2014. Language-aware truth assessment of fact candidates. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. Proceedings of the AAAI Conference on Artificial Intelligence.
  • Amir Soleimani, Christof Monz, and Marcel Worring. 2020. Bert for evidence retrieval and claim verification. In Advances in Information Retrieval, pages 359–366. Springer International Publishing.
  • Dominik Stammbach and Guenter Neumann. 2019. Team DOMLIN: Exploiting evidence enhancement for the FEVER shared task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER).
  • James Thorne and Andreas Vlachos. 2017. An extensible framework for verification of numerical claims. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018a. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
  • James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018b. The fact extraction and VERification (FEVER) shared task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER).
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17. Curran Associates Inc.
  • Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.
  • Andreas Vlachos and Sebastian Riedel. 2015. Identification and verification of simple claims about statistical properties. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
  • William Yang Wang. 2017. “liar, liar pants on fire”: A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. UCL machine reading group: Four factor framework for fact finding (HexaF). In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER).
  • Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2019. GEAR: Graph-based evidence aggregating and reasoning for fact verification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  • Baselines. We compare our model with 7 state-of-the-art baselines, including the top-performing models from the FEVER shared task 1.0 (Thorne et al., 2018b), BERT-based models, and a graph-based model.
  • The top-performing models from the FEVER shared task 1.0 include UNC NLP (Nie et al., 2019), UKP Athene (Hanselowski et al., 2018b), and UCL MRG (Yoneda et al., 2018). All three models use a modified version of the Enhanced Sequential Inference Model (Chen et al., 2017) for claim verification. The UNC NLP model concatenates all retrieved evidence sentences together to verify the claim, whereas the UCL MRG and UKP Athene models process each evidence sentence separately and aggregate them at a later stage. UCL MRG reports the best results with linear-layer aggregation, while UKP Athene uses an attention-based aggregation; a generic sketch of this late-aggregation style follows.
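As referenced above, attention-based late aggregation can be sketched as pooling per-sentence claim-evidence representations into one vector before classification; the module below is a generic stand-in, not either baseline's exact implementation.

```python
# Generic sketch of attention-based late aggregation over per-sentence
# claim-evidence representations (the style attributed to UKP Athene above).
# A stand-in illustration, not either baseline's exact implementation.
import torch
import torch.nn as nn


class AttentionAggregator(nn.Module):
    def __init__(self, hidden_dim: int, num_labels: int = 3):
        super().__init__()
        self.attn = nn.Linear(hidden_dim, 1)        # scores each sentence representation
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, sentence_reprs: torch.Tensor) -> torch.Tensor:
        # sentence_reprs: (num_sentences, hidden_dim), one vector per claim-evidence pair
        weights = torch.softmax(self.attn(sentence_reprs), dim=0)   # (num_sentences, 1)
        pooled = (weights * sentence_reprs).sum(dim=0)              # (hidden_dim,)
        return self.classifier(pooled)                              # label logits
```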
  • Detailed Implementation, Training and Hyperparameter Tuning. For training the document retriever, the Adam optimizer (Kingma and Ba, 2015) is used with a batch size of 128, together with a cross-entropy loss. The maximum number of retrieved documents K1 is set to 10. In the multi-hop evidence retrieval stage, the number of iterations N is set to 2. For both iterations, the ALBERT-Base model for sequence classification is used and is trained with a batch size of 64, the AdamW optimizer (Loshchilov and Hutter, 2019), and a learning rate of 5e-5. In the first iteration, we set the threshold probability th_evi1 to 0.5 and the maximum number of sentences per claim K2 to 3. We also use the annealed sampling strategy of Nie et al. (2019) to decrease the number of negative examples after each epoch, so that the model learns to be more tolerant about selecting sentences while remaining discriminative enough to filter out apparent negative sentences. These settings are collected into a configuration sketch below.
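The hyper-parameters listed above can be gathered into one configuration sketch; the values come from the description itself, while the dataclass structure and field names are merely an illustrative convenience.

```python
# Hyper-parameters described above, collected in one place. Values come from the
# text; the dataclass itself is only an illustrative convenience, not the authors' code.
from dataclasses import dataclass


@dataclass
class HESMTrainingConfig:
    # Document retriever (Adam optimizer, cross-entropy loss)
    doc_batch_size: int = 128
    max_documents_k1: int = 10
    # Multi-hop evidence retriever (ALBERT-Base sequence classification, AdamW)
    retrieval_iterations_n: int = 2
    retrieval_batch_size: int = 64
    retrieval_learning_rate: float = 5e-5
    first_hop_threshold: float = 0.5         # th_evi1
    max_sentences_per_claim_k2: int = 3
    annealed_negative_sampling: bool = True  # following Nie et al. (2019)
```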