Learning from others' mistakes: Avoiding dataset biases without modeling them

ICLR 2021.

TL;DR: Reducing a model's reliance on dataset biases by encouraging a robust model to learn from a weak learner's mistakes.

Abstract:

State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended underlying task. Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available. We consider cases where the bias is...

Introduction
  • The natural language processing community has made tremendous progress in using pre-trained language models to improve predictive accuracy (Devlin et al, 2019; Raffel et al, 2019).
  • Gururangan et al (2018); Poliak et al (2018); Tsuchiya (2018) show that a model trained solely on the hypothesis, completely ignoring the intended signal, reaches strong performance
  • The authors refer to these surface patterns as dataset biases since the conditional distribution of the labels given such biased features is likely to change in examples outside the training data distribution (as formalized by He et al (2019))
Highlights
  • The natural language processing community has made tremendous progress in using pre-trained language models to improve predictive accuracy (Devlin et al, 2019; Raffel et al, 2019)
  • Our contributions are the following: (a) we show that weak learners are prone to relying on shallow heuristics and highlight how they rediscover previously human-identified dataset biases; (b) we demonstrate that we do not need to explicitly know or model dataset biases to train more robust models that generalize better to out-of-distribution examples; (c) we discuss the design choices for weak learners and show trade-offs between higher out-of-distribution performance at the expense of the in-distribution performance
  • In practice, that remains a challenge in natural language processing (Linzen, 2020; Yogatama et al, 2019) and our work aims at out-of-distribution robustness without significantly compromising in-distribution performance
  • Our training approach can be decomposed in two successive stages: (a) training the weak learner fW with a standard cross-entropy loss (CE) and (b) training a main model fM via product of experts (PoE) to learn from the errors of the weak learner
  • We have presented an effective method for training models robust to dataset biases
  • Leveraging a weak learner with limited capacity and a modified product of experts training setup, we show that dataset biases do not need to be explicitly known or modeled to be able to train models that can generalize significantly better to out-of-distribution examples
Methods
  • The authors' approach utilizes product of experts (Hinton, 2002) to factor dataset biases out of a learned model.
  • The product-of-experts ensemble of fW and fM produces a logits vector e by summing the two models' log-probabilities.
  • The authors' training approach can be decomposed in two successive stages: (a) training the weak learner fW with a standard cross-entropy loss (CE) and (b) training a main model fM via product of experts (PoE) to learn from the errors of the weak learner.
  • The core intuition of this method is to encourage the robust model to make predictions that take the weak learner's mistakes into account; a minimal sketch of this two-stage setup follows below.
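The following is a minimal PyTorch-style sketch of the two-stage setup described above. It is an illustration under stated assumptions, not the authors' released code: weak_model and main_model stand for any classifiers returning logits, and the function names are illustrative. The PoE loss sums the two models' log-probabilities before applying the usual cross-entropy, with the weak learner's logits detached so that only the main model receives gradients.

```python
import torch
import torch.nn.functional as F

def poe_loss(main_logits, weak_logits, labels):
    """Product-of-experts cross-entropy: sum the two experts'
    log-probabilities, then apply the standard CE loss. The weak
    logits are detached so gradients only reach the main model."""
    combined = F.log_softmax(main_logits, dim=-1) + F.log_softmax(weak_logits, dim=-1).detach()
    return F.cross_entropy(combined, labels)

# Stage (a): train the weak learner f_W with a standard cross-entropy loss.
def weak_step(weak_model, optimizer, inputs, labels):
    loss = F.cross_entropy(weak_model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage (b): train the main model f_M through the product of experts,
# keeping the already-trained weak learner frozen.
def main_step(main_model, weak_model, optimizer, inputs, labels):
    with torch.no_grad():
        weak_logits = weak_model(inputs)
    loss = poe_loss(main_model(inputs), weak_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Passing the summed log-probabilities to cross_entropy (which applies its own log-softmax) amounts to taking a softmax over the renormalized product of the two experts' distributions, so examples the weak learner already handles confidently contribute little gradient to the main model.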
Results
  • Results on SQuAD v1.1 and Adversarial SQuAD are listed in Table 3.
  • The weak learner alone has low performance both on in-distribution and adversarial sets.
  • PoE training improves the adversarial performance (+1% on AddSent) while sacrificing some in-distribution performance.
  • A multi-loss optimization closes the gap and even boosts adversarial robustness (+3% on AddSent and +2% on AddOneSent).
  • In contrast to the experiments on MNLI/HANS, multi-loss training here leads to better out-of-distribution performance as well.
  • The multi-loss objective in this case strikes a balance between learning from the weak learner and removing its signal; a sketch of the combined loss follows below.
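A hedged sketch of how such a multi-loss objective can be formed, mirroring the PoE loss from the Methods sketch and using the 1.0 / 0.3 weights reported in the appendix; the function name and weight arguments are illustrative.

```python
import torch.nn.functional as F

def multi_loss(main_logits, weak_logits, labels, w_poe=1.0, w_ce=0.3):
    """Weighted sum of the PoE objective and the plain cross-entropy
    computed on the main model's own logits (weights as reported in
    the appendix: PoE term 1.0, standard CE term 0.3)."""
    combined = F.log_softmax(main_logits, dim=-1) + F.log_softmax(weak_logits, dim=-1).detach()
    poe = F.cross_entropy(combined, labels)
    ce = F.cross_entropy(main_logits, labels)
    return w_poe * poe + w_ce * ce
```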
Conclusion
  • Leveraging a weak learner with limited capacity and a modified product of experts training setup, the authors show that dataset biases do not need to be explicitly known or modeled to be able to train models that can generalize significantly better to out-of-distribution examples.
  • The authors discuss the design choices for such a weak learner and investigate how using higher-capacity learners leads to higher out-of-distribution performance and a trade-off with in-distribution performance.
  • The authors believe that such approaches, capable of automatically identifying and mitigating dataset biases, will be essential tools for future bias-discovery and mitigation techniques.
Tables
  • Table 1: Breakdown of the 1,000 top certain/incorrect training examples by category.
  • Table 2: MNLI matched dev accuracies, HANS accuracies and MNLI matched hard test set accuracies. All numbers are averaged over 6 runs (with standard deviations). Detailed results on HANS are given in Appendix A.4. Reported results are indicated with *. ♣ Utama et al (2020) is a concurrent work that uses a BERT-base fine-tuned on 2000 random examples from MNLI as a weak learner; "PoE + An." refers to the annealing mechanism proposed by those authors.
  • Table 3: F1 scores on SQuAD and Adversarial QA. The AddOneSent set is model-agnostic, while the AddSent set is obtained using an ensemble of BiDAF models (Seo et al, 2017). * are reported results.
  • Table 4: Accuracies on the FEVER dev set (Thorne et al, 2018) and the symmetric hard test set (Schuster et al, 2019).
  • Table 5: Weak learners are able to detect previously reported dataset biases without explicitly modeling them.
  • Table 6: HANS results per heuristic. All numbers are averaged over 6 runs (with standard deviations).
  • Table 7: Accuracies on the MNLI mismatched dev set and the mismatched hard test set.
  • Table 8: Transfer accuracies on NLI benchmarks. We train BERT-base on SNLI and test on the target test set. * are reported results.
Related work
  • Many studies have reported dataset biases in various settings. Examples include visual question answering (Jabri et al, 2016; Zhang et al, 2016), story completion (Schwartz et al, 2017), and reading comprehension (Kaushik & Lipton, 2018; Chen et al, 2016). Towards better evaluation methods, researchers have proposed to collect “challenge” datasets that account for surface correlations a model might adopt (Jia & Liang, 2017; McCoy et al, 2019b). Standard models without specific robust training methods often drop in performance when evaluated on these challenge sets.

    While these works have focused on data collection, another approach is to develop methods allowing models to ignore dataset biases during training. Several active areas of research tackle this challenge by adversarial training (Belinkov et al, 2019; Stacey et al, 2020), example forgetting (Yaghoobzadeh et al, 2019) and dynamic loss adjustment (Cadene et al, 2019). Previous work (He et al, 2019; Clark et al, 2019; Mahabadi et al, 2020) has shown the effectiveness of product of experts to train unbiased models. In our work, we show that we do not need to explicitly model biases to apply these debiasing methods and can use a more general setup than previously presented.
Weak learner capacity
  • When trained on MNLI, the weak learner reaches 67% accuracy on the matched development set (compared to 84% for BERT-base).
  • When trained jointly with the larger MediumBERT weak learner (41.4 million parameters), the main model reaches 97% accuracy on HANS's heuristic-non-entailed set but very low accuracy on the in-distribution examples (28% on MNLI and 3% on the heuristic-entailed examples).
Study subjects and analysis
Figure 1b shows the concentration of MNLI training examples (Williams et al, 2018) projected onto the 2D coordinates (correctness, certainty) obtained from a trained weak learner (described in Section 4.1). There are many examples in each of the three cases. More crucially, the certain/incorrect group is not empty, and the examples in this group are the ones that encourage the model not to rely on the dataset biases. A hypothetical sketch of such a projection is given below.
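For illustration, a sketch of how per-example (correctness, certainty) coordinates could be computed from a trained weak learner. The definitions below (correctness as the probability assigned to the gold label, certainty as the maximum class probability) are assumptions for this sketch and may differ from the paper's exact criteria.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def correctness_certainty(weak_model, inputs, labels):
    """Hypothetical per-example coordinates:
      correctness: probability the weak learner assigns to the gold label
      certainty:   the weak learner's maximum class probability."""
    probs = F.softmax(weak_model(inputs), dim=-1)
    correctness = probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    certainty = probs.max(dim=-1).values
    return correctness, certainty
```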

Reference
  • Yonatan Belinkov and Yonatan Bisk. Synthetic and natural noise both break neural machine translation. arXiv, abs/1711.02173, 2018.
  • Yonatan Belinkov, Adam Poliak, S. Shieber, Benjamin Van Durme, and Alexander M. Rush. Don't take the premise for granted: Mitigating artifacts in natural language inference. arXiv, abs/1907.04380, 2019.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. arXiv, abs/1508.05326, 2015.
  • Remi Cadene, Corentin Dancette, Hedi Ben-younes, M. Cord, and D. Parikh. RUBi: Reducing unimodal biases in visual question answering. In NeurIPS, 2019.
  • Danqi Chen, J. Bolton, and Christopher D. Manning. A thorough examination of the CNN/Daily Mail reading comprehension task. arXiv, abs/1606.02858, 2016.
  • Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. arXiv, abs/1909.03683, 2019.
  • J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
  • T. Furlanello, Zachary Chase Lipton, Michael Tschannen, L. Itti, and Anima Anandkumar. Born again neural networks. In ICML, 2018.
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In NAACL-HLT, 2018.
  • He He, Sheng Zha, and Haohan Wang. Unlearn dataset bias in natural language inference by fitting the residual. In DeepLo@EMNLP-IJCNLP, 2019.
  • Dan Hendrycks, X. Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and D. Song. Pretrained transformers improve out-of-distribution robustness. In ACL, 2020.
  • Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
  • Geoffrey E. Hinton, Oriol Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv, abs/1503.02531, 2015.
  • A. Jabri, Armand Joulin, and L. V. D. Maaten. Revisiting visual question answering baselines. In ECCV, 2016.
  • Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. arXiv, abs/1707.07328, 2017.
  • Divyansh Kaushik and Zachary Chase Lipton. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In EMNLP, 2018.
  • Tushar Khot, A. Sabharwal, and Peter Clark. SciTail: A textual entailment dataset from science question answering. In AAAI, 2018.
  • Tal Linzen. How can we accelerate progress towards human-like linguistic generalization? In ACL, 2020.
  • Rabeeh Karimi Mahabadi, Yonatan Belinkov, and J. Henderson. End-to-end bias mitigation by modelling biases in corpora. In ACL, 2020.
  • R. T. McCoy, Junghyun Min, and Tal Linzen. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. arXiv, abs/1911.02969, 2019a.
  • R. T. McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv, abs/1902.01007, 2019b.
  • Junghyun Min, R. T. McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen. Syntactic data augmentation increases robustness to inference heuristics. In ACL, 2020.
  • Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding, 2020.
  • Ellie Pavlick and Chris Callison-Burch. Most "babies" are "little" and most "problems" are "huge": Compositional entailment in adjective-nouns. In ACL, 2016.
  • Ellie Pavlick, T. Wolfe, Pushpendre Rastogi, Chris Callison-Burch, Mark Dredze, and Benjamin Van Durme. FrameNet+: Fast paraphrastic tripling of FrameNet. In ACL, 2015.
  • Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis only baselines in natural language inference. arXiv, abs/1805.01042, 2018.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, W. Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv, abs/1910.10683, 2019.
  • Altaf Rahman and Vincent Ng. Resolving complex cases of definite pronouns: The Winograd schema challenge. In EMNLP-CoNLL, 2012.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv, abs/1606.05250, 2016.
  • Drew Reisinger, Rachel Rudinger, Francis Ferraro, Craig Harman, Kyle Rawlins, and Benjamin Van Durme. Semantic proto-roles. Transactions of the Association for Computational Linguistics, 3:475–488, 2015. URL https://www.aclweb.org/anthology/Q15-1034.
  • Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In AAAI, 2020.
  • Tal Schuster, Darsh J. Shah, Yun Jie Serene Yeo, Daniel Filizzola, Enrico Santus, and R. Barzilay. Towards debiasing fact verification models. 2019.
  • Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. The effect of different writing tasks on linguistic style: A case study of the ROC Story Cloze task. arXiv, abs/1702.01841, 2017.
  • Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv, abs/1611.01603, 2017.
  • Abhinav Shrivastava, A. Gupta, and Ross B. Girshick. Training region-based object detectors with online hard example mining. In CVPR, pp. 761–769, 2016.
  • Joe Stacey, Pasquale Minervini, Haim Dubossarsky, Sebastian Riedel, and Tim Rocktäschel. There is strength in numbers: Avoiding the hypothesis-only bias in natural language inference via ensemble adversarial training. arXiv, abs/2004.07790, 2020.
  • Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. Dataset cartography: Mapping and diagnosing datasets with training dynamics. 2020.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and A. Mittal. FEVER: A large-scale dataset for fact extraction and verification. In NAACL, 2018.
  • D. Tsipras, Shibani Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. arXiv: Machine Learning, 2019.
  • M. Tsuchiya. Performance impact caused by hidden bias of training data for recognizing textual entailment. arXiv, abs/1804.08117, 2018.
  • Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: On the importance of pre-training compact models. arXiv: Computation and Language, 2019.
  • Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. Towards debiasing NLU models from unknown biases, 2020.
  • Kailas Vodrahalli, K. Li, and Jitendra Malik. Are all training examples created equal? An empirical study. arXiv, abs/1811.12569, 2018.
  • Alex Wang, Amanpreet Singh, Julian Michael, F. Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv, abs/1804.07461, 2018.
  • Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, F. Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv, abs/1905.00537, 2019.
  • Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural QA as simple as possible but not simpler. In CoNLL, 2017.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv, abs/1910.03771, 2019.
  • Qizhe Xie, E. Hovy, Minh-Thang Luong, and Quoc V. Le. Self-training with noisy student improves ImageNet classification. In CVPR, pp. 10684–10695, 2020.
  • Yadollah Yaghoobzadeh, R. Tachet, Timothy J. Hazen, and Alessandro Sordoni. Robust natural language inference models with example forgetting. arXiv, abs/1911.03861, 2019.
  • Dani Yogatama, Cyprien de Masson d'Autume, J. Connor, Tomas Kocisky, M. Chrzanowski, Lingpeng Kong, A. Lazaridou, W. Ling, L. Yu, Chris Dyer, and P. Blunsom. Learning and evaluating general linguistic intelligence. arXiv, abs/1901.11373, 2019.
  • Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In ACL, 2019.
  • Hongyang Zhang, Yaodong Yu, J. Jiao, E. Xing, L. Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, 2019.
  • P. Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and D. Parikh. Yin and yang: Balancing and answering binary visual questions. In CVPR, pp. 5014–5022, 2016.
  • Jieyu Zhao, Tianlu Wang, Mark Yatskar, V. Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv, abs/1804.06876, 2018.
Experimental setup
  • Our code is based on the Hugging Face Transformers library (Wolf et al., 2019). All experiments are conducted on a single 16GB V100 using half-precision training for speed.
  • NLI: We fine-tune a pre-trained TinyBERT (Turc et al., 2019) as our weak learner with the following hyper-parameters: 3 epochs of training, a learning rate of 3e−5, and a batch size of 32. The learning rate is linearly increased for 2000 warmup steps and linearly decreased to 0 afterward. We use an Adam optimizer with β = (0.9, 0.999) and ε = 1e−8, and add a weight decay of 0.1. Our robust model is BERT-base-uncased and uses the same hyper-parameters. When we train a robust model with a multi-loss objective, we give the standard CE a weight of 0.3 and the PoE cross-entropy a weight of 1.0. Because of the high variance on HANS (McCoy et al., 2019a), we average numbers over 6 runs with different seeds.
  • SQuAD: The setup matches NLI (TinyBERT weak learner, BERT-base-uncased robust model, learning rate 3e−5, 3 epochs, same optimizer and multi-loss weights) except for a batch size of 16 and 1500 warmup steps.
  • Fact verification: The setup matches NLI except for a learning rate of 2e−5 and 1000 warmup steps; numbers are averaged over 6 runs with different seeds.
  • Following Mahabadi et al. (2020), we also experiment on a fact verification dataset. The FEVER dataset (Thorne et al., 2018) contains claim-evidence pairs generated from Wikipedia. Schuster et al. (2019) collected a new evaluation set for FEVER to avoid the biases observed in the claims of the benchmark: they symmetrically augment the claim-evidence pairs of the FEVER evaluation set to balance the detected artifacts, such that relying solely on statistical cues in the claims would lead to random guessing. The collected dataset is challenging, and the performance of models that rely on biases drops significantly on it.
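A sketch of how the reported NLI optimizer and schedule settings could be instantiated with PyTorch and the Transformers library. The helper function and checkpoint names are illustrative rather than taken from the authors' code: bert-base-uncased is the robust model named above, and google/bert_uncased_L-2_H-128_A-2 is one compact checkpoint from Turc et al. (2019) that could serve as the weak learner.

```python
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

def build_optimizer_and_schedule(model, num_training_steps,
                                 lr=3e-5, warmup_steps=2000, weight_decay=0.1):
    """Adam-style optimizer with the reported betas/eps/weight decay, plus a
    schedule that warms the learning rate up linearly for `warmup_steps`
    and then decays it linearly to 0."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  betas=(0.9, 0.999), eps=1e-8,
                                  weight_decay=weight_decay)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps,
        num_training_steps=num_training_steps)
    return optimizer, scheduler

# Robust (main) model and a compact weak learner, both 3-way NLI classifiers.
main_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
weak_model = AutoModelForSequenceClassification.from_pretrained(
    "google/bert_uncased_L-2_H-128_A-2", num_labels=3)  # illustrative compact checkpoint

# In practice: num_training_steps = len(train_dataloader) * num_epochs (3 here).
num_training_steps = 10_000  # placeholder value
optimizer, scheduler = build_optimizer_and_schedule(main_model, num_training_steps)
```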