Adversarial Filters of Dataset Biases

Bhagavatula Chandra
Peters Matthew E.

ICML, pp. 1078-1088, 2020.


Abstract:

Large neural models have demonstrated human-level performance on language and vision benchmarks such as ImageNet and Stanford Natural Language Inference (SNLI). Yet, their performance degrades considerably when tested on adversarial or out-of-distribution samples. This raises the question of whether these models have learned to solve a dataset rather than the underlying task by overfitting to spurious dataset biases.

Introduction
  • Large-scale neural networks have achieved superhuman performance across many popular AI benchmarks, for tasks as diverse as image recognition (ImageNet; Russakovsky et al., 2015), natural language inference (SNLI; Bowman et al., 2015), and question answering (SQuAD; Rajpurkar et al., 2016).
  • The performance of such neural models degrades considerably when tested on out-of-distribution or adversarial samples, otherwise known as data “in the wild” (Eykholt et al., 2018; Jia & Liang, 2017).
  • This phenomenon indicates that the high performance of the strongest AI models is often confined to specific datasets, implicitly making a closed-world assumption.
  • Not only do dataset biases inevitably bias the models trained on them, but they have also been shown to significantly inflate model performance, leading to an overestimation of the true capabilities of current AI systems (Sakaguchi et al., 2020; Hendrycks et al., 2019)
Highlights
  • Large-scale neural networks have achieved superhuman performance across many popular AI benchmarks, for tasks as diverse as image recognition (ImageNet; Russakovsky et al., 2015), natural language inference (SNLI; Bowman et al., 2015), and question answering (SQuAD; Rajpurkar et al., 2016)
  • We evaluate AFLITE for this criterion on the natural language inference task
  • We presented a deep-dive into AFLITE – an iterative greedy algorithm that adversarially filters out spurious biases from data for accurate benchmark estimation
  • We presented a theoretical framework supporting AFLITE, and showed its effectiveness in bias reduction on synthetic and real datasets, providing extensive analyses
  • We showed that, on out-of-distribution and adversarial test sets, models trained on the AFLITE-filtered subset generalize better
  • We hope that dataset creators will employ AFLITE to identify unobservable artifacts before releasing new challenge datasets for more reliable estimates of task progress on future AI benchmarks
Results
  • Training on AFLITE-filtered ImageNet leads to models with greater generalization than training on a randomly sampled ImageNet subset of the same size, with up to a 2% improvement in performance.
Conclusion
  • The authors presented a deep-dive into AFLITE – an iterative greedy algorithm that adversarially filters out spurious biases from data for accurate benchmark estimation (a minimal sketch of the filtering loop follows this list).
  • The authors apply AFLITE to four datasets, including widely used benchmarks such as SNLI and ImageNet, and show that the strongest performance on the resulting filtered dataset drops by 30 points for SNLI and 20 points for ImageNet. They also showed that, on out-of-distribution and adversarial test sets, models trained on the AFLITE-filtered subset generalize better.
  • All datasets and code for this work will be made public soon.
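The code had not yet been released when this page was generated, so the following is only a minimal sketch of an AFLITE-style filtering loop as described above, assuming pre-computed frozen embeddings (e.g., from RoBERTa) in an array X with labels y. The function names, defaults, and scoring details are illustrative, following the Table 7 notation below (m partitions, probe training size t, slice size k, threshold τ), not the authors' implementation.

```python
# Minimal AFLITE-style sketch (illustrative; not the authors' released code).
# X: (n, d) array of frozen embeddings; y: (n,) integer labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def predictability_scores(X, y, m=64, t=None, seed=0):
    """For each instance, estimate how often cheap linear probes trained
    on random subsets classify it correctly when held out."""
    rng = np.random.default_rng(seed)
    n = len(y)
    t = t or n // 2
    correct, counts = np.zeros(n), np.zeros(n)
    for _ in range(m):
        train_idx = rng.choice(n, size=t, replace=False)
        heldout = np.setdiff1d(np.arange(n), train_idx)
        probe = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        correct[heldout] += probe.predict(X[heldout]) == y[heldout]
        counts[heldout] += 1
    return correct / np.maximum(counts, 1)

def aflite(X, y, k=500, tau=0.75, m=64, t=None, target_size=None):
    """Greedily drop the k most linearly predictable instances per
    iteration until no score exceeds tau (or a target size is reached)."""
    keep = np.arange(len(y))
    while len(keep) > (target_size or 0):
        scores = predictability_scores(X[keep], y[keep], m=m, t=t)
        too_easy = np.argsort(-scores)[:k]
        too_easy = too_easy[scores[too_easy] > tau]
        if len(too_easy) == 0:  # early stopping: nothing left is "too easy"
            break
        keep = np.delete(keep, too_easy)
    return keep  # indices of the retained, harder subset
```

Note that the loop never re-trains the expensive representation model; only the linear probes are re-fit each iteration, which is what keeps the procedure tractable on datasets the size of SNLI or ImageNet.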
Tables
  • Table1: Zero-shot SNLI accuracy on three out-of-distribution evaluation tasks, comparing RoBERTa-large models trained on the original SNLI data (D, size 550k), AFLITE-filtered data (D(φRoBERTa), size 182k), and on a random subset with the same size as the filtered data (D182k). We report average accuracy across 5 random seeds; the subscript denotes standard deviation. On HANS, all models are evaluated on the non-entailment cases of the three syntactic heuristics (Lexical overlap, Subsequence, and Constituent). The NLI-Diagnostics dataset is broken down into instances requiring logical reasoning (Logic), world and commonsense knowledge (Knowledge), lexical semantics, or predicate-argument structures. Stress tests for NLI are further categorized into Competence, Distraction, and Noise tests
  • Table2: SNLI accuracy on Adversarial NLI using RoBERTa-large models pre-trained on the original SNLI data (D, size 550k) and on AFLITE-filtered data (D(φRoBERTa), size 182k). Both models were finetuned on the in-distribution training data for each round (Rd1, Rd2, and Rd3)
  • Table3: Dev accuracy (%) on the original SNLI dataset D and the datasets obtained through different AFLITE-filtering and other baselines. D92k indicates a randomly subsampled train dataset of the same size as D(φRoBERTa). ∆ indicates the difference in performance between the full model and the model trained on D(φRoBERTa)
  • Table4: Dev accuracy (%) on original and AFLITE-filtered MNLI-matched and QNLI. The -PartialInput baselines denote models trained only on the hypothesis for MNLI instances and only on the answer for QNLI. ∆ indicates the difference in performance between the full model and the model trained on AFLITE-filtered data
  • Table5: Top-1 accuracy on ImageNet-A (Hendrycks et al., 2019), an adversarial test set for image classification. The strongest model, EfficientNet-B7, improves by 2% on out-of-distribution ImageNet-A images when trained on AFLITE-filtered data
  • Table6: Results on ImageNet, in Top-1 accuracy (%). We trained on AFLITE-filtered instances (D(ΦEN-B7)), and compare this to an equal-sized but random 40% subsample of ImageNet (D40%). We report results on the ImageNet validation set before and after filtering with AFLITE. ∆ indicates the difference in accuracy of the full model and the filtered model. Notably, evaluating on ImageNet-AFLITE is much harder—resulting in a drop of nearly 21 percentage points in accuracy for the strongest model
  • Table7: AFLITE hyperparameters used for in-distribution benchmark estimation on different datasets. m denotes the size of the support of the expectation in Eq. (4), t is the training set size for the linear classifiers, k is the size of each slice, and τ is an early-stopping filtering threshold. For ImageNet, we set n = 640K and hence do not need to control for τ. In every other setting, we set τ as above, and hence do not need to control for n
  • Table8: Mean dev accuracy (%) of two models trained on four synthetic datasets before (D) and after (D(Φ)) AFLITE. Standard deviation across 10 runs with randomly chosen seeds is provided as a subscript. The datasets, also shown in Fig. 3, differ in the degree of separation between the two classes. Both models (an SVM with an RBF kernel and a linear classifier with logistic regression) perform well on the original synthetic dataset, before filtering; the linear classifier performs well only because the data contains spurious artifacts, making the task artificially easier for it. However, after AFLITE, the linear model, relying mostly on the spurious features, clearly underperforms (an illustrative sketch of this setup follows this list)
  • Table9: Examples from SNLI, removed (top) and retained (bottom) by AFLITE. As is evident, the retained instances are slightly more challenging and capture more nuanced semantics than the removed instances. Removed instances also exhibit larger word overlap, along with many other artifacts identified by Gururangan et al. (2018)
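As a concrete illustration of the Table 8 setup, the sketch below builds a synthetic two-class dataset whose true features require a non-linear boundary, then appends a spurious "artifact" feature that leaks the label most of the time. All dataset parameters here are assumptions for illustration, not the paper's exact construction.

```python
# Illustrative synthetic-bias experiment in the spirit of Table 8.
# True features: two concentric rings (linearly inseparable); a third,
# spurious feature equals the label 90% of the time.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
radius = np.where(rng.random(n) < 0.5, 1.0, 3.0)
angle = rng.uniform(0, 2 * np.pi, n)
X_true = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
X_true += rng.normal(scale=0.3, size=X_true.shape)
y = (radius > 2.0).astype(int)

leak = rng.random(n) < 0.9  # artifact leaks the label on 90% of instances
artifact = np.where(leak, y, rng.integers(0, 2, n)).astype(float)
X = np.column_stack([X_true, artifact])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
svm = SVC(kernel="rbf").fit(X_tr, y_tr)
lin = LogisticRegression().fit(X_tr, y_tr)
print("SVM (RBF) :", svm.score(X_te, y_te))  # high: can use the true features
print("Linear    :", lin.score(X_te, y_te))  # high only via the artifact
```

Once an AFLITE-style pass removes the instances that linear probes get right (i.e., those where the artifact leaks the label), the linear model falls toward chance while the RBF model stays strong, matching the pattern Table 8 reports.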
Related work
  • AFLITE is related to Zellers et al. (2018)’s adversarial filtering (AF) algorithm, yet distinct in two key ways: it is (i) much more broadly applicable (it does not require over-generation of data instances), and (ii) considerably more lightweight (it does not require re-training a model at each iteration of AF). Variants of this AF approach have recently been used to create other datasets such as HellaSwag (Zellers et al., 2019) and Abductive NLI (αNLI; Bhagavatula et al., 2019) by iteratively perturbing dataset instances until a target model cannot fit the resulting dataset. While effective, these approaches run into three main pitfalls. First, dataset curators need to explicitly devise a strategy for collecting or generating perturbations of a given instance. Second, the approach runs the risk of distributional bias, where a discriminator can learn to distinguish between machine-generated and human-generated instances. Finally, it requires re-training a model at each iteration, which is computationally expensive, especially when using a large model such as BERT as the adversary (the sketch below illustrates this cost asymmetry). In contrast, AFLITE focuses on removing dataset biases from existing datasets rather than adversarially perturbing instances. AFLITE was earlier proposed by Sakaguchi et al. (2020) to create the Winogrande dataset; this paper presents more thorough experiments, a theoretical justification, and results from generalizing the approach to multiple popular NLP and vision datasets.
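To make the "considerably more lightweight" contrast concrete: the expensive model runs once as a frozen encoder, and every filtering iteration afterwards only fits fast linear probes on the cached features. Below is a minimal sketch of that one-time encoding step using the transformers library (Wolf et al., 2019); the model choice and mean pooling are illustrative assumptions, not the authors' exact setup.

```python
# Why AFLITE is cheaper than classic adversarial filtering: the deep
# model is used once as a frozen encoder; each filtering iteration then
# only fits linear probes on the cached embeddings (seconds of CPU work),
# instead of re-fine-tuning e.g. BERT per iteration.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression  # the per-iteration probe

tok = AutoTokenizer.from_pretrained("roberta-base")
enc = AutoModel.from_pretrained("roberta-base").eval()

@torch.no_grad()
def embed(sentences):
    """One-time, frozen forward pass; outputs are cached and reused."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = enc(**batch).last_hidden_state          # (B, L, d)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, L, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # mean pooling

# Each AFLITE iteration is then just, e.g.:
#   LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
# with X = embed(all_sentences) computed a single time up front.
```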
References
  • Alcorn, M. A., Li, Q., Gong, Z., Wang, C., Mai, L., Ku, W.-S., and Nguyen, A. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In CVPR, 2019.
  • Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization, 2019. URL https://arxiv.org/abs/1907.02893.
  • Balog, M., Tripuraneni, N., Ghahramani, Z., and Weller, A. Lost relatives of the Gumbel trick. In ICML, 2017.
  • Bhagavatula, C., Bras, R. L., Malaviya, C., Sakaguchi, K., Holtzman, A., Rashkin, H., Downey, D., Yih, W.-t., and Choi, Y. Abductive commonsense reasoning. In ICLR, 2019. URL https://arxiv.org/abs/1908.05739.
  • Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In EMNLP, 2015. URL https://www.aclweb.org/anthology/D15-1075.
  • Chen, Q., Zhu, X.-D., Ling, Z.-H., Wei, S., Jiang, H., and Inkpen, D. Enhanced LSTM for natural language inference. In ACL, 2017. URL https://www.aclweb.org/anthology/P17-1152.
  • Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. RandAugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
  • Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., and Song, D. X. Robust physical-world attacks on deep learning models. In CVPR, 2018.
  • Fouhey, D. F., Kuo, W.-c., Efros, A. A., and Malik, J. From lifestyle vlogs to everyday interactions. In CVPR, 2018.
  • Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, 2018.
  • Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
  • Gumbel, E. J. and Lieblein, J. Statistical theory of extreme values and some practical applications: A series of lectures. In Applied Mathematics Series, volume 33. National Bureau of Standards, USA, 1954.
  • Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. Annotation artifacts in natural language inference data. In NAACL, 2018. URL https://www.aclweb.org/anthology/N18-2017/.
  • He, H., Zha, S., and Wang, H. Unlearn dataset bias in natural language inference by fitting the residual. ArXiv, 2019. URL https://arxiv.org/abs/1908.10763.
  • He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
  • Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., and Song, D. Natural adversarial examples. arXiv preprint arXiv:1907.07174, 2019.
  • Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. In ICLR, 2016.
  • Jia, R. and Liang, P. Adversarial examples for evaluating reading comprehension systems. In EMNLP, 2017. URL https://www.aclweb.org/anthology/D17-1215.
  • Kim, C., Sabharwal, A., and Ermon, S. Exact sampling with integer linear programs and random perturbations. In AAAI, 2016.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2014. URL https://arxiv.org/abs/1412.6980.
  • Kool, W., van Hoof, H., and Welling, M. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. In ICML, 2019.
  • Li, Y., Li, Y., and Vasconcelos, N. RESOUND: Towards action recognition without representation bias. In ECCV, 2018.
  • Li, Y. C. and Vasconcelos, N. REPAIR: Removing representation bias by dataset resampling. In CVPR, 2019.
  • Liu, N. F., Schwartz, R., and Smith, N. A. Inoculation by fine-tuning: A method for analyzing challenge datasets. In NAACL, 2019a. doi: 10.18653/v1/N19-1225. URL https://www.aclweb.org/anthology/N19-1225.
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M. S., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. S., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach, 2019b. URL https://arxiv.org/abs/1907.11692.
  • Maddison, C. J., Tarlow, D., and Minka, T. A* sampling. In NeurIPS, 2014.
  • Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2016.
  • McCoy, R. T., Min, J., and Linzen, T. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance, 2019a. URL https://arxiv.org/abs/1911.02969.
  • McCoy, T., Pavlick, E., and Linzen, T. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In ACL, 2019b. doi: 10.18653/v1/P19-1334. URL https://www.aclweb.org/anthology/P19-1334.
  • Naik, A., Ravichander, A., Sadeh, N., Rose, C., and Neubig, G. Stress test evaluation for natural language inference. In COLING, 2018. URL https://www.aclweb.org/anthology/C18-1198.
  • Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., and Kiela, D. Adversarial NLI: A new benchmark for natural language understanding, 2019. URL https://arxiv.org/abs/1910.14599.
  • Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In EMNLP, 2014. URL https://www.aclweb.org/anthology/D14-1162.
  • Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. S. Deep contextualized word representations. In NAACL, 2018. URL https://www.aclweb.org/anthology/N18-1202.
  • Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., and Van Durme, B. Hypothesis only baselines in natural language inference. In *SEM, 2018. URL https://www.aclweb.org/anthology/S18-2023.
  • Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016. URL https://www.aclweb.org/anthology/D16-1264.
  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. IJCV, 2015. doi: 10.1007/s11263-015-0816-y.
  • Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WINOGRANDE: An adversarial Winograd schema challenge at scale. In AAAI, 2020. URL https://arxiv.org/abs/1907.10641.
  • Tan, M. and Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
  • Torralba, A. and Efros, A. A. Unbiased look at dataset bias. In CVPR, 2011.
  • Tsuchiya, M. Performance impact caused by hidden bias of training data for recognizing textual entailment. In LREC, 2018.
  • Vieira, T. Gumbel-max trick and weighted reservoir sampling, 2014. URL https://bit.ly/310I39S.
  • Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2018a. URL https://arxiv.org/abs/1804.07461.
  • Wang, T., Zhu, J., Torralba, A., and Efros, A. A. Dataset distillation, 2018b. URL http://arxiv.org/abs/1811.10959.
  • Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL, 2018. doi: 10.18653/v1/N18-1101. URL https://www.aclweb.org/anthology/N18-1101.
  • Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. HuggingFace's Transformers: State-of-the-art natural language processing, 2019. URL https://www.arxiv.org/abs/1910.03771.
  • Zellers, R., Bisk, Y., Schwartz, R., and Choi, Y. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In EMNLP, 2018. URL https://www.aclweb.org/anthology/D18-1009.
  • Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In ACL, 2019. URL https://www.aclweb.org/anthology/P19-1472.