Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

ACL, pp. 4902-4912, 2020.


Abstract:

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a model-agnostic and task-agnostic methodology that tests individual capabilities of the model using three different test types.
Introduction
  • One of the primary goals of training NLP models is generalization. Since testing “in the wild” is expensive and does not allow for fast iterations, the standard paradigm for evaluation is using train-validation-test splits to estimate the accuracy of the model, including the use of leaderboards to track progress on a task (Rajpurkar et al, 2016).
  • A number of additional evaluation approaches have been proposed, such as evaluating robustness to noise (Belinkov and Bisk, 2018; Rychalska et al, 2019) or adversarial changes (Ribeiro et al, 2018; Iyyer et al, 2018), fairness (Prabhakaran et al, 2019), logical consistency (Ribeiro et al, 2019), explanations (Ribeiro et al, 2016), diagnostic datasets (Wang et al, 2019b), and interactive error analysis (Wu et al, 2019)
  • These approaches focus either on individual tasks such as Question Answering or Natural Language Inference, or on a few capabilities, and do not provide comprehensive guidance on how to evaluate models.
  • While there are clear similarities, many insights from software engineering are yet to be applied to NLP models
Highlights
  • One of the primary goals of training NLP models is generalization
  • While there are clear similarities, many insights from software engineering are yet to be applied to NLP models
  • Accuracy on benchmarks is not sufficient for evaluating NLP models
  • Adopting principles from behavioral testing in software engineering, we propose CheckList, a model-agnostic and task-agnostic testing methodology that tests individual capabilities of the model using three different test types (a minimal illustrative sketch follows this list)
  • Our user studies indicate that CheckList is easy to learn and use, and helpful both for expert users who have tested their models at length as well as for practitioners with little experience in a task
  • Since many tests can be applied across tasks as is or with minor variations, we expect that collaborative test creation will result in evaluation of NLP models that is much more robust and detailed, beyond just accuracy on held-out data
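The three test types referenced in the highlights above can be made concrete with a small, self-contained sketch. The toy lexicon classifier, the example sentences, and the 0.5 decision threshold are illustrative assumptions (not taken from the paper); the sketch only shows how MFT, INV, and DIR checks are evaluated in principle.

```python
# A minimal, library-free sketch of CheckList's three test types, using a
# toy sentiment classifier as a stand-in for a real model (an assumption
# made purely for illustration).

def toy_sentiment(text: str) -> float:
    """Return a probability of positive sentiment from a tiny lexicon."""
    pos = {"good", "great", "love", "excellent"}
    neg = {"bad", "terrible", "hate", "awful"}
    tokens = text.lower().replace(".", "").replace(",", "").split()
    score = sum(t in pos for t in tokens) - sum(t in neg for t in tokens)
    return 1 / (1 + 2.718281828 ** (-score))  # squash the score into (0, 1)

# Minimum Functionality Test (MFT): simple labeled cases probing one capability.
mft_cases = [("I love this airline.", 1), ("I hate this airline.", 0)]
mft_failures = [(x, y) for x, y in mft_cases if (toy_sentiment(x) > 0.5) != bool(y)]

# Invariance test (INV): a label-preserving perturbation (here, adding a handle)
# should not flip the prediction.
inv_pairs = [("I love this airline.", "I love this airline, @united.")]
inv_failures = [(a, b) for a, b in inv_pairs
                if (toy_sentiment(a) > 0.5) != (toy_sentiment(b) > 0.5)]

# Directional Expectation test (DIR): the perturbation should move the score in
# a known direction (appending a negative clause should not raise it).
dir_pairs = [("The food was good.", "The food was good. The service was terrible.")]
dir_failures = [(a, b) for a, b in dir_pairs if toy_sentiment(b) > toy_sentiment(a)]

print("MFT failures:", mft_failures)
print("INV failures:", inv_failures)
print("DIR failures:", dir_failures)
```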
Conclusion
  • The authors applied the same process to very different tasks, and found that tests reveal interesting failures on a variety of task-relevant linguistic capabilities.
  • While some tests are task specific, the capabilities and test types are general; many can be applied across tasks, as is or with minor variation
  • This small selection of tests illustrates the benefits of systematic testing in addition to standard evaluation.
  • These tasks may be considered “solved” based on benchmark accuracy results, but the tests highlight various areas of improvement – in particular, failure to demonstrate basic skills that are de facto needs for the task at hand.
  • CheckList is open source and available at https://github.com/marcotcr/checklist; a brief usage sketch follows below
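For orientation, a minimum functionality test might be authored with the open-source package roughly as sketched below. This is a hedged sketch: the module paths and argument names (Editor, MFT, PredictorWrapper.wrap_softmax) are assumed from the repository's README and may differ across versions, and sentiment_predict_proba is a hypothetical stand-in for a real model's probability function; consult the repository for the authoritative API.

```python
# Hedged sketch of authoring an MFT with the checklist package; interfaces
# assumed from the project README (check the repository for the current API).
import numpy as np
from checklist.editor import Editor
from checklist.test_types import MFT
from checklist.pred_wrapper import PredictorWrapper

def sentiment_predict_proba(texts):
    # Hypothetical stand-in model: returns [P(negative), P(positive)] per text.
    return np.array([[0.4, 0.6]] * len(texts))

editor = Editor()
# Expand a template into test cases that should all be predicted negative.
ret = editor.template('I {neg_verb} the {noun}.',
                      neg_verb=['hate', 'dread', 'despise'],
                      noun=['food', 'seats', 'service'])

# Minimum Functionality Test: every generated case carries label 0 (negative).
test = MFT(ret.data, labels=0, name='Clearly negative statements',
           capability='Vocabulary',
           description='Short sentences with a strongly negative verb.')

# Wrap the probability function and run; summary() reports the failure rate
# along with a few example failures.
test.run(PredictorWrapper.wrap_softmax(sentiment_predict_proba))
test.summary()
```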
Tables
  • Table1: A selection of tests for sentiment analysis. All examples (right) are failures of at least one model
  • Table2: A selection of tests for Quora Question Pairs (QQP). All examples (right) are failures of at least one model
  • Table3: A selection of tests for Machine Comprehension
  • Table4: User Study Results: the first three rows give the number of tests created, the number of test cases per test, and the number of capabilities tested; the last two rows report the severity users assigned to their findings
Related work
  • One approach to evaluate specific linguistic capabilities is to create challenge datasets. Belinkov and Glass (2019) note benefits of this approach, such as systematic control over data, as well as drawbacks, such as small scale and lack of resemblance to “real” data. Further, they note that the majority of challenge sets are for Natural Language Inference. We do not aim for CheckList to replace challenge or benchmark datasets, but to complement them. We believe CheckList maintains many of the benefits of challenge sets while mitigating their drawbacks: authoring examples from scratch with templates provides systematic control, while perturbation-based INV and DIR tests allow for testing behavior in unlabeled, naturally occurring data (see the sketch after this paragraph). While many challenge sets focus on extreme or difficult cases (Naik et al, 2018), MFTs also focus on what should be easy cases given a capability, uncovering severe bugs. Finally, the user study demonstrates that CheckList can be used effectively for a variety of tasks with low effort: users created a complete test suite for sentiment analysis in a day, and MFTs for QQP in two hours, both revealing previously unknown, severe bugs.
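To make the two strategies in the paragraph above concrete, the library-free sketch below contrasts (1) authoring labeled cases from scratch by expanding a template with (2) perturbing unlabeled, naturally occurring sentences for an invariance test. The template, fill-in word lists, and typo perturbation are illustrative assumptions, not the paper's actual test suites.

```python
# Library-free sketch: template expansion vs. perturbation of unlabeled data.
from itertools import product

def expand_template(template: str, **slots) -> list:
    """Fill every combination of slot values into the template string."""
    names = list(slots)
    return [template.format(**dict(zip(names, combo)))
            for combo in product(*(slots[n] for n in names))]

# (1) From-scratch cases for an MFT, all intended to carry the label "negative".
cases = expand_template("I {verb} the {noun}.",
                        verb=["hate", "dread"],
                        noun=["food", "service", "flight"])

# (2) Perturbation for an INV test: introduce a typo by swapping two adjacent
# characters in unlabeled sentences; the model's prediction should not change.
def add_typo(text: str, pos: int = 3) -> str:
    return text[:pos] + text[pos + 1] + text[pos] + text[pos + 2:]

unlabeled = ["The crew was friendly and helpful."]
inv_pairs = [(s, add_typo(s)) for s in unlabeled]

print(len(cases), "template cases, e.g.,", cases[0])
print("INV pair:", inv_pairs[0])
```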
Funding
  • Sameer was funded in part by the NSF award #IIS-1756023, and in part by the DARPA MCS program under Contract No N660011924033 with the United States Office of Naval Research
References
  • Saleema Amershi, Andrew Begel, Christian Bird, Rob DeLine, Harald Gall, Ece Kamar, Nachi Nagappan, Besmira Nushi, and Tom Zimmermann. 2019. Software engineering for machine learning: A case study. In International Conference on Software Engineering (ICSE 2019) - Software Engineering in Practice track. IEEE Computer Society.
  • Boris Beizer. 1995. Black-box Testing: Techniques for Functional Testing of Software and Systems. John Wiley & Sons, Inc., New York, NY, USA.
  • Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations.
  • Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.
  • Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Empirical Methods in Natural Language Processing (EMNLP), pages 1161–1166.
  • Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of NAACL-HLT, pages 1875–1885.
  • Najoung Kim, Roma Patel, Adam Poliak, Patrick Xia, Alex Wang, Tom McCoy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, et al. 2019. Probing what different NLP tasks teach machines about function word comprehension. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 235–249.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In International Conference on Computational Linguistics (COLING).
  • Kayur Patel, James Fogarty, James A. Landay, and Beverly Harrison. 2008. Investigating statistical machine learning as a tool for software development. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 667–676. ACM.
  • Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. 2019. Perturbation sensitivity analysis to detect unintended model biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5740–5745, Hong Kong, China. Association for Computational Linguistics.
  • Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400.
  • Marco Tulio Ribeiro, Carlos Guestrin, and Sameer Singh. 2019. Are red roses red? Evaluating consistency of question-answering models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6174–6184.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Association for Computational Linguistics (ACL).
  • Anna Rogers, Shashwath Hosur Ananthakrishna, and Anna Rumshisky. 2018. What’s in your embedding, and how it predicts task performance. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2690–2703, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Barbara Rychalska, Dominika Basaj, Alicja Gosiewska, and Przemysław Biecek. 2019. Models in the wild: On corruption robustness of neural NLP systems. In International Conference on Neural Information Processing, pages 235–247. Springer.
  • Sergio Segura, Gordon Fraser, Ana B. Sanchez, and Antonio Ruiz-Cortés. 2016. A survey on metamorphic testing. IEEE Transactions on Software Engineering, 42(9):805–824.
  • Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
  • Yulia Tsvetkov, Manaal Faruqui, and Chris Dyer. 2016. Correlation-based intrinsic evaluation of word vector representations. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 111–115, Berlin, Germany. Association for Computational Linguistics.
  • Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162.
  • Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3261–3275.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel S. Weld. 2019. Errudite: Scalable, reproducible, and testable error analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 747–763.
Best Paper
Best Paper of ACL, 2020