Semantically Equivalent Adversarial Rules for Debugging NLP Models

ACL, pp. 856-865, 2018.


Abstract:

Complex machine learning models for NLP are often brittle, making different predictions for input instances that are extremely similar semantically. To automatically detect this behavior for individual instances, we present semantically equivalent adversaries (SEAs) – semantic-preserving perturbations that induce changes in the model’s predictions.

Introduction
  • As models for tasks like classification (Joulin et al., 2016), machine comprehension (Rajpurkar et al., 2016; Seo et al., 2017), and visual question answering (Zhu et al., 2016) grow more complex, it becomes increasingly challenging to debug them and to determine whether they are ready for deployment.
  • While held-out accuracy is often useful, it is not sufficient: practitioners consistently overestimate their model’s generalization (Patel et al., 2008), since test data is usually gathered in the same manner as the training and validation data.
  • When deployed, these seemingly accurate models encounter sentences written very differently from the ones in the training data, making them prone to mistakes and fragile with respect to distracting additions (Jia and Liang, 2017).
Highlights
  • As models for tasks like classification (Joulin et al., 2016), machine comprehension (Rajpurkar et al., 2016; Seo et al., 2017), and visual question answering (Zhu et al., 2016) grow more complex, it becomes increasingly challenging to debug them and to determine whether they are ready for deployment.
  • Inspired by adversarial examples for images, we introduce semantically equivalent adversaries (SEAs) – text inputs that are perturbed in semantics-preserving ways but induce changes in a black-box model’s predictions (a sketch of this search follows this list).
  • We introduced semantically equivalent adversaries and semantically equivalent adversarial rules – adversarial examples and rules that preserve semantics, while causing models to make mistakes
  • We presented examples of such bugs discovered in state-of-the-art models for various tasks, and demonstrated via user studies that non-experts and experts alike are much better at detecting local and global bugs in NLP models by using our methods
  • We demonstrated that semantically equivalent adversaries and semantically equivalent adversarial rules can be invaluable tools for debugging NLP models, while also indicating their current limitations and avenues for future work
  • We show examples of semantically equivalent adversarial rules that are rejected by users in Table 7 – the semantic scorer does not sufficiently penalize preposition changes, and is biased towards common terms
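
    The bullets above describe SEAs as black-box adversaries found by paraphrasing an input and keeping rewrites that preserve meaning yet change the prediction. The Python sketch below illustrates that search loop rather than the paper’s exact procedure; predict, paraphrases, semantic_score, and threshold are hypothetical stand-ins for the model under test, the paraphrase generator, the semantic-equivalence scorer, and a tunable cutoff.

      def find_seas(predict, paraphrases, semantic_score, x, threshold):
          """Return paraphrases of x that keep its meaning but flip the model's prediction."""
          original = predict(x)
          seas = []
          for x_prime in paraphrases(x):                    # candidate rewrites of x
              score = semantic_score(x, x_prime)
              if score < threshold:                         # discard rewrites judged non-equivalent
                  continue
              if predict(x_prime) != original:              # prediction changed: adversary found
                  seas.append((x_prime, score))
          return sorted(seas, key=lambda pair: -pair[1])    # most clearly equivalent first

    Running a loop like this over many instances and then extracting the replacement patterns shared across the resulting adversaries is, loosely, how individual SEAs generalize into SEARs.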
Results
  • The authors' experiments indicate that SEAs and SEARs make humans significantly better at detecting impactful bugs – SEARs uncover bugs that cause 3 to 4 times more mistakes than human-generated rules, in much less time.
  • The fact that VQA is fragile to “Which” questions is probably because questions of this form are not in the training data, while another of the discovered bugs probably stems from an American bias in data collection.
  • Changes induced by these four rules flip more than 10% of the predictions in the validation data, which is of critical concern if the model is being evaluated for production.
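
    As a rough illustration of how a flip rate like the 10% figure above could be computed, the sketch below applies a set of search-and-replace rules to a validation set and counts instances whose prediction changes under any rule; predict, the rule patterns, and validation_texts are illustrative placeholders, not the paper’s artifacts.

      import re

      def flip_rate(predict, rules, validation_texts):
          """Fraction of instances whose prediction changes under at least one rule."""
          flipped = 0
          for text in validation_texts:
              original = predict(text)
              for pattern, replacement in rules:
                  perturbed = re.sub(pattern, replacement, text)
                  if perturbed != text and predict(perturbed) != original:
                      flipped += 1
                      break                                 # count each instance at most once
          return flipped / len(validation_texts)

      # Illustrative rules in the spirit of Tables 1-3 (not necessarily the exact ones reported):
      example_rules = [(r"\bWhat is\b", "What's"), (r"\bcolor\b", "colour")]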
Conclusion
  • The authors introduced SEAs and SEARs – adversarial examples and rules that preserve semantics, while causing models to make mistakes.
  • The authors show examples of SEARs that are rejected by users in Table 7 – the semantic scorer does not sufficiently penalize preposition changes, and is biased towards common terms.
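
    The bullet above attributes rejected SEARs to the translation-based semantic scorer. One simple ratio-style score in this spirit compares the paraphrase probability of a rewrite with that of reproducing the original sentence, capped at 1; the sketch below assumes a hypothetical paraphrase_logprob(source, target) function backed by translation models and is not necessarily the exact score used in the paper.

      import math

      def semantic_score(paraphrase_logprob, x, x_prime):
          """Score x_prime against x via a capped ratio of paraphrase probabilities."""
          log_ratio = paraphrase_logprob(x, x_prime) - paraphrase_logprob(x, x)
          return min(1.0, math.exp(log_ratio))              # 1.0 means "as likely as copying x"

    A frequency-driven score of this kind naturally favors very common substitutions, which is consistent with the limitation noted above: preposition swaps and frequent terms can receive high scores even when they subtly change meaning.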
Tables
  • Table1: SEARs for Machine Comprehension
  • Table2: SEARs for Visual QA
  • Table3: SEARs for Sentiment Analysis
  • Table4: Finding Semantically Equivalent Adversaries: how often humans produce semantics-preserving adversaries, compared to our automatically generated adversaries (SEA, left) and our adversaries filtered by humans (HSEA, right). There are four possible outcomes: neither produces a semantically equivalent adversary (i.e. they either do not produce an adversary or the adversary produced is not semantically equivalent), both do, or only one is able to do so
  • Table5: Examples of generated adversaries
  • Table6: Fixing bugs using SEARs: effect of retraining models using SEARs, both on the original validation data and on the sensitivity dataset. Retraining significantly reduces the number of bugs, with statistically insignificant changes to accuracy (a sketch of this augmentation step follows this list)
  • Table7: SEARs for VQA that are rejected by users
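
    Table6 reports that retraining on SEAR-augmented data reduces bugs without a significant accuracy cost. The sketch below shows what that augmentation step might look like: each training instance is duplicated under every rule that changes it, keeping the original label; the rules and the (texts, labels) training set are hypothetical placeholders.

      import re

      def augment_with_sears(texts, labels, rules):
          """Add a rule-perturbed copy of every instance that a rule actually changes."""
          aug_texts, aug_labels = list(texts), list(labels)
          for text, label in zip(texts, labels):
              for pattern, replacement in rules:
                  perturbed = re.sub(pattern, replacement, text)
                  if perturbed != text:                     # only keep genuinely perturbed copies
                      aug_texts.append(perturbed)
                      aug_labels.append(label)              # semantics preserved, so the label carries over
          return aug_texts, aug_labels

    The retrained model then sees both phrasings of each affected instance, which is why the rules stop flipping its predictions while accuracy on the original validation data stays roughly unchanged.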
Related work
  • Previous work on debugging primarily focuses on explaining predictions on validation data in order to uncover bugs (Ribeiro et al., 2016, 2018; Kulesza et al., 2011), or on finding labeling errors (Zhang et al., 2018; Koh and Liang, 2017). Our work is complementary to these techniques, as they provide no mechanism for detecting oversensitivity bugs. Because we generate sentences, we are able to uncover such bugs even when they are not present in the data.

    Adversarial examples for image recognition are typically indistinguishable to the human eye (Szegedy et al., 2014). These are more of a security concern than bugs per se, as images with adversarial noise are not “natural” and are not expected to occur in the real world outside of targeted attacks. Adversaries are usually specific to individual predictions, and even universal adversarial perturbations (Moosavi-Dezfooli et al., 2017) are not natural, semantically meaningful to humans, or actionable. “Imperceptible” adversarial noise does not carry over from images to text, as adding or changing a single word in a sentence can drastically alter its meaning.

    Jia and Liang (2017) recognize that a true analog for detecting oversensitivity would need semantics-preserving perturbations, but do not pursue an automated solution due to the difficulty of paraphrase generation. Their adversaries are whole-sentence concatenations, generated by manually defined rules tailored to reading comprehension, and each adversary is specific to an individual instance. Zhao et al. (2018) generate natural text adversaries by projecting the input data to a latent space using generative adversarial networks (GANs) and searching for adversaries close to the original instance in this latent space. Apart from the challenge of training GANs to generate high-quality text, there is no guarantee that an example close in the latent space is semantically equivalent. Ebrahimi et al. (2018), along with proposing character-level changes that are not semantics-preserving, also propose a heuristic that replaces single words adversarially to preserve semantics. This approach not only depends on having white-box access to the model, but is also unable to generate many adversaries (only ∼1.6% for sentiment analysis, compared to ∼33% for SEAs in Table 4b). Developed concurrently with our work, Iyyer et al. (2018) propose a neural paraphrase model based on back-translated data, which is able to produce paraphrases that have different sentence structures from the original. They use paraphrases to generate adversaries and try to post-process nonsensical outputs, but they do not explicitly reject non-semantics-preserving ones, nor do they try to induce rules from individual adversaries. In any case, their adversaries are also useful for data augmentation, in experiments similar to ours.
Funding
  • This work was supported in part by ONR award #N00014-13-1-0023, in part by NSF award #IIS1756023, and in part by funding from FICO
  • The views expressed are those of the authors and do not reflect the policy or opinion of the funding agencies
References
  • Aayush Bansal, Ali Farhadi, and Devi Parikh. 2014. Towards transparent systems: Semantic characterization of failure modes. In European Conference on Computer Vision (ECCV).
  • Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for NLP. In Annual Meeting of the Association for Computational Linguistics (ACL).
  • Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.
  • Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In North American Chapter of the Association for Computational Linguistics (NAACL).
  • Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Empirical Methods in Natural Language Processing (EMNLP).
  • Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
  • Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL).
  • Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In International Conference on Machine Learning (ICML).
  • Dimitrios Kotzias, Misha Denil, Nando de Freitas, and Padhraic Smyth. 2015. From group to individual labels using deep features. In Knowledge Discovery and Data Mining (KDD).
  • Andreas Krause and Daniel Golovin. 2014. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems.
  • Todd Kulesza, Simone Stumpf, Weng-Keen Wong, Margaret M. Burnett, Stephen Perona, Andrew Jensen Ko, and Ian Oberst. 2011. Why-oriented end-user debugging of naive Bayes text classification. ACM Transactions on Interactive Intelligent Systems (TiiS), 1(1):2:1–2:31.
  • Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning (ICML).
  • Jiwei Li, Will Monroe, and Daniel Jurafsky. 2016. Understanding neural networks through representation erasure. CoRR, abs/1612.08220.
  • Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2017. Paraphrasing revisited with neural machine translation. In European Chapter of the Association for Computational Linguistics (EACL).
  • Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. 2017. Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Annual Meeting of the Association for Computational Linguistics (ACL).
  • Kayur Patel, James Fogarty, James A. Landay, and Beverly Harrison. 2008. Investigating statistical machine learning as a tool for software development. In Human Factors in Computing Systems (CHI).
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Knowledge Discovery and Data Mining (KDD).
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence.
  • Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations (ICLR).
  • Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR).
  • Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In International Conference on Language Resources and Evaluation (LREC).
  • John Wieting and Kevin Gimpel. 2017. Revisiting recurrent networks for paraphrastic sentence embeddings. In Annual Meeting of the Association for Computational Linguistics (ACL).
  • Xuezhou Zhang, Xiaojin Zhu, and Stephen Wright. 2018. Training set debugging using trusted items. In AAAI Conference on Artificial Intelligence.
  • Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In International Conference on Learning Representations (ICLR).
  • Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).