# Interpretation of NLP models through input marginalization

EMNLP 2020.

Keywords:

deep learning, National Research Foundation of Korea, input marginalization, black box, masked language modeling

Abstract:

To demystify the "black box" property of deep neural networks for natural language processing (NLP), several methods have been proposed to interpret their predictions by measuring the change in prediction probability after erasing each token of an input. Since existing methods replace each token with a predefined value (i.e., zero), the...

Introduction

- The advent of deep learning has greatly improved the performance of natural language processing (NLP) models.
- Research in computer vision aims to interpret a target model by measuring attribution scores, i.e., how much each pixel in an input image contributes to the final prediction (Simonyan et al, 2013; Arras et al, 2017; Zeiler and Fergus, 2014; Lundberg and Lee, 2017).
- Since a pixel of an image corresponds to a token in a sentence, the attribution score of each token can provide an insight into the NLP model’s internal reasoning process.
- A straightforward approach is to ask, “How would the model’s reaction change if each token were not there?”

Highlights

- The advent of deep learning has greatly improved the performance of natural language processing (NLP) models
- To show the model-agnostic and task-agnostic property of our method, we present interpretations of several types of deep neural networks (DNNs) trained for two tasks: sentiment analysis and natural language inference
- Note that the architectures of long short-term memory (LSTM) used for SST-2 and Stanford natural language inference (SNLI) are distinct
- We focused on the OOD problem arising from the widely used zero erasure scheme, which results in misleading interpretation
- To the best of our knowledge, the OOD problem has not previously been raised in interpreting NLP models, nor has any attempt been made to resolve it
- The scope of this study was primarily focused on interpreting DNNs for sentiment analysis and natural language inference

Methods

- The authors measure the attribution score using the weight of evidence and marginalize over all possible candidate tokens using the MLM of BERT.
- The authors extend the method to multi-token cases and introduce adaptively truncated marginalization for an efficient computation.
- Measurement of model output difference: to measure the changes in the model output, the authors adopt the widely used weight of evidence (WoE) (Robnik-Sikonja and Kononenko, 2008), which is a log odds difference of prediction probabilities.
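The two steps above (marginalizing over candidate tokens, then comparing log odds) can be illustrated with a small sketch. Everything concrete in it is an assumption made for illustration: the candidate likelihoods stand in for the output of BERT's MLM, the per-candidate class probabilities stand in for the target model, and base-2 logarithms are used only as a convention (the base merely rescales the scores). It is not the authors' implementation.

```python
import math
from typing import Callable, Dict


def log_odds(p: float) -> float:
    """Base-2 log odds of a probability strictly between 0 and 1."""
    return math.log2(p / (1.0 - p))


def weight_of_evidence(p_original: float, p_marginalized: float) -> float:
    """Log-odds difference between the original and marginalized predictions."""
    return log_odds(p_original) - log_odds(p_marginalized)


def marginalized_probability(
    candidates: Dict[str, float],
    predict: Callable[[str], float],
) -> float:
    """Average the class probability over candidate tokens.

    candidates maps each replacement token to its likelihood at the erased
    position (assumed to sum to 1); predict returns the target model's class
    probability when that token fills the position.
    """
    return sum(lik * predict(tok) for tok, lik in candidates.items())


# Hypothetical numbers for the token "great" in "It's clearly, great fun".
toy_mlm = {"great": 0.5, "good": 0.3, "no": 0.2}            # assumed MLM likelihoods
toy_classifier = {"great": 0.95, "good": 0.90, "no": 0.20}  # assumed P(positive)

p_original = toy_classifier["great"]
p_marginalized = marginalized_probability(toy_mlm, toy_classifier.__getitem__)
score = weight_of_evidence(p_original, p_marginalized)
print(f"marginalized P(positive) = {p_marginalized:.3f}, attribution = {score:.3f}")
```

A positive score means the prediction is more confident with the original token than it is on average over plausible replacements, so the token supports the predicted class.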

Results

- SST-2: For sentiment analysis, the authors used the Stanford Sentiment Treebank binary classification corpus (SST-2) (Socher et al, 2013), which is a set of movie reviews labeled as positive or negative.
- SNLI: The authors trained a bidirectional LSTM for SNLI.
- After training the target models, the authors interpreted their predictions through the proposed input marginalization.
- The authors used pre-trained BERT (Wolf et al, 2019) for likelihood modeling, and σ was set to 10⁻⁵.
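The role of the likelihood threshold σ can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that candidates whose likelihood does not exceed σ are skipped, and it renormalizes over the kept candidates, which is a choice of this sketch. The function name `truncated_marginal` and all numbers are hypothetical, and σ is set far larger than 10⁻⁵ so that the effect is visible on a four-token toy vocabulary.

```python
from typing import Callable, Dict, Tuple


def truncated_marginal(
    candidates: Dict[str, float],
    predict: Callable[[str], float],
    sigma: float,
) -> Tuple[float, int]:
    """Marginalize only over candidates whose likelihood exceeds sigma.

    Returns the renormalized marginal probability and the number of
    candidates that were actually evaluated by the target model.
    """
    kept = {tok: lik for tok, lik in candidates.items() if lik > sigma}
    if not kept:
        raise ValueError("sigma removed every candidate")
    mass = sum(kept.values())
    value = sum(lik * predict(tok) for tok, lik in kept.items()) / mass
    return value, len(kept)


# Hypothetical likelihoods and class probabilities for one erased position.
toy_mlm = {"great": 0.60, "good": 0.35, "dull": 0.04, "zzz": 0.01}
toy_classifier = {"great": 0.9, "good": 0.8, "dull": 0.1, "zzz": 0.5}

value, n_evaluated = truncated_marginal(toy_mlm, toy_classifier.__getitem__, sigma=0.05)
print(f"evaluated {n_evaluated} of {len(toy_mlm)} candidates, marginal = {value:.4f}")
```

The saving comes from the number of target-model forward passes: only the kept candidates are evaluated, while the low-likelihood tail contributes little to the weighted average.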

Conclusion

- Interpretability is becoming more important owing to the increase in deep learning in NLP.
- The authors focused on the OOD problem arising from the widely used zero erasure scheme, which results in misleading interpretation.
- To the best of the authors' knowledge, the OOD problem has not previously been raised in interpreting NLP models, nor has any attempt been made to resolve it.
- As experimentally analyzed, the interpretation result of the method is affected by the likelihood modeling performance.
- Even more faithful interpretation can therefore be expected if the likelihood modeling performance improves.


Tables

- Table 1: Test accuracy of the target models
- Table 2: Comparison of AUCrep with the existing erasure scheme (the lower the better)
- Table 3: The Pearson correlation with full marginalization and the average number of marginalizations under various thresholds. σ: likelihood threshold, n: marginalization number threshold for fixed truncation

Related work

- Interpretation of NLP models

Model-aware interpretation methods for DNNs use model information such as gradients. Saliency map (Simonyan et al, 2013) interprets an image classifier by computing the gradient of a target class logit score with respect to each input pixel. Because a token index, unlike an image pixel value, is not ordinal, the gradient with respect to a token index is meaningless. Hence, Li et al (2016) computed the gradient in an embedding space, and Arras et al (2017) distributed the class score to input embedding dimensions through layer-wise relevance propagation. Both methods sum the scores of the embedding dimensions to obtain the attribution score of a token. Because each per-dimension score can be negative or positive, the scores may cancel when summed, so the attributed contribution of a token may become zero even if the token does contribute to the prediction.
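The cancellation issue described above is easy to see with made-up numbers. The sketch below uses hypothetical per-dimension relevance scores for a single token; it does not reproduce any particular method's scores.

```python
from typing import List


def summed_token_score(per_dimension_scores: List[float]) -> float:
    """Token attribution obtained by summing signed per-dimension scores."""
    return sum(per_dimension_scores)


# Hypothetical relevance scores over five embedding dimensions of one token.
scores = [0.8, -0.5, -0.3, 0.4, -0.4]

token_score = summed_token_score(scores)
total_magnitude = sum(abs(s) for s in scores)
print(f"summed score = {token_score:.2f}, total magnitude = {total_magnitude:.2f}")
```

Every dimension carries a non-zero score, yet the signed sum is numerically zero, so a token that influences the prediction would be reported as irrelevant.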

Funding

- This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) [2018R1A2B3001628], the Brain Korea 21 Plus Project in 2020, and Hyundai Motor Company

Study subjects and analysis

For simplicity, we merged very positive and positive, and very negative and negative, into positive (pos) and negative (neg), respectively, such that each token is given one of three tags. If a sentence is correctly classified as positive, three cases exist: i) pos-tagged word: contributes positively and significantly to the prediction; ii) neut-tagged word: does not contribute much to the prediction; iii) neg-tagged word: contributes negatively to the prediction, where neut denotes neutral. To assess whether our method assigns high scores to case i), we measured the intersection of tokens (IoT) between pos-tagged tokens and highly attributed tokens in one sentence, motivated by intersection over union (IoU), which is a widely used interpretation evaluation metric in the vision field (Chang et al, 2018).
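One way to compute such an overlap is sketched below. The exact definition is not given in this summary, so the choices here are assumptions of the sketch: "highly attributed" is taken to mean the top-k tokens by attribution score, k defaults to the number of pos-tagged tokens, and the overlap is normalized by the number of pos-tagged tokens. The tags and scores are invented for illustration.

```python
from typing import Dict, Optional, Set


def intersection_of_tokens(
    pos_tagged: Set[int],
    attribution: Dict[int, float],
    k: Optional[int] = None,
) -> float:
    """Fraction of pos-tagged token positions found among the top-k attributed positions."""
    if not pos_tagged:
        return 0.0
    k = len(pos_tagged) if k is None else k
    ranked = sorted(attribution, key=attribution.get, reverse=True)
    top_k = set(ranked[:k])
    return len(pos_tagged & top_k) / len(pos_tagged)


# Positions in the tokenized sentence: 0 "It's", 1 "clearly", 2 ",", 3 "great", 4 "fun".
pos_positions = {3, 4}                                  # hypothetical pos tags
good_scores = {0: 0.1, 1: 0.3, 2: 0.0, 3: 2.4, 4: 1.1}  # hypothetical attributions
poor_scores = {0: 0.1, 1: 1.5, 2: 0.0, 3: 2.4, 4: 1.1}

iot_good = intersection_of_tokens(pos_positions, good_scores)
iot_poor = intersection_of_tokens(pos_positions, poor_scores)
print(iot_good, iot_poor)
```

Under these assumptions, an interpretation that ranks both pos-tagged tokens highest scores 1.0, while one that ranks a neutral token above "fun" scores 0.5.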

Reference

- Leila Arras, Gregoire Montavon, Klaus-Robert Muller, and Wojciech Samek. 2017. Explaining recurrent neural network predictions in sentiment analysis. EMNLP 2017, page 159.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Chun-Hao Chang, Elliot Creager, Anna Goldenberg, and David Duvenaud. 2018. Explaining image classifiers by counterfactual generation.
- Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2019. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
- Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3719–3728.
- Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. 2018. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), pages 80–89. IEEE.
- Dan Hendrycks and Kevin Gimpel. 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
- Sarthak Jain and Byron C Wallace. 2019. Attention is not explanation. arXiv preprint arXiv:1902.10186.
- Xisen Jin, Junyi Du, Zhongyu Wei, Xiangyang Xue, and Xiang Ren. 2019. Towards hierarchical importance attribution: Explaining compositional semantics for neural sequence models. arXiv preprint arXiv:1911.06194.
- Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2015. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066.
- Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pages 4765–4774.
- Vitali Petsiuk, Abir Das, and Kate Saenko. 2018. RISE: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421.
- Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. 2019. Perturbation sensitivity analysis to detect unintended model biases. arXiv preprint arXiv:1910.04210.
- Marko Robnik-Sikonja and Igor Kononenko. 2008. Explaining classifications for individual instances. IEEE Transactions on Knowledge and Data Engineering, 20(5):589–600.
- Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
- Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. arXiv preprint arXiv:1908.04626.
- Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
- Jihun Yi, Eunji Kim, Siwon Kim, and Sungroh Yoon. 2020. Information-theoretic visual explanation for black-box classifiers.
- Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
- Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer.
- Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253.
- Luisa M Zintgraf, Taco S Cohen, Tameem Adel, and Max Welling. 2017. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595.
