VisBERT: Hidden-State Visualizations for Transformers

Betty van Aken
Benjamin Winter
Felix A. Gers

WWW '20: The Web Conference 2020, Taipei, Taiwan, April 2020, pp. 207-211.

DOI: https://doi.org/10.1145/3366424.3383542
VisBERT establishes a novel method to analyze the behavior of Bidirectional Encoder Representations from Transformers models, in particular regarding the Question Answering task

Abstract:

Explainability and interpretability are two important concepts, the absence of which can and should impede the application of well-performing neural networks to real-world problems. At the same time, they are difficult to incorporate into the large, black-box models that achieve state-of-the-art results in a multitude of NLP tasks. Bidirectional Encoder Representations from Transformers (BERT) […]

Introduction
  • Understanding black-box models is an increasingly prominent area of research. While the performance of neural networks has been steadily improving in nearly every domain, the ability to understand how they work, and how they come to the conclusions they draw is only improving slowly.
  • Instead of the attention values, the authors follow the work in [11] and visualize the hidden states between each BERT layer, and with that the token representations, as they are transformed through the network.
  • VisBERT, an interactive web tool for interpretable visualization of hidden states within BERT models fine-tuned on Question Answering.
Highlights
  • Understanding black-box models is an increasingly prominent area of research
  • In order for large neural networks to be confidently deployed in safety-critical applications, features like transparency, interpretability and explainability are paramount
  • One such class of black-box models are Transformer models, Bidirectional Encoder Representations from Transformers (BERT) in particular. These models have become the state-of-the-art for many different NLP tasks in recent months
  • VisBERT, an interactive web tool for interpretable visualization of hidden states within BERT models fine-tuned on Question Answering
  • For each task we provide a separate fine-tuned BERT model
  • VisBERT establishes a novel method to analyze the behavior of BERT models, in particular regarding the Question Answering task
Results
  • Visualizations of the inference process of unseen examples from three diverse Question Answering datasets, including three BERT models fine-tuned on these sets.
  • The presented tool allows users to test the abilities and shortcomings of their own Question Answering models on arbitrary samples.
  • Each encoder block includes a multi-headed self-attention module, which transforms each token using the entire input context, normalization, and a Feed-Forward network at the end, which outputs the token representations used by the subsequent layer.
  • The authors can observe the changing token relations that the model forms throughout the inference process.
  • To that end the authors use the hidden states after each Transformer encoder block, which contain one vector per token with a dimensionality of 768 (BERT-base) or 1024 (BERT-large).
  • The authors further categorize the tokens based on their affiliation with the question, the supporting facts, or the predicted answer in order to facilitate interpretability.
  • In addition to the included datasets, the tool can be extended to other Question Answering tasks.
  • By using the layer-slider on top of the graph, the user is able to go through all layers of the model and observe the changes within the token representations.
  • This allows users to find out which QA model (SQuAD, HotpotQA or bAbI) fits a specific question type best and produces the right result.
  • A user can add distracting facts to the context and check whether the model is still able to follow the same inference path.
  • The authors' tool allows users to observe the resulting changes in the prediction and within the hidden states of a model.
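The encoder-block computation described above can be sketched in minimal form. The following is an illustrative single-head sketch with random, untrained weights (not BERT's actual multi-head parameters or learned values, and layer-norm placement is simplified); it only shows where the per-layer hidden states that VisBERT plots come from:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 768          # hidden size of BERT-base (1024 for BERT-large)
seq_len = 6            # number of input tokens in this toy example

def layer_norm(x, eps=1e-12):
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_block(h, params):
    """One simplified (single-head) encoder block:
    self-attention, residual + norm, feed-forward, residual + norm."""
    Wq, Wk, Wv, W1, W2 = params
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    # every token is transformed using the entire input context
    attn = softmax(q @ k.T / np.sqrt(d_model)) @ v
    h = layer_norm(h + attn)
    ffn = np.maximum(h @ W1, 0) @ W2     # two-layer feed-forward network
    return layer_norm(h + ffn)           # token representations for the next layer

def random_params():
    s = 0.02
    return [rng.normal(0, s, (d_model, d_model)) for _ in range(3)] + \
           [rng.normal(0, s, (d_model, 4 * d_model)),
            rng.normal(0, s, (4 * d_model, d_model))]

h = rng.normal(size=(seq_len, d_model))  # token embeddings
hidden_states = [h]                      # what VisBERT visualizes: h after every block
for _ in range(12):                      # BERT-base stacks 12 encoder blocks
    h = encoder_block(h, random_params())
    hidden_states.append(h)

print(len(hidden_states), hidden_states[-1].shape)  # 13 (6, 768)
```

The list `hidden_states` corresponds to the per-layer token representations mentioned in the bullets: one (seq_len x 768) matrix for the embeddings plus one per encoder block.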
Conclusion
  • VisBERT establishes a novel method to analyze the behavior of BERT models, in particular regarding the Question Answering task.
  • The authors establish this behaviour on three diverse Question Answering datasets and make all three models available for users to make their own analyses on their own data, as well as the code to reproduce this visualization.
  • The authors' tool can be extended to other BERT models, fine-tuned on different QA datasets or even other NLP tasks entirely, and to other Transformer based models like GPT-2 [8].
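As a rough illustration of the visualization step, the sketch below projects one layer's token vectors to 2D with PCA — one of several dimensionality-reduction choices (the paper's references also cover ICA, t-SNE, and UMAP). The tokens, category labels, and vectors here are synthetic stand-ins, not real BERT outputs:

```python
import numpy as np

rng = np.random.default_rng(1)

def pca_2d(X):
    """Project token vectors onto the top-2 principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered matrix; rows of Vt are principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Synthetic stand-ins for one layer's hidden states (seq_len x 768),
# plus each token's affiliation (question / context / predicted answer),
# mirroring the categorization VisBERT uses to color its plot.
tokens = ["what", "is", "visbert", "[SEP]", "a", "tool", "for", "bert"]
category = ["question"] * 3 + ["separator"] + ["context"] * 3 + ["answer"]
H = rng.normal(size=(len(tokens), 768))

coords = pca_2d(H)                       # (8, 2): one 2D point per token
for tok, cat, (x, y) in zip(tokens, category, coords):
    print(f"{tok:10s} {cat:10s} ({x:+.2f}, {y:+.2f})")
```

Repeating this projection for every entry of the per-layer hidden states yields one scatter plot per layer, which is what the layer slider steps through.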
Funding
  • Our work is funded by the European Union's Horizon 2020 research and innovation programme under grant agreement 732328 (FashionBrain), by the German Federal Ministry of Education and Research (BMBF) under grant agreement 01UG1735BX (NOHATE), and by the German Federal Ministry of Economic Affairs and Energy (BMWi) under grant agreements 01MD19013D (Smart-MD), 01MD19003E (PLASS) and 01MK2008D (Servicemeister).
Study subjects and analysis
public QA datasets: 3
Besides that, the task often requires multiple inference steps, especially in multi-hop scenarios, which allows us to gain further insights about BERT’s reasoning process. We use the three public QA datasets SQuAD [9], HotpotQA [16] and bAbI QA [15] to show the tool’s applicability on three diverse QA tasks including multi-hop reasoning cases. Apart from that, the principle of VisBERT can be easily extended to other up- or downstream NLP tasks

diverse Question Answering datasets: 3
• VisBERT, an interactive web tool for interpretable visualization of hidden states within BERT models fine-tuned on Question Answering. • Visualizations of the inference process of unseen examples from three diverse Question Answering datasets, including three BERT (base and large) models fine-tuned on these sets. • Identification of four stages of inference that can be observed in all analysed Question Answering tasks

diverse Question Answering datasets: 3
Additionally, VisBERT reveals four phases in BERT’s transformations that are common to all of the datasets we examined and that mirror the traditional NLP pipeline, cf. [10]. We establish this behaviour on three diverse Question Answering datasets and make all three models available for users to make their own analyses on their own data, as well as the code to reproduce this visualization.

Reference
  • Pierre Comon. 1994. Independent component analysis, a new concept? Signal Processing 36 (1994).
  • Karl Pearson F.R.S. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (1901).
  • Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In NAACL '19.
  • Robin Jia and Percy Liang. 2017. Adversarial Examples for Evaluating Reading Comprehension Systems. In EMNLP '17.
  • Stuart P. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory (1982).
  • L. McInnes, J. Healy, and J. Melville. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints (2018). arXiv:1802.03426.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In ICLR '13 Workshop Track.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP '16.
  • Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT Rediscovers the Classical NLP Pipeline. In ACL '19.
  • Betty van Aken, Benjamin Winter, Alexander Löser, and Felix A. Gers. 2019. How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations. In CIKM '19.
  • Laurens van der Maaten. 2009. Learning a Parametric Embedding by Preserving Local Structure. In AISTATS '09.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS '17.
  • Jesse Vig. 2019. A Multiscale Visualization of Attention in the Transformer Model. ACL '19 System Demonstrations (2019).
  • Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2016. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. In ICLR '16.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In EMNLP '18.
  • Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. 2017. Adversarial Examples: Attacks and Defenses for Deep Learning. arXiv preprint arXiv:1712.07107 (2017).