Towards Interpreting BERT for Reading Comprehension Based QA

Sahana Ramnath
Deep Sahni

EMNLP 2020.

Other Links: arxiv.org
Keywords:
Jensen-Shannon Divergence, pre-BERT model, human-level performance, NLP task, integrated gradients

Abstract:

BERT and its variants have achieved state-of-the-art performance in various NLP tasks. Since then, various works have been proposed to analyze the linguistic information captured in BERT. However, the current works do not provide insight into how BERT is able to achieve near human-level performance on the task of Reading Comprehension based Question Answering (RCQA).

Introduction
  • The past decade has witnessed a surge in the development of deep neural network models to solve NLP tasks.
  • The authors observe that the initial layers primarily focus on question words that are present in the passage.
  • Through a focused analysis of quantifier questions, the authors observe that BERT assigns high importance to many words similar to the answer in its later layers.
Highlights
  • The past decade has witnessed a surge in the development of deep neural network models to solve NLP tasks
  • We observe that the initial layers primarily focus on question words that are present in the passage
  • The initial layers should focus on question words, and the later layers should zoom in on contextual words that point to the answer
  • We segregate the passage words into three categories: answer words, supporting words, and query words, where supporting words are the words surrounding the answer within a window size of 5
  • Query words are the question words which appear in the passage
  • We take the top-5 words marked as important in Il for any layer l and count how many words from each of the above categories appear among those top-5 words (a small counting sketch follows this list)
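
As a rough illustration of this counting step (the helper names below are hypothetical; the window size of 5 and the top-5 cut-off follow the text above), one could bucket passage positions and then tally which categories the top-5 important words of a layer fall into:

import numpy as np

def categorize_passage_words(passage_tokens, answer_span, query_tokens, window=5):
    """Label each passage position as 'answer', 'supporting', 'query', or 'other'."""
    start, end = answer_span  # inclusive token indices of the answer span
    query_set = {t.lower() for t in query_tokens}
    labels = []
    for i, tok in enumerate(passage_tokens):
        if start <= i <= end:
            labels.append("answer")
        elif start - window <= i <= end + window:
            labels.append("supporting")   # words within a window of 5 around the answer
        elif tok.lower() in query_set:
            labels.append("query")        # question words that appear in the passage
        else:
            labels.append("other")
    return labels

def top_k_category_counts(importance, labels, k=5):
    """Count how many of the k most important passage words fall in each category."""
    top_idx = np.argsort(importance)[::-1][:k]
    counts = {"answer": 0, "supporting": 0, "query": 0, "other": 0}
    for i in top_idx:
        counts[labels[i]] += 1
    return counts

Here importance would be a layer's distribution Il over passage words; averaging such counts over dev-set samples would yield per-layer statistics in the style of Tables 1 and 2.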
Results
  • The authors focus on analyzing BERT’s layers for RCQA to understand their QA-specific roles and their behavior on potentially confusing quantifier questions.
  • The authors aim to understand each BERT layer’s functionality for the RCQA task; the authors want to identify the passage words that are of primary importance at each layer for the answer.
  • The authors calculate the integrated gradients at each layer IGl(xi) for all passage words wi using Algorithm 1.
  • The authors compute importance scores for each wi by taking the Euclidean norm of IGl(wi) and normalizing it to get a probability distribution Il over the passage words.
  • The authors quantify and visualize a layer's function as its distribution of importance over the passage words, Il. To compare any two layers, they compute the Jensen-Shannon Divergence between the corresponding importance distributions.
  • Based on the defined layers’ functionality Il, the authors try to identify which layers focus more on the question, the context around the answer, etc.
  • The authors see that the initial layers 0, 1, and 2 try to connect the passage and the query (‘relegated’, ‘because’, ‘Polonia’ receive high importance scores).
  • (i) In later layers, the question words separate from the answer and the supporting words; (ii) across all 12 layers, the embeddings for ‘four’ and ‘eight’ remain very close together, which could have led the model to make a wrong prediction (a rough probe of such layer-wise distances is sketched after this list).
  • For the example in Table 4, the authors observed that in its later layers BERT gives high importance to the confusing quantifier words ‘eight’ and ‘four’; this shows that BERT, in its later layers, distributes its focus over confusing words.
  • This behavior is very different from the roles a layer might be assumed to take when answering the question, since one would expect such words to be considered in the initial rather than the final layers.
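
A rough probe of this embedding-proximity observation (not the authors' exact analysis) can be sketched with the Hugging Face transformers library: track the cosine similarity between the hidden states of two confusable quantities, e.g. ‘eight’ and ‘four’, across all layers. The toy passage below is purely illustrative, and the base (non-fine-tuned) checkpoint is used only for brevity; the paper analyzes a BERT model fine-tuned for RCQA.

import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# Illustrative toy passage containing two confusable quantities.
passage = "Polonia was relegated after finishing with eight wins and four losses."
inputs = tokenizer(passage, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # tuple: embedding output + 12 layers

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
i, j = tokens.index("eight"), tokens.index("four")
for layer, h in enumerate(hidden_states):
    sim = torch.nn.functional.cosine_similarity(h[0, i], h[0, j], dim=0)
    print(f"layer {layer:2d}: cos(eight, four) = {sim.item():.3f}")

If the two quantities stay highly similar through the later layers, that matches the observation above that their representations remain close and could confuse the answer prediction.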
Conclusion
  • The authors present results and analysis to show that BERT learns some form of passage-query interaction in its initial layers before arriving at the answer.
  • The authors found the following observations interesting and worth probing further: (i) why do the question-word representations move away from the contextual and answer representations in later layers?
  • (ii) If the focus on confusing words increases from the initial to the later layers, how does BERT still achieve high accuracy?
Tables
  • Table1: Semantic statistics of top-5 words - SQuAD
  • Table2: Semantic statistics of top-5 words - DuoRC
  • Table3: Heatmap visualisation of the Il distribution over BERT’s first and last 3 layers, for a sample from SQuAD. The initial layers focus on question-specific words and the later layers focus on supporting words that lead to the answer
  • Table4: Sample from the dev-split of SQuAD. Blue shows the answer, purple shows the contextual passage words and green shows the query
  • Table5: Part-of-Speech statistics of top-5 words - SQuAD
  • Table6: Part-of-Speech statistics of top-5 words - DuoRC
Funding
  • We thank Google for supporting Preksha Nema’s contribution through the Google Ph.D. Fellowship program.
Study subjects and analysis
uniform samples: 50
We calculate the integrated gradients at each layer IGl(xi) for all passage words wi using Algorithm 1. We approximate the integral across 50 uniform samples of α ∈ [0, 1]. We then compute importance scores for each wi by taking the Euclidean norm of IGl(wi) and normalizing it to get a probability distribution Il over the passage words.
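
A minimal sketch of this layer-wise Integrated Gradients computation, under stated assumptions: a publicly available SQuAD-fine-tuned checkpoint stands in for the authors' own fine-tuned model, an all-zero baseline is used at layer l, and the model's predicted span score is the attribution target. The helper run_from_layer is hypothetical, not part of the transformers API.

import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # stand-in checkpoint
tokenizer = BertTokenizerFast.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name).eval()
for p in model.parameters():          # we only need gradients w.r.t. hidden states
    p.requires_grad_(False)

def run_from_layer(hidden, ext_mask, l):
    """Re-run BERT from layer l (exclusive) to the QA head, given layer-l hidden states."""
    for block in model.bert.encoder.layer[l + 1:]:
        hidden = block(hidden, attention_mask=ext_mask)[0]
    logits = model.qa_outputs(hidden)                       # (batch, seq, 2)
    start_logits, end_logits = logits.split(1, dim=-1)
    return start_logits.squeeze(-1), end_logits.squeeze(-1)

def layer_importance(question, passage, l, steps=50):
    enc = tokenizer(question, passage, return_tensors="pt")
    ext_mask = model.get_extended_attention_mask(
        enc["attention_mask"], enc["input_ids"].shape, enc["input_ids"].device)
    with torch.no_grad():
        hidden = model.bert.embeddings(input_ids=enc["input_ids"],
                                       token_type_ids=enc["token_type_ids"])
        for block in model.bert.encoder.layer[:l + 1]:      # forward up to layer l
            hidden = block(hidden, attention_mask=ext_mask)[0]
        s_log, e_log = run_from_layer(hidden, ext_mask, l)
        s_idx, e_idx = int(s_log[0].argmax()), int(e_log[0].argmax())  # predicted span

    baseline = torch.zeros_like(hidden)                     # assumed all-zero baseline
    grads = torch.zeros_like(hidden)
    for alpha in torch.linspace(0.0, 1.0, steps):           # 50 uniform samples of alpha
        point = (baseline + alpha * (hidden - baseline)).requires_grad_(True)
        s_log, e_log = run_from_layer(point, ext_mask, l)
        score = s_log[0, s_idx] + e_log[0, e_idx]           # score of the predicted span
        score.backward()
        grads += point.grad / steps
    ig = (hidden - baseline) * grads                        # IGl(xi) per token
    imp = ig.norm(dim=-1).squeeze(0)                        # Euclidean norm per word
    return imp / imp.sum()                                  # normalized distribution Il

Restricting the returned distribution to the passage tokens (rather than the question and special tokens) would then match the Il defined above.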

samples: 1000
This further means the two layers consider different words as salient. We visualize heatmaps for the dev-splits of SQuAD (Figures 1a, 1b) and DuoRC (Figures 1c, 1d), averaging over 1000 samples in each case. We analyze the distribution in two parts: (i) we retain only the top-k scores in each layer and zero out the rest, which gives the distribution's head; (ii) we zero out the top-k scores in each layer and retain the rest, which gives the distribution's tail. A small sketch of this comparison follows.
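
A small sketch of this comparison (hypothetical function names): the head and tail of each layer's distribution Il are formed by keeping or zeroing the top-k scores, and any two layers are compared with the Jensen-Shannon Divergence named in the keyword list above. Averaging the resulting matrices over the 1000 dev samples would give heatmaps of the kind described.

import numpy as np
from scipy.spatial.distance import jensenshannon

def head(dist, k=5):
    """Keep only the top-k importance scores, zero the rest, renormalize."""
    out = np.zeros_like(dist)
    idx = np.argsort(dist)[::-1][:k]
    out[idx] = dist[idx]
    return out / out.sum()

def tail(dist, k=5):
    """Zero the top-k importance scores, keep the rest, renormalize."""
    out = dist.copy()
    idx = np.argsort(dist)[::-1][:k]
    out[idx] = 0.0
    s = out.sum()
    return out / s if s > 0 else out

def layer_jsd_matrix(importances, part=head, k=5):
    """importances: list of per-layer distributions Il over the same passage words."""
    L = len(importances)
    mat = np.zeros((L, L))
    for a in range(L):
        for b in range(L):
            # scipy returns the JS distance (sqrt of the divergence); square it for JSD
            mat[a, b] = jensenshannon(part(importances[a], k), part(importances[b], k)) ** 2
    return mat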

References
  • Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10.
  • Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. CoRR, abs/1906.04341.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
  • Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. 2016. Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.
  • Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. CoRR, abs/1902.10186.
  • Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. CoRR, abs/1707.07328.
  • Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, abs/1705.03551.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv e-prints.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. 2018. Did the model understand the question? CoRR, abs/1805.05492.
  • Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. CoRR, abs/1611.09268.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. arXiv preprint arXiv:1808.08949.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.
  • Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. CoRR, abs/1804.07927.
  • Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.
  • Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? arXiv preprint arXiv:1906.03731.
  • Chenglei Si, Shuohang Wang, Min-Yen Kan, and Jing Jiang. 2019. What does BERT learn from multiple-choice reading comprehension datasets? arXiv preprint arXiv:1910.12391.
  • Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. CoRR, abs/1703.01365.
  • Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.
  • Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. arXiv preprint arXiv:1908.04626.
  • Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. CoRR, abs/1611.01604.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.