Towards Interpreting BERT for Reading Comprehension Based QA
EMNLP 2020, pp.3236-3242, (2020)
BERT and its variants have achieved state-of-the-art performance in various NLP tasks. Since then, various works have been proposed to analyze the linguistic information being captured in BERT. However, the current works do not provide an insight into how BERT is able to achieve near human-level performance on the task of Reading Comprehension…
- The past decade has witnessed a surge in the development of deep neural network models to solve NLP tasks.
- The authors observe that the initial layers primarily focus on question words that are present in the passage.
- Through a focused analysis of quantifier questions, the authors observe that BERT pays importance to many words similar to the answer in later layers.
- The initial layers should focus on question words, and the latter should zoom in on contextual words that point to the answer
- We segregate the passage words into three categories: answer words, supporting words, and query words, where supporting words are the words surrounding the answer within a window size of 5
- Query words are the question words which appear in the passage
- We take the top-5 words marked as important in Il for any layer l and compute how many words from each of the above-defined categories appear in the top-5 words
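The categorization and top-5 counting described in the bullets above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the whitespace tokenization, and the category-dictionary interface are our own assumptions.

```python
import numpy as np

def categorize_passage_words(passage_tokens, answer_span, question_tokens, window=5):
    """Split passage positions into answer, supporting, and query words.

    answer_span: (start, end) token indices of the answer (end exclusive).
    Supporting words are the passage words within `window` tokens of the
    answer; query words are question words that also appear in the passage.
    """
    start, end = answer_span
    answer = set(range(start, end))
    supporting = set(range(max(0, start - window),
                           min(len(passage_tokens), end + window))) - answer
    qset = {q.lower() for q in question_tokens}
    query = {i for i, w in enumerate(passage_tokens)
             if i not in answer and i not in supporting and w.lower() in qset}
    return answer, supporting, query

def top5_category_counts(importance, categories):
    """Count how many of the top-5 most important passage positions
    fall into each category (answer / supporting / query)."""
    top5 = set(np.argsort(importance)[-5:])
    return {name: len(top5 & idxs) for name, idxs in categories.items()}
```

Given a layer's importance distribution over the passage, this yields per-category counts that can be aggregated across samples into the kind of statistics reported in Tables 1 and 2.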
- The authors focus on analyzing BERT’s layers for RCQA to understand their QA-specific roles and their behavior on potentially confusing quantifier questions.
- The authors aim to understand each BERT layer’s functionality for the RCQA task; they want to identify the passage words that are of primary importance at each layer for predicting the answer.
- The authors calculate the integrated gradients at each layer IGl(xi) for all passage words wi using Algorithm 1.
- The authors compute importance scores for each wi by taking the Euclidean norm of IGl(wi) and normalizing it to get a probability distribution Il over the passage words.
- The authors quantify and visualize a layer’s function as its distribution of importance over the passage words Il. To compute the similarity between any two layers, they compare the corresponding importance distributions.
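One standard way to compare two importance distributions such as Il is the Jensen-Shannon divergence; the sketch below is a plain NumPy implementation shown for illustration, and the choice of this particular measure here is our assumption.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two importance distributions.
    Symmetric and bounded by log 2; 0 iff the distributions are identical."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # KL divergence with a small eps to avoid log(0)
        return float(np.sum(a * np.log((a + eps) / (b + eps))))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A high divergence between two layers' distributions indicates that the two layers consider different passage words salient.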
- Based on the defined layers’ functionality Il, the authors try to identify which layers focus more on the question, the context around the answer, etc.
- We see that the initial layers 0,1 and 2 are trying to connect the passage and the query (‘relegated’, ‘because’, ‘Polonia’ get high importance scores).
- (i) In later layers, the question words separate from the answer and the supporting words; (ii) across all 12 layers, embeddings for ‘four’ and ‘eight’ remain very close together, which could have led to the model making a wrong prediction.
- For the example in Table 4, the authors observed that in its later layers, BERT gives high importance to confusing words such as ‘eight’ and ‘four’.
- This shows that BERT, in its later layers, distributes its focus over confusing words.
- This behavior is very different from the assumed roles a layer might take to answer the question, as it is expected that such words were considered in the initial rather than final layers.
- The authors present results and analysis to show that BERT is learning some form of passage-query interaction in its initial layers before arriving at the answer.
- The authors found the following observations interesting and with a potential to be probed further: (i) why do the question word representations move away from contextual and answer representation in later layers?
- (ii) If the focus on confusing words increases from the initial to later layers, how does BERT still have a high accuracy?
- Table 1: Semantic statistics of top-5 words - SQuAD
- Table 2: Semantic statistics of top-5 words - DuoRC
- Table 3: Heatmap visualisation of the Il distribution over BERT’s first and last 3 layers, for a sample from SQuAD. The initial layers focus on question-specific words and the latter layers focus on supporting words that lead to the answer
- Table 4: Sample from the dev-split of SQuAD. Blue shows the answer, purple shows the contextual passage words, and green shows the query
- Table 5: Part-of-Speech statistics of top-5 words - SQuAD
- Table 6: Part-of-Speech statistics of top-5 words - DuoRC
- In the past few years, various large-scale datasets have been proposed for the RCQA task (Nguyen et al., 2016; Joshi et al., 2017; Rajpurkar et al., 2016; Saha et al., 2018), which have led to various deep neural-network (NN) based architectures such as Seo et al. (2016) and Dhingra et al. (2016). Additionally, with complex pretraining, models such as Liu et al. (2019), Lan et al. (2019), and Devlin et al. (2018) are very close to human-level performance. Due to the large number of parameters and the nonlinearity of deep NN models, the answer to the question “how did the model arrive at the prediction?” is not known; hence, they are termed black-box models. Motivated by this question, there have also been many works that analyze the interpretability of deep NN models on NLP tasks; many of them analyze models based on in-built attention mechanisms (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019). Further, various attribution methods such as Bach et al. (2015) and Sundararajan et al. (2017) have been proposed to analyze them. Tenney et al. (2019) and Peters et al. (2018b) perform a layerwise analysis of BERT and BERT-like models to assign them syntactic and semantic meaning using probing classifiers. Si et al. (2019) question BERT’s working on QA tasks through adversarial attacks, similar to Jia and Liang (2017) and Mudrakarta et al. (2018). They point out that BERT is prone to be fooled by such attacks. Unlike these earlier works, we focus on analyzing BERT’s layers specifically for RCQA to understand their QA-specific roles and their behavior on potentially confusing quantifier questions.
- We thank Google for supporting Preksha Nema’s contribution through the Google Ph.D. Fellowship.
We calculate the integrated gradients at each layer, IGl(xi), for all passage words wi using Algorithm 1. We approximate the above integral across 50 uniform samples of α ∈ [0, 1]. We then compute importance scores for each wi by taking the Euclidean norm of IGl(wi) and normalizing it to get a probability distribution Il over the passage words.
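The computation above can be sketched in NumPy. This is a simplified stand-in, not the authors' implementation: `grad_fn` abstracts the model-specific gradient of the answer score with respect to the input embeddings, the all-zeros embedding is assumed as the baseline, and the function names are ours.

```python
import numpy as np

def integrated_gradients(embeddings, grad_fn, steps=50):
    """Approximate integrated gradients for each passage word embedding.

    embeddings: (n_words, d) array of input embeddings x_i.
    grad_fn: maps an (n_words, d) array to the gradient of the model's
    answer score w.r.t. that array.
    The path integral from the zero baseline to the input is approximated
    by a Riemann sum over `steps` uniform samples of alpha in [0, 1].
    """
    baseline = np.zeros_like(embeddings)
    total = np.zeros_like(embeddings)
    for alpha in np.linspace(0.0, 1.0, steps):
        total += grad_fn(baseline + alpha * (embeddings - baseline))
    return (embeddings - baseline) * total / steps

def importance_distribution(ig):
    """Importance Il: Euclidean norm of each word's IG vector,
    normalized to a probability distribution over passage words."""
    norms = np.linalg.norm(ig, axis=1)
    return norms / norms.sum()
```

For a linear scoring function the attributions reduce to the inputs themselves, which is a quick sanity check of the approximation.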
This further means the two layers consider different words as salient. We visualize heatmaps for the dev-splits of SQuAD (Figures 1a, 1b) and DuoRC (Figures 1c, 1d), averaging over 1000 samples in each case. We analyze the distribution in two parts: (i) we retain only the top-k scores in each layer and zero out the rest, which denotes the distribution’s head; (ii) we zero out the top-k scores in each layer and retain the rest, which denotes the distribution’s tail.
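The head/tail split of a layer's importance distribution can be sketched as follows (an illustration under the description above; the function name is ours).

```python
import numpy as np

def split_head_tail(importance, k=5):
    """Split a layer's importance distribution Il into its head
    (top-k scores kept, rest zeroed) and its tail (top-k scores
    zeroed, rest kept)."""
    order = np.argsort(importance)
    head = np.zeros_like(importance)
    head[order[-k:]] = importance[order[-k:]]
    tail = importance.copy()
    tail[order[-k:]] = 0.0
    return head, tail
```

By construction, head and tail sum back to the original distribution, so the two parts can be visualized separately without losing mass.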
- Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10.
- Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. CoRR, abs/1906.04341.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
- Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2016. Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.
- Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. CoRR, abs/1902.10186.
- Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. CoRR, abs/1707.07328.
- Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, abs/1705.03551.
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv eprints.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
- Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. 2018. Did the model understand the question? CoRR, abs/1805.05492.
- Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. CoRR, abs/1611.09268.
- Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- Matthew E Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. arXiv preprint arXiv:1808.08949.
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.
- Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. CoRR, abs/1804.07927.
- Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.
- Sofia Serrano and Noah A Smith. 2019. Is attention interpretable? arXiv preprint arXiv:1906.03731.
- Chenglei Si, Shuohang Wang, Min-Yen Kan, and Jing Jiang. 2019. What does BERT learn from multiple-choice reading comprehension datasets? arXiv preprint arXiv:1910.12391.
- Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. CoRR, abs/1703.01365.
- Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.
- Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. arXiv preprint arXiv:1908.04626.
- Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. CoRR, abs/1611.01604.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754–5764.