An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction

Bhargavi Paranjape
Mandar Joshi
John Thickstun

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1938–1952, 2020.

Other Links: arxiv.org|academic.microsoft.com
TL;DR: We propose a new sparsity objective derived from the Information Bottleneck principle to extract rationales of desired conciseness.

Abstract:

Decisions of complex models for language understanding can be explained by limiting the inputs they are provided to a relevant subsequence of the original text — a rationale. Models that condition predictions on a concise rationale, while being more interpretable, tend to be less accurate than models that are able to use the entire context...

Introduction
  • Rationales that select the most relevant parts of an input text can help explain model decisions for a variety of language understanding tasks.
  • Models can be faithful to a rationale by using only the selected text as input for end-task prediction.
  • During learning, it is common to encourage sparsity by minimizing a norm (e.g., L0 or L1) on the rationale masks (Lei et al., 2016; Bastings et al., 2019).
  • The authors examine the sparsity-accuracy trade-off in these norm-minimization methods, which appear to over-penalize the rationale, trading task accuracy for sparsity.
  • The proposed objective is instead derived from the Information Bottleneck, which trades off I(X, Z) against I(Z, Y), where I(·, ·) denotes mutual information (a variational sketch follows this list).
  • This objective encourages Z to only retain as much information about X as is needed to predict Y.
  • The task loss encourages predicting the correct label y from z, which increases I(Z, Y).
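  • As a concrete reading of the objective above, here is a hedged sketch in standard variational-IB notation after Alemi et al. (2016); the sparse prior r(z) over the mask and the exact form below are a reconstruction for illustration, not necessarily the paper's own notation:

    % Sparse-prior variational IB loss (sketch; hedged reconstruction).
    % p_\theta(z \mid x): rationale extractor, q_\phi(y \mid z): end-task predictor,
    % r(z) = \prod_i \mathrm{Bernoulli}(m_i; \pi): sparse prior over the sentence mask.
    \mathcal{L}(\theta, \phi) \;=\;
        \mathbb{E}_{p_\theta(z \mid x)}\!\left[-\log q_\phi(y \mid z)\right]
        \;+\; \beta\,\mathrm{KL}\!\left(p_\theta(z \mid x)\,\|\,r(z)\right)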
Highlights
  • Bastings et al, 2019)
  • We find that in the semi-supervised setting, a modest amount of gold rationales (25% of training examples) closes the gap with a model that uses the full input.
  • The sparsity objective is derived from the Information Bottleneck (Tishby et al., 1999) objective (Figure 1).
  • We evaluate performance on five text classification tasks from the ERASER benchmark (DeYoung et al, 2019) and one regression task used in previous work (Lei et al, 2016)
  • To evaluate the quality of rationales, we report the token-level Intersection-Over-Union F1 (IOU F1), a relaxed measure for comparing two sets of text spans (a small sketch follows this list).
  • Bang et al (2019) use Information Bottleneck (IB) for post-hoc explanation for sentiment classification. They do not enforce a sparse prior, and as a result, cannot guarantee that the rationale is strictly smaller than the input. This means controlled sparsity, which we have shown to be crucial for task performance and rationale extraction, is harder to achieve in their model
  • We propose a new sparsity objective derived from the Information Bottleneck principle to extract rationales of desired conciseness
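  • A minimal sketch of how IOU F1 can be computed over predicted vs. gold rationale spans (illustrative only; the function names and the 0.5 matching threshold are assumptions, not the ERASER reference implementation):

    # Span-level IOU F1 sketch: spans are (start, end) token offsets, end exclusive.
    def iou(a, b):
        """Token-level intersection-over-union between two spans."""
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    def iou_f1(pred_spans, gold_spans, threshold=0.5):
        """A predicted span counts as a hit if it matches some gold span with IOU >= threshold."""
        if not pred_spans or not gold_spans:
            return float(pred_spans == gold_spans)  # both empty -> 1.0, one empty -> 0.0
        hits = sum(any(iou(p, g) >= threshold for g in gold_spans) for p in pred_spans)
        precision, recall = hits / len(pred_spans), hits / len(gold_spans)
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    print(iou_f1([(0, 12)], [(0, 10), (30, 40)]))  # -> 0.666...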
Methods
  • 2.1 Task and Method Overview

    The authors assume supervised text classification or regression data that contains tuples of the form (x, y).
  • The authors' goal is to learn a model that predicts y and extracts a rationale or explanation z, a latent subsequence of sentences in x, with the following properties:
  • 1. The prediction for y must be made using only z, so that the rationale is faithful to the model's decision. 2. z must be compact yet sufficient, i.e., it should contain as few sentences as possible without sacrificing the ability to correctly predict y.
  • Following Lei et al. (2016), the interpretable model learns a Boolean mask m = (m1, m2, ..., mn), where mi ∈ {0, 1} indicates whether the i-th sentence of x is kept in the rationale (a minimal sketch follows this list).
  • The authors elaborate on how sufficiency is attained using the Information Bottleneck.
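  • A minimal PyTorch-style sketch of this setup (assumed names and hyperparameters; a relaxed Bernoulli stands in for the Gumbel-Softmax reparameterization the paper cites; this is an illustration of the technique, not the authors' implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.distributions import Bernoulli, RelaxedBernoulli
    from torch.distributions.kl import kl_divergence

    class SparsePriorRationalizer(nn.Module):
        """Sketch: per-sentence Bernoulli mask, relaxed for differentiability,
        regularized toward a sparse prior Bernoulli(pi)."""

        def __init__(self, hidden_dim, num_labels, prior_pi=0.3, beta=1.0, temperature=0.5):
            super().__init__()
            self.mask_scorer = nn.Linear(hidden_dim, 1)      # scores p(m_i = 1 | sentence_i)
            self.classifier = nn.Linear(hidden_dim, num_labels)
            self.prior_pi, self.beta, self.temperature = prior_pi, beta, temperature

        def forward(self, sent_reprs, labels):
            # sent_reprs: (batch, num_sents, hidden_dim) sentence encodings (encoder assumed).
            mask_logits = self.mask_scorer(sent_reprs).squeeze(-1)      # (batch, num_sents)
            posterior = Bernoulli(logits=mask_logits)
            prior = Bernoulli(probs=torch.full_like(mask_logits, self.prior_pi))
            kl = kl_divergence(posterior, prior).sum(-1).mean()         # sparsity term

            # Differentiable (relaxed) sample of the Boolean mask.
            m = RelaxedBernoulli(self.temperature, logits=mask_logits).rsample()

            # Predict the label from the masked (rationale) sentences only.
            z = (m.unsqueeze(-1) * sent_reprs).sum(1) / m.sum(-1, keepdim=True).clamp(min=1e-6)
            task_loss = F.cross_entropy(self.classifier(z), labels)
            return task_loss + self.beta * kl, m

    Lowering prior_pi pushes the expected mask toward fewer selected sentences, which is how the desired conciseness would be controlled in this sketch.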
Results
  • Sparse IB attains task performance within 0.5–10% of the full-context model, despite using < 40% of the input sentences on average.
  • The authors observe a positive correlation between task performance and agreement with human rationales.
  • This is important since accurate models that better emulate human rationalization likely engender more trust
Conclusion
  • The authors propose a new sparsity objective derived from the Information Bottleneck principle to extract rationales of desired conciseness.
  • The authors' approach outperforms existing norm-minimization techniques in task performance and agreement with human annotations for rationales for tasks in the ERASER benchmark.
  • The sparse prior objective allows for straightforward and accurate control of the amount of sparsity desired in the rationales.
  • The authors are able to close the gap with models that use the full input with < 25% rationale annotations for a majority of the tasks.
Tables
  • Table1: Task and Rationale IOU F1 for our Sparse IB approach and baselines (Section 4.3) on test sets. Pipeline refers to the Bert-to-Bert method reported in DeYoung et al. (2019), while we use 25% training data in our semi-supervised setting (Section 3.3). We report MSE for BeerAdvocate, hence lower is better. BeerAdvocate has no training rationales. Gold IOU is 100.0. Validation set results can be found in Table 5 in the Appendix.
  • Table2: Average mask length (attained sparsity) of the output mask m over sentences (the Hamming weight) for Sparse IB and the Sparse Norm-C baseline, for a given prior π and different tasks, averaged over 100 runs. Sparse IB consistently achieves the sparsity level π used in the prior, while the norm-minimization approach (Sparse Norm-C) converges to a lower average sparsity for the mask.
  • Table3: Misclassified examples from the Movies and FEVER datasets show: (a) limitations in considering more complex linguistic phenomena like sarcasm; (b) overreliance on shallow lexical matching (unforced vs. forced); (c) limited world knowledge (south Georgia, Southeast region, South Florida). Legend: Model evidence, Gold evidence, Model and Gold evidence.
  • Table4: Hyperparameters used to report results
  • Table5: Final results of our unsupervised models on ERASER Dev Set experiments. We found that Sparse IB approach is not as sensitive to the parameter β and fix it to 1 to simplify experimental design. Hyperparameters for each dataset used for the final results are presented in Table 4
Related work
  • Interpretability Previous work on explaining model predictions can be broadly categorized into post hoc explanation methods and methods that integrate explanations into the model architecture. Post hoc explanation techniques (Ribeiro et al, 2016; Krause et al, 2017; Alvarez-Melis and Jaakkola, 2017) typically approximate complex decision boundaries with locally linear or low complexity models. While post hoc explanations often have the advantage of being simpler, they are not faithful by construction.

    On the other hand, methods that condition predictions on their explanations can be more trustworthy. Extractive rationalization (Lei et al, 2016) is one of the most well-studied of such methods in NLP, and has received increased attention with the recently released ERASER benchmark (DeYoung et al, 2019). Building on Lei et al (2016), Chang et al (2019) and Yu et al (2019) consider benefits like class-wise explanation extraction while Chang et al (2020) explore invariance to domain shift. Bastings et al (2019) employ a reparameterizable version of the bi-modal beta distribution (instead of Bernoulli) for the binary mask. This more expressive distribution may be able to complement our approach, as KL-divergence for it can be analytically computed (Nalisnick and Smyth, 2017).
Funding
  • Our interpretable model achieves task performance within 10% of a model of comparable size that uses the entire input
References
  • Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. 2016. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.
  • David Alvarez-Melis and Tommi Jaakkola. 2017. A causal framework for explaining the predictions of black-box sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 412–421.
  • Seojin Bang, Pengtao Xie, Heewook Lee, Wei Wu, and Eric Xing. 2019. Explaining a black-box using deep variational information bottleneck approach. arXiv preprint arXiv:1902.06918.
  • Joost Bastings, Wilker Aziz, and Ivan Titov. 2019. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2963–2977.
  • Shiyu Chang, Yang Zhang, Mo Yu, and Tommi Jaakkola. 2019. A game theoretic approach to class-wise selective rationalization. In Advances in Neural Information Processing Systems, pages 10055–10065.
  • Shiyu Chang, Yang Zhang, Mo Yu, and Tommi S. Jaakkola. 2020. Invariant rationalization. arXiv preprint arXiv:2003.09772.
  • Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2019. ERASER: A benchmark to evaluate rationalized NLP models. arXiv preprint arXiv:1911.03429.
  • Emil Julius Gumbel. 1948. Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Government Printing Office.
  • Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR 2017). OpenReview.net.
  • Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262.
  • Durk P. Kingma, Tim Salimans, and Max Welling. 2015. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583.
  • Josua Krause, Aritra Dasgupta, Jordan Swartz, Yindalon Aphinyanaphongs, and Enrico Bertini. 2017. A workflow for visual diagnostics of binary classifiers using instance-level explanations. In 2017 IEEE Conference on Visual Analytics Science and Technology (VAST), pages 162–172. IEEE.
  • Veronica Latcinnik and Jonathan Berant. 2020. Explaining question answering models through text generation. arXiv preprint arXiv:2004.05569.
  • Eric Lehman, Jay DeYoung, Regina Barzilay, and Byron C. Wallace. 2019. Inferring which medical treatments work from reports of clinical trials. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3705–3717.
  • Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117.
  • Xiang Lisa Li and Jason Eisner. 2019. Specializing word embeddings (for parsing) by information bottleneck. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2744–2754.
  • Julian McAuley, Jure Leskovec, and Dan Jurafsky. 2012. Learning attitudes and attributes from multi-aspect reviews. In 2012 IEEE 12th International Conference on Data Mining, pages 1020–1025. IEEE.
  • Eric Nalisnick and Padhraic Smyth. 2017. Stick-breaking variational autoencoders. In International Conference on Learning Representations (ICLR).
  • Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, page 271.
  • Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819.
  • Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.
  • Daniel S. Weld and Gagan Bansal. 2019. The challenge of crafting intelligible intelligence. Communications of the ACM, 62(6):70–79.
  • Peter West, Ari Holtzman, Jan Buys, and Yejin Choi. 2019. BottleSum: Unsupervised and self-supervised sentence summarization using the information bottleneck principle. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3743–3752.
  • Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256.
  • Mo Yu, Shiyu Chang, Yang Zhang, and Tommi Jaakkola. 2019. Rethinking cooperative rationalization: Introspective extraction and complement control. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4085–4094.
  • Andrey Zhmoginov, Ian Fischer, and Mark Sandler. 2019. Information-bottleneck approach to salient region discovery. arXiv preprint arXiv:1907.09578.
Appendix
  • A.1 Variational Information Bottleneck (Alemi et al., 2016): the authors first present an overview of the variational bound on IB introduced by Alemi et al. (2016) and then derive a modified version amenable to interpretability; in contrast to Alemi et al. (2016), their prior r(z) encodes the desired (sparsity) constraint on the marginal.