Obtaining Faithful Interpretations from Compositional Neural Networks

ACL, pp. 5594-5608, 2020.

Abstract:

Neural module networks (NMNs) are a popular approach for modeling compositionality: they achieve high accuracy when applied to problems in language and vision, while reflecting the compositional structure of the problem in the network architecture. However, prior work implicitly assumed that the structure of the network modules, describing the abstract reasoning process, provides a faithful explanation of model behavior, without evaluating this assumption. In this work, we propose and conduct a systematic evaluation of the intermediate outputs of NMNs on NLVR2 and DROP, two datasets which require composing multiple reasoning steps. We find that the intermediate outputs differ from the expected output, illustrating that the network structure does not provide a faithful explanation of model behavior. To remedy that, we train the model with auxiliary supervision and propose particular choices for module architecture that yield much better faithfulness, at only a small decrease in accuracy.
Introduction
  • Models that can read text and reason about it in a particular context have recently been gaining increased attention, leading to the creation of multiple datasets that require reasoning in both the visual and textual domains (Johnson et al., 2016; Suhr et al., 2017; Talmor and Berant, 2018; Yang et al., 2018; Suhr et al., 2019; Hudson and Manning, 2019; Dua et al., 2019).
  • The authors show on both NLVR2 (Suhr et al., 2019) and DROP (Dua et al., 2019) that training an NMN using end-task supervision, even with gold programs, does not yield module-wise faithfulness, i.e., the modules do not perform their intended reasoning task.
  • Neural module networks (NMNs) facilitate interpretability of their predictions: the reasoning steps can be read off the structured program, and the outputs of those intermediate steps are exposed during execution; a schematic sketch follows this list.
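To make the interpretability claim concrete, here is a minimal, self-contained sketch (our illustration, not the authors' released code) of how a program such as count(filter[black](find[dogs])) executes over proposed bounding boxes. The toy find, filter, and count functions stand in for the learned modules of Table 3; every name and number below is illustrative.

```python
# Schematic NMN execution over box proposals (illustrative, not the paper's code).
import numpy as np

def find(box_scores):
    # In a real NMN this is a learned scorer over box features conditioned
    # on the query; here we take precomputed per-box match probabilities.
    return box_scores

def filter_(probs, property_scores):
    # Intersect evidence: a box must match both the query and the property.
    return probs * property_scores

def count(probs):
    # "Sum-count": the expected number of matching boxes.
    return float(probs.sum())

# Toy example with 4 box proposals.
find_dogs = np.array([0.9, 0.8, 0.1, 0.0])   # P(box depicts a dog)
is_black  = np.array([1.0, 0.1, 0.9, 0.2])   # P(box is black)

filtered = filter_(find(find_dogs), is_black)
print(filtered)         # [0.9  0.08 0.09 0.  ] -- inspectable intermediate output
print(count(filtered))  # 1.07 -- expected number of black dogs
```

Because each step returns an explicit distribution over boxes, a failure of faithfulness is visible as an intermediate output that does not match the module's intended semantics.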
Highlights
  • We provide three primary contributions regarding faithfulness in neural module networks (NMNs)
  • We examine NMNs for both text and images, and describe their modules
  • All reasoning steps taken by both the Visual-NMN and the Text-NMN can be discerned from the program and the intermediate module outputs
  • We introduce the concept of module-wise faithfulness, a systematic evaluation of whether each module in a trained NMN has correctly learned its intended operation, judged by the correctness of its intermediate outputs
Results
  • As illustrated in Figure 2, all reasoning steps taken by both the Visual-NMN and the Text-NMN can be discerned from the program and the intermediate module outputs.
  • The authors introduce the concept of module-wise faithfulness, aimed at evaluating whether each module has correctly learned its intended operation by judging the correctness of its outputs in a trained NMN.
  • The authors sampled examples from the development set and annotated gold bounding boxes for each instance of find, filter, with-relation, and relocate.
  • The annotator draws the correct bounding boxes for each module in the gold program, similar to the output in Figure 2.
  • To measure module-wise faithfulness in Text-NMN, the authors obtain annotations for the set of spans that should be output by each module in the gold program (as seen in Figure 2). Ideally, all modules should assign high probability to tokens that appear in the gold spans and zero probability to all other tokens; a sketch of such a score appears after this list.
  • The authors demonstrate that training NMNs using end-task supervision alone does not yield module-wise faithfulness for either visual or textual reasoning.
  • For module-wise faithfulness evaluation, 536 examples from the development set were annotated with the gold output for each module by experts.
  • Module-wise faithfulness is measured on 215 manually-labeled questions from the development set, which are annotated with gold programs and module outputs.
  • Textual reasoning: as seen in Table 2, when trained on DROP using question-program supervision, the model achieves 65.3 F1 and a faithfulness score of 11.2 (lower is better).
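A hedged sketch of how such a token-level faithfulness score could be computed for Text-NMN. The paper reports a score where lower is better (see Table 2), which is consistent with a cross-entropy-style loss; the exact formula and all names below are our assumptions, not the paper's.

```python
import numpy as np

def module_faithfulness(token_probs, gold_mask, eps=1e-12):
    """Binary cross-entropy of a module's token probabilities vs. gold spans.

    token_probs: predicted probability per passage token (module output).
    gold_mask:   1 for tokens inside an annotated gold span, 0 otherwise.
    Lower is better: a faithful module puts high probability on gold tokens
    and near-zero probability everywhere else.
    """
    p = np.clip(token_probs, eps, 1.0 - eps)
    return float(-np.mean(gold_mask * np.log(p) + (1 - gold_mask) * np.log(1 - p)))

probs = np.array([0.9, 0.8, 0.05, 0.1])  # e.g. output of find[touchdown run]
gold  = np.array([1.0, 1.0, 0.0,  0.0])
print(module_faithfulness(probs, gold))  # ~0.12 (low: the module is faithful)
```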
Conclusion
  • As with the Visual-NMN, this shows that supervising intermediate modules in a program leads to better faithfulness.
  • Figure 3c shows that find outputs meaningless probabilities for most of the bounding boxes when trained with Layer-count, yet the count module still produces the correct value; the toy contrast below illustrates how this can happen.
  • Like the gold module-output annotations that the authors provide and evaluate against, the HotpotQA (Yang et al., 2018) and CoQA (Reddy et al., 2019) datasets include supporting facts or rationales for the answers to their questions, which can be used for both supervision and evaluation.
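The following toy contrast (our construction, with a generic learned layer standing in for the paper's Layer-count architecture) shows how a parameterized count module can reach the right answer while leaving find unconstrained, whereas a sum-based count must rely on meaningful find outputs.

```python
import numpy as np

def sum_count(box_probs):
    # Faithful by construction: the count is the total probability mass,
    # so meaningless find outputs produce meaningless counts.
    return float(box_probs.sum())

def layer_count(box_probs, W, b):
    # A learned affine layer can fit the right count from spurious signals,
    # e.g. by ignoring its input entirely via the bias term.
    return float(W @ box_probs + b)

meaningless = np.array([0.5, 0.5, 0.5, 0.5])  # find output carrying no information
print(sum_count(meaningless))                         # 2.0 -- visibly wrong
print(layer_count(meaningless, np.zeros(4), b=3.0))   # 3.0 -- "correct" answer,
                                                      # yet find explains nothing
```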
Tables
  • Table 1: Faithfulness and accuracy on NLVR2. "decont." refers to decontextualized word representations. Precision, recall, and F1 are averages across examples, and thus F1 is not the harmonic mean of the corresponding precision and recall
  • Table 2: Faithfulness and performance scores for various NMNs on DROP. ∗Lower is better. †min-max is the average faithfulness of find-min-num and find-max-num; find-arg, of find-num and find-date
  • Table 3: Implementations of modules for the NLVR2 NMN. The first five contain parameters; the rest are deterministic
  • Table 4: Faithfulness scores on NLVR2 using the cumulative precision/recall/F1 evaluation
  • Table 5: Faithfulness scores on NLVR2 using the average-over-module-occurrences evaluation
  • Table 6: Faithfulness scores on NLVR2 using an IoU threshold of 10⁻⁸ and example-wise averaging
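A small worked example of the note in Table 1: because precision, recall, and F1 are each averaged across examples, the reported F1 generally differs from the harmonic mean of the reported precision and recall. The numbers below are illustrative, not from the paper.

```python
# Two examples with per-example precision/recall.
p = [1.0, 0.2]
r = [0.5, 1.0]
f1 = [2 * pi * ri / (pi + ri) for pi, ri in zip(p, r)]  # [0.667, 0.333]

avg_p, avg_r, avg_f1 = sum(p) / 2, sum(r) / 2, sum(f1) / 2
print(avg_p, avg_r, avg_f1)                 # 0.6 0.75 0.5
print(2 * avg_p * avg_r / (avg_p + avg_r))  # 0.667 != 0.5 (the reported F1)
```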
Related work
  • NMNs were originally introduced for visual question answering and applied to datasets with synthetic language and images.

    [Figure: examples of intermediate find outputs. (a) find[llamas] for the utterance "the llamas in both images are eating"; (b) find[people] followed by count for "there are three people"; (c) find[touchdown run] over a DROP passage describing scoring plays by the Redskins (Clinton Portis touchdown runs) and St. Louis (a 75-yard Oshiomogho Atogwe touchdown and a 49-yard field goal).]
Funding
  • This research was partially supported by The Yandex Initiative for Machine Learning, the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC DELPHI 802800), funding by the ONR under Contract No. N00014-19-12620, and sponsorship from the DARPA LwLL program under Contract No. FA8750-19-2-0201.
References
  • Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. In Proceedings of NAACL-HLT, pages 1545–1554.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of ICCV.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
  • Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of NAACL-HLT, pages 2368–2378.
  • Nitish Gupta, Kevin Lin, Dan Roth, Sameer Singh, and Matt Gardner. 2020. Neural module networks for reasoning over text. In Proceedings of ICLR.
  • Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
  • Minghao Hu, Yuxing Peng, Zhen Huang, and Dongsheng Li. 2019. A multi-type multi-span network for reading comprehension that requires discrete reasoning. In Proceedings of EMNLP-IJCNLP, pages 1596–1606.
  • Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of ICCV, pages 804–813.
  • Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of CVPR, pages 6700–6709.
  • Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of ACL.
  • Yichen Jiang and Mohit Bansal. 2019. Self-assembling modular networks for interpretable multi-hop reasoning. In Proceedings of EMNLP-IJCNLP, pages 4473–4483.
  • Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2016. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of CVPR, pages 1988–1997.
  • Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.
  • Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of EMNLP, pages 1516–1526.
  • Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. 2019. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In Proceedings of ICLR.
  • Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In Proceedings of ACL, pages 4249–4257.
  • E. W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses: An Introduction. Wiley.
  • Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of NIPS, pages 91–99.
  • Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of IJCAI.
  • Howard Seltman. 2018. Approximations for mean and variance of a ratio.
  • Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. 2017. A corpus of natural language for visual reasoning. In Proceedings of ACL (Volume 2: Short Papers), pages 217–223.
  • Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of ACL, pages 6418–6428.
  • Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of NAACL-HLT, pages 641–651.
  • Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of EMNLP-IJCNLP, pages 5099–5110.
  • Alexander Trott, Caiming Xiong, and Richard Socher. 2018. Interpretable counting for visual question answering. In Proceedings of ICLR.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.
  • Dan Ventura. 2007. CS478 Paired Permutation Test Overview. Accessed April 29, 2020.
  • Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of EMNLP-IJCNLP, pages 11–20.
  • Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics, 8:183–198.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of EMNLP, pages 2369–2380.
  • Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of COLING, pages 947–953.
  • Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. 2018. Learning to count objects in natural images for visual question answering. In Proceedings of ICLR.
Appendix
  • We list all modules for the Visual-NMN in Table 3. For the Text-NMN, as mentioned, we use all modules described in Gupta et al. (2020). In this work, we additionally introduce (a) addition and subtraction modules, which take as input two distributions over numbers mentioned in the passage and produce a distribution over all values obtainable by addition or subtraction; treating the inputs as independent, the output is the distribution of the random variable Z = X + Y (for addition) or Z = X − Y (for subtraction); and (b) an extract-answer module, which produces two distributions over the passage tokens denoting the probabilities for the start and end of the answer span, computed by mapping the passage token representations through a simple MLP and a softmax operation. A sketch of the addition module's output distribution follows.
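A minimal sketch of the addition module's output distribution, assuming (as stated above) that the two input distributions are treated as independent, so that the distribution of Z = X + Y is the discrete convolution of the inputs. The value set and all names are illustrative, not the paper's.

```python
def add_distributions(values, p_x, p_y):
    """Distribution of Z = X + Y for X, Y over the same number values.

    For subtraction, replace vx + vy with vx - vy below.
    """
    out = {}
    for vx, px in zip(values, p_x):
        for vy, py in zip(values, p_y):
            out[vx + vy] = out.get(vx + vy, 0.0) + px * py
    support = sorted(out)
    return support, [out[z] for z in support]

values = [3, 49, 75]       # numbers mentioned in the passage
p_x = [0.7, 0.2, 0.1]      # e.g. a distribution produced by find-num
p_y = [0.1, 0.1, 0.8]
print(add_distributions(values, p_x, p_y))
# ([6, 52, 78, 98, 124, 150], [0.07, 0.09, 0.57, 0.02, 0.17, 0.08])
```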
  • Average over module occurrences: for each occurrence of a module type, we compute precision, recall, and F1 (the harmonic mean of precision and recall); the overall precision, recall, and F1 for that module type are then the averages across occurrences. Note that a module can occur multiple times in a single program, and each image is considered a separate occurrence. The results using this method are in Table 5; a sketch of this computation appears after the training details below.
  • Visual reasoning: we use the published pretrained weights and the same training configuration as LXMERT (Tan and Bansal, 2019), with 36 bounding boxes proposed per image. Due to memory constraints, we restrict training data to examples having a gold program with at most 13 modules.
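A hedged sketch of the average-over-module-occurrences evaluation described above. The IoU-based matching of predicted boxes to gold boxes is our assumption, since the exact matching rule is not spelled out here; all names are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def occurrence_prf(pred, gold, thresh=0.5):
    """Precision/recall/F1 for a single module occurrence."""
    prec = (sum(any(iou(p, g) >= thresh for g in gold) for p in pred) / len(pred)
            if pred else 0.0)
    rec = (sum(any(iou(g, p) >= thresh for p in pred) for g in gold) / len(gold)
           if gold else 0.0)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def average_over_occurrences(occurrences):
    """Average P, R, F1 across (pred_boxes, gold_boxes) occurrences of a module type."""
    scores = [occurrence_prf(p, g) for p, g in occurrences]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# One occurrence of find: one predicted box, two gold boxes.
occ = [([(0, 0, 2, 2)], [(0, 0, 2, 2), (3, 3, 5, 5)])]
print(average_over_occurrences(occ))  # (1.0, 0.5, 0.667): P=1, R=0.5, F1=2/3
```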
  • We generated program annotations for NLVR2 by automatically canonicalizing its question decompositions in the Break dataset (Wolfson et al., 2020). The decompositions were originally annotated by Amazon Mechanical Turk workers; for each utterance, workers were asked to produce the correct decomposition and an utterance attention for each operator (module), whenever relevant.