Generating Visual Explanations

ECCV, pp. 3-19, 2016.

Cited by: 279 | DOI: https://doi.org/10.1007/978-3-319-46493-0_1
Other links: dblp.uni-trier.de | arxiv.org

Abstract:

Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image aspects which justify visual predictions. We propose a new model that focuses on the discriminating properties of the visible object, jointly predicts a class label, and explains why the predicted label is appropriate for the image. We propose a novel loss function based on sampling and reinforcement learning that learns to generate sentences that realize a global sentence property, such as class specificity. Our results on a fine-grained bird species classification dataset show that our model is able to generate explanations which are not only consistent with an image but also more discriminative than descriptions produced by existing captioning methods.

Highlights
  • Explaining why the output of a visual system is compatible with visual evidence is a key component for understanding and interacting with AI systems [1]
  • We consider explanations as determining why a certain decision is consistent with visual evidence, and differentiate between introspection explanation systems which explain how a model determines its final output (e.g., “This is a Western Grebe because filter 2 has a high activation...”) and justification explanation systems which produce sentences detailing how visual evidence is compatible with a system output (e.g., “This is a Western Grebe because it has red eyes...”)
  • We concentrate on justification explanation systems because such systems may be more useful to non-experts who do not have detailed knowledge of modern computer vision systems [1]
  • We demonstrate that our model produces visual explanations by showing that our generated explanations fulfill the two aspects of our proposed definition of visual explanation and are image relevant and class relevant
  • Our work is an important step towards explaining deep visual models
Methods
  • Unlike other image-sentence datasets, every image in the CUB dataset belongs to a class, and sentences as well as images are associated with a single label
  • This property makes this dataset unique for the visual explanation task, where the aim is to generate sentences that are both discriminative and class-specific (a minimal loss sketch follows this list).
  • The authors stress that sentences collected in [30] were not collected for the task of visual explanation
  • They do not explain why an image belongs to a certain class, but rather include descriptive details about each bird class
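As the abstract notes, the model is trained with a loss based on sampling and reinforcement learning so that generated sentences realize class specificity. The sketch below shows one way such a loss could be assembled, assuming a caption decoder conditioned on image features and a class label plus a separate sentence classifier that scores sentences against classes; `decoder`, `sentence_classifier`, and all parameter names are illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): relevance + discriminative loss for
# an explanation generator, following the abstract's description of a
# sampling / reinforcement-learning loss. `decoder` and `sentence_classifier`
# are assumed modules; all names are illustrative.
import torch
import torch.nn.functional as F

def explanation_loss(decoder, sentence_classifier, image_feat, class_label,
                     reference_tokens, lambda_disc=0.5):
    # Relevance term: teacher-forced cross-entropy against the reference
    # description, conditioned on image features and the class label.
    logits = decoder(image_feat, class_label, reference_tokens[:, :-1])
    relevance = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                reference_tokens[:, 1:].reshape(-1))

    # Discriminative term: sample a sentence from the decoder and reward it
    # by the probability a sentence classifier assigns to the correct class.
    # The REINFORCE estimator (Williams, 1992) turns this non-differentiable
    # reward into a gradient on the sampled words' log-probabilities.
    sampled_tokens, log_probs = decoder.sample(image_feat, class_label)
    with torch.no_grad():
        class_probs = sentence_classifier(sampled_tokens)  # (B, num_classes)
        reward = class_probs.gather(1, class_label.unsqueeze(1)).squeeze(1)
    discriminative = -(reward * log_probs.sum(dim=1)).mean()

    return relevance + lambda_disc * discriminative
```

The cross-entropy term keeps sentences grounded in the image and reference text, while the sampled-sentence reward nudges wording toward attributes that separate the true class from similar ones.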
Results
  • The authors demonstrate that the model produces visual explanations by showing that the generated explanations fulfill the two aspects of the proposed definition of visual explanation and are image relevant and class relevant.
  • The authors' explanation model produces a higher class similarity score than other models by a substantial margin.
  • The authors' ranking metric is quite difficult; sentences must include enough information to differentiate between very similar bird classes without looking at an image, and the results clearly show that the explanation model performs best at this task (a minimal sketch of the rank computation follows this list).
  • The authors' explanation model has the best mean rank, followed by the description model; this trend resembles the one seen when evaluating class relevance.
  • An obvious example of this is in Figure 5, row 7, where the explanation model includes only attributes present in the image of the “hooded merganser”, whereas all other models mention at least one incorrect attribute
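To make the class-relevance evaluation concrete, here is a minimal sketch of one way such a rank could be computed: score the generated sentence against every class's reference sentences and record where the true class lands. The unigram-overlap scorer is only a placeholder for a real sentence metric such as CIDEr, and all names and toy data are illustrative.

```python
# Hedged sketch of a class-rank metric: rank each generated sentence's true
# class by sentence similarity to every class's reference sentences.
from collections import Counter

def similarity(sentence, references):
    """Placeholder similarity: mean unigram overlap with reference sentences."""
    words = Counter(sentence.split())
    scores = []
    for ref in references:
        ref_words = Counter(ref.split())
        overlap = sum((words & ref_words).values())  # shared word counts
        scores.append(overlap / max(len(ref.split()), 1))
    return sum(scores) / len(scores)

def class_rank(generated, true_class, class_references):
    """Rank of the true class (1 = best) when classes are sorted by
    similarity between the generated sentence and their references."""
    scores = {c: similarity(generated, refs)
              for c, refs in class_references.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(true_class) + 1

# Toy usage: two bird classes with one reference sentence each.
refs = {"hooded_merganser": ["this bird has a black and white crest and a black bill"],
        "western_grebe": ["this bird has red eyes and a long white neck"]}
print(class_rank("this bird has a large crest and a black bill",
                 "hooded_merganser", refs))  # -> 1 (lower is better)
```

Averaging this rank over the test set gives a single number where, as in Table 1, lower is better.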
Conclusion
  • Explanation is an important capability for the deployment of intelligent systems. Visual explanation is a rich research direction, especially as the field of computer vision continues to employ and improve deep models which are not interpretable.
  • The authors' work is an important step towards explaining deep visual models.
  • [Figure residue: qualitative examples comparing models, with explanation prompts “This is a Black-Capped Vireo because...” and “This is a White Pelican because...”, and the description “this bird has a white belly and breast, black and white wings with a white wingbar”.]
Tables
  • Table 1: Comparison of our explanation model to our definition and description baselines, as well as the explanation-label and explanation-discriminative (explanation-dis. in the table) ablation models. We demonstrate that our generated explanations are image relevant by computing METEOR and CIDEr scores (higher is better). We demonstrate class relevance using a class similarity metric (higher is better) and a class rank metric (lower is better) (see Section 4 for details). Finally, we ask experienced bird watchers to rank our explanations. On all metrics, our explanation model performs best
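For reference, the table's image-relevance metrics can be computed with the standard COCO caption evaluation toolkit. A minimal sketch for CIDEr, assuming the `pycocoevalcap` package is installed and using toy placeholder ids and sentences:

```python
# Sketch: corpus-level CIDEr score for image relevance, via the COCO caption
# evaluation toolkit (assumes `pycocoevalcap` is installed; toy data only).
from pycocoevalcap.cider.cider import Cider

# References (ground truth) and candidates (generated), keyed by image id.
gts = {"img0": ["this bird has red eyes and a long white neck"],
       "img1": ["this bird has a black and white crest and a black bill"]}
res = {"img0": ["this is a bird with red eyes and a white neck"],
       "img1": ["this bird has a white crest and a black bill"]}

corpus_score, per_image = Cider().compute_score(gts, res)
print(f"CIDEr: {corpus_score:.3f}")  # higher is better, as in Table 1
```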
Related work
  • Explanation. Automatic reasoning and explanation have a long and rich history within the artificial intelligence community [1,13,14,15,16,17,18,19]. Explanation systems span a variety of applications, including explaining medical diagnoses [13], simulator actions [14,15,16,19], and robot movements [17]. Many of these systems are rule-based [13] or rely solely on filling in a predetermined template [16]. Methods such as [13] require expert-level explanations and decision processes. In contrast, our visual explanation method is learned directly from data by optimizing explanations to fulfill our two proposed visual explanation criteria. Our model is not provided with expert explanations or decision processes, but rather learns from visual features and text descriptions. In contrast to systems like [13,14,15,16,17,18], which aim to explain the underlying mechanism behind a decision, the authors of [1] concentrate on why a prediction is justifiable to a user. Such systems are advantageous because they do not rely on user familiarity with the design of an intelligent system in order to provide useful information.
Funding
  • This work was supported by DARPA, AFRL, DoD MURI award N000141110688, NSF awards IIS-1427425 and IIS-1212798, and the Berkeley Vision and Learning Center
  • Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD)
  • Lisa Anne Hendricks is supported by an NDSEG fellowship
References
  1. Biran, O., McKeown, K.: Justification narratives for individual classifications. In: Proceedings of the AutoML Workshop at ICML 2014 (2014)
  2. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012) 1097–1105
  3. Gao, Y., Beijbom, O., Zhang, N., Darrell, T.: Compact bilinear pooling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  4. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. In: Proceedings of the International Conference on Machine Learning (ICML) (2013)
  5. Teach, R.L., Shortliffe, E.H.: An analysis of physician attitudes regarding computer-based clinical consultation systems. In: Use and Impact of Computers in Clinical Medicine. Springer (1981) 68–85
  6. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009) 248–255
  7. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: CVPR (2015)
  8. Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  9. Karpathy, A., Li, F.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
  10. Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the International Conference on Machine Learning (ICML) (2015)
  11. Kiros, R., Salakhutdinov, R., Zemel, R.: Multimodal neural language models. In: Proceedings of the 31st International Conference on Machine Learning (ICML) (2014) 595–603
  12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8) (1997) 1735–1780
  13. Shortliffe, E.H., Buchanan, B.G.: A model of inexact reasoning in medicine. Mathematical Biosciences 23(3) (1975) 351–379
  14. Lane, H.C., Core, M.G., Van Lent, M., Solomon, S., Gomboc, D.: Explainable artificial intelligence for training and tutoring. Technical report, DTIC Document (2005)
  15. Core, M.G., Lane, H.C., Van Lent, M., Gomboc, D., Solomon, S., Rosenberg, M.: Building explainable artificial intelligence systems. In: Proceedings of the National Conference on Artificial Intelligence (AAAI) (2006) 1766
  16. Van Lent, M., Fisher, W., Mancuso, M.: An explainable artificial intelligence system for small-unit tactical behavior. In: Proceedings of the National Conference on Artificial Intelligence (AAAI) (2004) 900–907
  17. Lomas, M., Chevalier, R., Cross II, E.V., Garrett, R.C., Hoare, J., Kopack, M.: Explaining robot actions. In: Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction (2012) 187–188
  18. Lacave, C., Díez, F.J.: A review of explanation methods for Bayesian networks. The Knowledge Engineering Review 17(2) (2002) 107–127
  19. Johnson, W.L.: Agents that learn to explain themselves. In: AAAI (1994) 1257–1263
  20. Berg, T., Belhumeur, P.: How do you tell a blackbird from a crow? In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2013) 9–16
  21. Jiang, Z., Wang, Y., Davis, L., Andrews, W., Rozgic, V.: Learning discriminative features via label consistent neural network. arXiv preprint arXiv:1602.01168 (2016)
  22. Doersch, C., Singh, S., Gupta, A., Sivic, J., Efros, A.: What makes Paris look like Paris? ACM Transactions on Graphics 31(4) (2012)
  23. Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A., Berg, T.: Baby talk: Understanding and generating simple image descriptions. In: CVPR (2011)
  24. Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., Saenko, K.: YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2013) 2712–2719
  25. Fang, H., Gupta, S., Iandola, F., Srivastava, R.K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J.C., et al.: From captions to visual concepts and back. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015) 1473–1482
  26. Mao, J., Xu, W., Yang, Y., Wang, J., Yuille, A.L.: Explain images with multimodal recurrent neural networks. In: NIPS Deep Learning Workshop (2014)
  27. Jia, X., Gavves, E., Fernando, B., Tuytelaars, T.: Guiding long-short term memory for image caption generation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015)
  28. Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  29. Lampert, C., Nickisch, H., Harmeling, S.: Attribute-based classification for zero-shot visual object categorization. TPAMI (2013)
  30. Reed, S., Akata, Z., Lee, H., Schiele, B.: Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  31. Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Learning to compose neural networks for question answering. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) (2016)
  32. Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., Fei-Fei, L.: Every moment counts: Dense detailed labeling of actions in complex videos. arXiv preprint arXiv:1507.05738 (2015)
  33. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning (1992)
  34. Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology (2011)
  35. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia (2014) 675–678
  36. Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (2005) 65–72
  37. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015) 4566–4575
  38. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL) (2002) 311–318
  39. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to WordNet: An on-line lexical database. International Journal of Lexicography 3(4) (1990) 235–244
  40. Hendricks, L.A., Venugopalan, S., Rohrbach, M., Mooney, R., Saenko, K., Darrell, T.: Deep compositional captioning: Describing novel object categories without paired training data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  41. Mao, J., Wei, X., Yang, Y., Wang, J., Huang, Z., Yuille, A.L.: Learning like a child: Fast novel visual concept learning from sentence descriptions of images. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015) 2533–2541