SQuINTing at VQA Models: Introspecting VQA Models With Sub-Questions

CVPR, pp. 10000-10008, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.01002
Our evaluation shows that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.

Abstract:

Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing existence, properties, and spatial relationships of entities, a significant portion of questions pose challenges that correspond to reasoning tasks - tasks that can only be answered […]

Introduction
  • Human cognition is thought to be compositional in nature: the visual system recognizes multiple aspects of a scene which are combined into shapes [7] and understandings.
  • Perception questions require only visual perception to recognize the existence, physical properties, or spatial relationships of entities, such as “What color is the banana?” or “What is to the left of the man?”. Reasoning questions, in contrast, require composing multiple perceptual tasks with logic and prior knowledge about the world, such as “Is the banana ripe enough to eat?”
Highlights
  • To answer the question, “Is the banana ripe enough to eat?” (Figure 1), a Visual Question Answering model has to be able to detect the bananas and extract associated properties such as size and color, understand what the question is asking, and reason about how these properties relate to known properties of edible bananas and how they manifest
  • We show that state-of-the-art Visual Question Answering models have similar accuracy on Perception and Reasoning questions but struggle with consistency: in 28.14% of the cases where models answer the Reasoning question correctly, they fail to answer the corresponding Perception sub-question. This highlights the risk that models may be learning to answer Reasoning questions by exploiting common answers and dataset biases rather than genuine reasoning (a sketch of this consistency computation follows this list).
  • The Attention Correlation numbers indicate that Sub-Question Importance-aware Network Tuning really is helping the model use the appropriate visual grounding at test time, even though the model was trained on VQAv1 and evaluated on VQAv2
  • We proposed preliminary approaches that seem promising: finetuning on VQA-introspect and Sub-Question Importance-aware Network Tuning (SQuINT) both improve the consistency of the SOTA model with no discernible loss in accuracy, and SQuINT also yields qualitatively better attention maps.
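  • As a companion to the consistency numbers above, here is a minimal sketch (in Python) of the quadrant bookkeeping, assuming each evaluation record pairs a main Reasoning question with its Perception sub-questions and per-question correctness flags; the names (records, main_correct, sub_correct) are illustrative and not taken from the paper's released code:

    from collections import Counter

    def consistency_report(records):
        """records: list of dicts like {"main_correct": bool, "sub_correct": [bool, ...]}."""
        quadrants = Counter()
        for r in records:
            # A main question counts as sub-consistent only if every sub-question is correct.
            subs_ok = all(r["sub_correct"])
            key = ("M+" if r["main_correct"] else "M-") + ("S+" if subs_ok else "S-")
            quadrants[key] += 1

        main_correct = quadrants["M+S+"] + quadrants["M+S-"]
        # Consistency: among Reasoning questions answered correctly, the fraction whose
        # Perception sub-questions are also answered correctly.
        consistency = quadrants["M+S+"] / max(main_correct, 1)
        # Reasoning accuracy counts only whether the main question is correct (M+S+ plus M+S-).
        reasoning_acc = main_correct / max(len(records), 1)
        return quadrants, consistency, reasoning_acc

    # Example: if 28.14% of correctly answered Reasoning questions miss a sub-question,
    # consistency is roughly 0.72.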
Methods
  • Table 1 compares three configurations: Pythia, Pythia finetuned on VQA-introspect, and Pythia finetuned on VQA-introspect with SQuINT, reporting consistency metrics (quadrant counts, Consistency %, and Attention Correlation) alongside Overall and Reasoning VQA accuracy.
  • The results in Table 1 indicate that fine-tuning on VQA-introspect increases consistency without hurting overall accuracy or Reasoning accuracy.
  • The Attention Correlation numbers indicate that SQuINT really is helping the model use the appropriate visual grounding at test time, even though the model was trained on VQAv1 and evaluated on VQAv2
  • This effect does not seem to happen with naive finetuning on VQA-introspect.
  • The model finetuned with SQuINT, on the other hand, attends to regions that are informative for both the main question and the sub-questions.
  • This is further indication that SQuINT is helping the model reason in ways that will generalize when it answers Reasoning questions correctly, rather than relying on shortcuts (a sketch of a SQuINT-style objective follows this list).
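  • The following is a hedged sketch of what a SQuINT-style fine-tuning objective could look like: answer losses for the main Reasoning question and its Perception sub-question, plus a term that pulls the model's main-question attention toward its sub-question attention so that reasoning stays grounded in the same regions. The specific loss form (MSE) and the weight lambda_att are assumptions for illustration, not the paper's published formulation:

    import torch.nn.functional as F

    def squint_style_loss(main_logits, main_labels, sub_logits, sub_labels,
                          main_attn, sub_attn, lambda_att=0.5):
        """main_attn, sub_attn: [batch, num_regions] attention over the same image regions."""
        # Supervise the answers to both the Reasoning question and its Perception sub-question.
        ans_loss = F.cross_entropy(main_logits, main_labels) + F.cross_entropy(sub_logits, sub_labels)
        # Encourage the main-question attention to match the (detached) sub-question attention,
        # i.e. attend to the regions that are informative for the sub-question.
        attn_loss = F.mse_loss(main_attn, sub_attn.detach())
        return ans_loss + lambda_att * attn_loss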
Conclusion
  • Discussion and Future Work: The VQA task requires multiple capabilities in different modalities and at different levels of abstraction.
  • Similar efforts to ours could be employed at different points on the abstraction scale, e.g., further dividing complex Perception questions into simpler components, or further dividing the Reasoning part into different forms of background knowledge, logic, etc.
  • The authors consider such efforts crucial in the quest to evaluate and train models that truly generalize, and hope VQA-introspect spurs more research in that direction.
Tables
  • Table 1: Results on the held-out VQAv2 validation set for (1) consistency metrics along the four quadrants described in Section 5, together with the Consistency and Attention Correlation metrics defined there, and (2) Overall and Reasoning accuracy. Reasoning accuracy counts only whether the main question is answered correctly (M✓S✓ + M✓S✗).
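  • Below is a hedged sketch of an attention-correlation measure in the spirit of the Attn Corr column: it compares the attention the model places on image regions when answering the main Reasoning question with the attention it uses for the Perception sub-question. The choice of Spearman rank correlation is an assumption for illustration; the paper's exact definition is the one described in its Section 5:

    import numpy as np
    from scipy.stats import spearmanr

    def attention_correlation(main_attn, sub_attn):
        """main_attn, sub_attn: attention weights over the same image regions, shape [num_regions]."""
        rho, _ = spearmanr(np.asarray(main_attn), np.asarray(sub_attn))
        # Higher values mean the model grounds both questions in similar regions.
        return rho

    # Identical attention maps give rho = 1.0; unrelated maps hover near 0.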
Related work
  • Visual Question Answering [3], one of the most widely studied vision-and-language problems, requires associating image content with natural language questions and answers (thus combining perception, language understanding, background knowledge and reasoning). However, it is possible for models to do well on the task by exploiting language and dataset biases, e.g. answering “yellow” to “What color is the banana?” without regard for the image or by answering “yes” to most yes-no questions [1, 12, 18, 21, 2]. This motivates additional forms of evaluation, e.g. checking if the model can understand question rephrasings [20] or whether it exhibits logical consistency [16]. In this work, we present a novel evaluation of questions that require reasoning capabilities, where we check for consistency between how models answer higher level Reasoning questions and how they answer corresponding Perception sub-questions.

    A variety of datasets have been released with attention annotations on the image pointing to regions that are important for answering questions ([4, 10]), with corresponding work on enforcing such grounding [17, 14, 18]. Our work is complementary to these approaches, as we provide language-based grounding (rather than visual), and further evaluate the link between perception capabilities and how models compose them to answer Reasoning tasks. Closer to our work is the dataset of Park et al. [10], where natural language justifications are associated with (question, answer) pairs. However, most of the questions contemplated (like much of the VQA dataset) pertain to perception (e.g. for the question-answer “What is the person doing? Snowboarding”, the justification is “...they are on a snowboard ...”). Furthermore, it is hard to use natural language justifications to evaluate models that do not generate similar rationales (i.e. most SOTA models), and difficult to define metrics even for models that do. In contrast, our dataset and evaluation are in the same modality (QA) that models are already trained to handle.
Funding
  • The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, and Amazon.
References
  • Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV, 2018.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
  • Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? In EMNLP, 2016.
  • Jerry A. Fodor and Zenon W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71, 1988.
  • Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
  • Donald D Hoffman and Whitman A Richards. Parts of recognition. Cognition, 18(1-3):65–96, 1984.
  • Drew A Hudson and Christopher D Manning. Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067, 2018.
  • Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
  • Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8779–8788, 2018.
  • Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: The winning entry to the VQA Challenge 2018. arXiv preprint arXiv:1807.09956, 2018.
  • Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, and Jiebo Luo. Tell-and-answer: Towards explainable visual question answering using attributes and captions. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1338–1346, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics.
  • Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Academic Press, 1989.
  • Tingting Qiao, Jianfeng Dong, and Duanqing Xu. Exploring human-like attention supervision in visual question answering. In AAAI, 2018.
  • Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, pages 2953–2961, 2015.
  • Marco Tulio Ribeiro, Carlos Guestrin, and Sameer Singh. Are red roses red? Evaluating consistency of question-answering models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6174–6184, Florence, Italy, July 2019. Association for Computational Linguistics.
  • Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra, et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. Taking a hint: Leveraging explanations to make vision and language models more grounded. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, and Ece Kamar. SQuINTing at VQA models: Interrogating VQA models with sub-questions. arXiv preprint arXiv:2001.06927, 2020.
  • Meet Shah, Xinlei Chen, Marcus Rohrbach, and Devi Parikh. Cycle-consistency for robust visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6649–6658, 2019.
  • Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Yin and Yang: Balancing and answering binary visual questions. In CVPR, 2016.