Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), pp. 2654-2665, 2018.


Abstract:

Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction, a novel ‘fact-based’ visual question answering (FVQA) task has been introduced recently along with a large set of curated facts.

Introduction
  • When answering questions about images, humans effortlessly combine the depicted situation with general knowledge that is available to them.
  • For algorithms, an effortless combination of general knowledge with observations remains challenging, despite significant work which aims to leverage these mechanisms for autonomous agents and virtual assistants.
  • A significant amount of research has investigated algorithms for visual question answering (VQA) [2, 19, 32, 43, 44, 63], visual question generation (VQG) [21, 30, 36, 49], and visual dialog [12,13,20], paving the way to autonomy for artificial agents operating in the real world.
Highlights
  • When answering questions about images, we combine the visualized situation with general knowledge that is available to us
  • A significant amount of research has investigated algorithms for visual question answering (VQA) [2, 19, 32, 43, 44, 63], visual question generation (VQG) [21, 30, 36, 49], and visual dialog [12,13,20], paving the way to autonomy for artificial agents operating in the real world
  • Using an ablation analysis we find improvements due to the Graph Convolution Network component, which exploits the graphical structure of the knowledge base and allows for sharing of information between possible answers, improving the explainability of our model
  • We develop a visual question answering algorithm based on graph convolutional nets which benefits from general knowledge encoded in the form of a knowledge base
  • We evaluate our method on the dataset released as part of the ‘fact-based’ visual question answering work, referred to as the FVQA dataset [50], whose accompanying knowledge base draws on three structured databases – DBpedia [3], ConceptNet [45], and WebChild [47]
  • We developed a method for ‘reasoning’ in factual visual question answering using graph convolution nets
Methods
  • The model applies a graph convolutional network over the graphical structure of the knowledge base, so that candidate answers (entities) share information and are scored jointly rather than being judged one fact at a time; a minimal sketch of the underlying graph-convolution step follows below.
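    The paper's code is not reproduced on this page. As a rough, hedged illustration of the kind of graph-convolution step such a model relies on, the sketch below applies the standard propagation rule of Kipf and Welling [25] to a toy entity graph; the graph, feature dimensions, weights, and the two-way scoring head are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gcn_layer(adjacency, node_feats, weight):
    """One graph-convolution layer (Kipf & Welling [25]):
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])          # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))  # normalized degrees
    propagated = d_inv_sqrt @ a_hat @ d_inv_sqrt @ node_feats @ weight
    return np.maximum(propagated, 0.0)                      # ReLU

# Toy entity graph: 4 candidate-answer entities, edges where entities share a fact.
adjacency = np.array([[0, 1, 1, 0],
                      [1, 0, 0, 1],
                      [1, 0, 0, 1],
                      [0, 1, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
node_feats = rng.normal(size=(4, 8))   # hypothetical per-entity features (image + question + entity)
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=(16, 2))          # 2 logits per entity: answer vs. not the answer

hidden = gcn_layer(adjacency, node_feats, w1)
scores = gcn_layer(adjacency, hidden, w2)   # per-entity scores after two rounds of sharing
print(scores.shape)                          # (4, 2)
```

    Because each layer mixes a node's features with those of its neighbors, connected candidate answers share information before any of them is scored, which is the behaviour the ablation in the Highlights credits for the accuracy gains.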
Results
  • The authors' best model ‘13’ outperforms the state-of-the-art STTF technique by more than 7% and the FVQA baseline without ensembling by over 12%.
Conclusion
  • The authors developed a method for ‘reasoning’ in factual visual question answering using graph convolution nets.
  • The authors showed that the proposed algorithm outperforms existing baselines by a large margin of 7%.
  • The authors attribute these improvements to ‘joint reasoning about answers,’ which facilitates sharing of information before making an informed decision.
  • The authors achieve this large increase in performance using only the ground-truth relation and answer information, with no reliance on the ground-truth fact.
  • The authors thank Arun Mallya and Aditya Deshpande for their help.
Tables
  • Table 1: Recall and downstream accuracy for different numbers of facts (see the recall sketch after this list)
  • Table 2: Answer accuracy on the FVQA dataset
  • Table 3: Contribution of the model's sub-components to the total top-1 error (30.65%)
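    Table 1 pairs fact-retrieval recall with downstream answer accuracy as the number of retained facts varies. The snippet below is a minimal sketch of how such a recall@k figure is typically computed; the fact ids, rankings, and helper name are made up for illustration and are not the paper's retrieval model.

```python
def recall_at_k(ranked_fact_ids, gt_fact_id, k):
    """1 if the ground-truth fact survives the top-k cut, else 0."""
    return int(gt_fact_id in ranked_fact_ids[:k])

# Hypothetical retrieval output (ranked facts, ground-truth fact) for three questions.
retrievals = [
    (["f12", "f7", "f3", "f99"], "f3"),
    (["f5", "f41", "f8", "f2"], "f2"),
    (["f1", "f6", "f77", "f20"], "f50"),   # ground-truth fact missed entirely
]

for k in (1, 3, 4):
    recall = sum(recall_at_k(ranked, gt, k) for ranked, gt in retrievals) / len(retrievals)
    print(f"recall@{k} = {recall:.2f}")    # recall rises as more facts are kept
```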
Related work
  • We develop a visual question answering algorithm based on graph convolutional nets which benefits from general knowledge encoded in the form of a knowledge base. We therefore briefly review existing work in the areas of visual question answering, fact-based visual question answering and graph convolutional networks.

    Visual Question Answering: Recently, there has been significant progress in creating large VQA datasets [2, 17, 23, 34, 41, 66] and deep network models that answer questions about an image. The initial VQA models [1, 2, 4, 11, 16, 17, 19, 24, 32, 33, 35, 41, 44, 53, 57, 59, 60, 65, 68] combined an LSTM encoding of the question with a CNN encoding of the image in a deep network that then predicted the answer. Results were further improved with attention-based multimodal networks [1, 11, 16, 32, 43, 44, 57, 59] and dynamic memory networks [22, 56]. All of these methods were tested on standard VQA datasets in which the questions can be answered solely by observing the image; no out-of-the-box thinking was required. In contrast, given an image of a cat and the question "Can the animal in the image be domesticated?", we want our method to combine features from the image with common-sense knowledge (a cat can be domesticated). This calls for a model that leverages external knowledge.
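    As a concrete picture of the generic pipeline described above, the following is a minimal sketch of an LSTM-plus-CNN VQA baseline: the question is encoded with an LSTM, pooled CNN image features are projected, and the two representations are fused and classified over a fixed answer vocabulary. The concatenation fusion, layer sizes, and class names are illustrative assumptions, not any specific cited model.

```python
import torch
import torch.nn as nn

class SimpleVQABaseline(nn.Module):
    """Generic early VQA baseline: LSTM question encoder + pooled CNN image
    features, fused by concatenation and fed to an answer classifier."""
    def __init__(self, vocab_size, num_answers, img_feat_dim=2048,
                 embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers))

    def forward(self, question_tokens, image_features):
        # question_tokens: (batch, seq_len) token ids
        # image_features: (batch, img_feat_dim) pooled CNN features
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q_enc = h_n[-1]                                    # final LSTM hidden state
        v_enc = torch.relu(self.img_proj(image_features))  # projected image features
        return self.classifier(torch.cat([q_enc, v_enc], dim=1))

# Toy forward pass with random inputs.
model = SimpleVQABaseline(vocab_size=1000, num_answers=500)
logits = model(torch.randint(0, 1000, (2, 10)), torch.randn(2, 2048))
print(logits.shape)  # torch.Size([2, 500])
```

    Such a model can only answer what is visible in the image; the fact-based setting discussed in this paper additionally requires retrieving and reasoning over external knowledge.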
Funding
  • Acknowledgments: This material is based upon work supported in part by the National Science Foundation under Grant No 1718221 and Grant No 1563727, Samsung, 3M, IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), Amazon Research Award, and AWS Machine Learning Research Award
Reference
  • [1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Deep compositional question answering with neural module networks. In CVPR, 2016.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In ICCV, 2015.
  • [3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In ISWC/ASWC, 2007.
  • [4] H. Ben-younes, R. Cadene, M. Cord, and N. Thome. MUTAN: Multimodal Tucker fusion for visual question answering. In ICCV, 2017.
  • [5] J. Berant, A. Chou, R. Frostig, and P. Liang. Semantic parsing on Freebase from question-answer pairs. In EMNLP, 2013.
  • [6] J. Berant and P. Liang. Semantic parsing via paraphrasing. In ACL, 2014.
  • [7] A. Bordes, S. Chopra, and J. Weston. Question answering with sub-graph embeddings. In EMNLP, 2014.
  • [8] A. Bordes, N. Usunier, S. Chopra, and J. Weston. Large-scale simple question answering with memory networks. In ICLR, 2015.
  • [9] A. Bordes, J. Weston, and N. Usunier. Open question answering with weakly supervised embedding models. In ECML, 2014.
  • [10] Q. Cai and A. Yates. Large-scale semantic parsing via schema matching and lexicon extension. In ACL, 2013.
  • [11] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? In EMNLP, 2016.
  • [12] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual Dialog. In CVPR, 2017.
  • [13] A. Das, S. Kottur, J. M. Moura, S. Lee, and D. Batra. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv:1703.06585, 2017.
  • [14] L. Dong, F. Wei, M. Zhou, and K. Xu. Question answering over Freebase with multi-column convolutional neural networks. In ACL, 2015.
  • [15] A. Fader, L. Zettlemoyer, and O. Etzioni. Open question answering over curated and extracted knowledge bases. In KDD, 2014.
  • [16] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
  • [17] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. In NeurIPS, 2015.
  • [18] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. IQA: Visual question answering in interactive environments. In CVPR, 2018.
  • [19] A. Jabri, A. Joulin, and L. van der Maaten. Revisiting visual question answering baselines. In ECCV, 2016.
  • [20] U. Jain, S. Lazebnik, and A. G. Schwing. Two can play this game: Visual dialog with discriminative question generation and answering. In CVPR, 2018.
  • [21] U. Jain, Z. Zhang, and A. G. Schwing. Creativity: Generating diverse questions using variational autoencoders. In CVPR, 2017.
  • [22] A. Jiang, F. Wang, F. Porikli, and Y. Li. Compositional memory for visual question answering. arXiv:1511.05676, 2015.
  • [23] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
  • [24] J.-H. Kim, S.-W. Lee, D.-H. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang. Multimodal residual learning for visual QA. In NeurIPS, 2016.
  • [25] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907, 2016.
  • [26] O. Kolomiyets and M.-F. Moens. A survey on question answering technology from an information retrieval perspective. Information Sciences, 2011.
  • [27] J. Krishnamurthy and T. Kollar. Jointly learning to parse and perceive: Connecting natural language to the physical world. In ACL, 2013.
  • [28] T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP, 2013.
  • [29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • [30] Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, and X. Wang. Visual question generation as dual task of visual question answering. arXiv:1709.07192, 2017.
  • [31] P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. Computational Linguistics, 2013.
  • [32] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NeurIPS, 2016.
  • [33] L. Ma, Z. Lu, and H. Li. Learning to answer questions from image using convolutional neural network. In AAAI, 2016.
  • [34] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NeurIPS, 2014.
  • [35] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, 2015.
  • [36] N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende. Generating natural questions about an image. arXiv:1603.06059, 2016.
  • [37] K. Narasimhan, A. Yala, and R. Barzilay. Improving information extraction by acquiring external evidence with reinforcement learning. In EMNLP, 2016.
  • [38] M. Narasimhan and A. G. Schwing. Straight to the facts: Learning knowledge base retrieval for factual visual question answering. In ECCV, 2018.
  • [39] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
  • [40] S. Reddy, O. Täckström, M. Collins, T. Kwiatkowski, D. Das, M. Steedman, and M. Lapata. Transforming dependency structures to logical forms for semantic parsing. In ACL, 2016.
  • [41] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In NeurIPS, 2015.
  • [42] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In ESWC, 2018.
  • [43] I. Schwartz, A. G. Schwing, and T. Hazan. High-order attention models for visual question answering. In NeurIPS, 2017.
  • [44] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In CVPR, 2016.
  • [45] R. Speer, J. Chin, and C. Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, 2017.
  • [46] W.-t. Yih, M.-W. Chang, X. He, and J. Gao. Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL-IJCNLP, 2015.
  • [47] N. Tandon, G. de Melo, F. Suchanek, and G. Weikum. WebChild: Harvesting and organizing commonsense knowledge from the web. In WSDM, 2014.
  • [48] C. Unger, L. Bühmann, J. Lehmann, A.-C. N. Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In WWW, 2012.
  • [49] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv:1610.02424, 2016.
  • [50] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel. FVQA: Fact-based visual question answering. TPAMI, 2018.
  • [51] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel. Explicit knowledge-based reasoning for visual question answering. In IJCAI, 2017.
  • [52] X. Wang, Y. Ye, and A. Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR, 2018.
  • [53] Q. Wu, C. Shen, A. van den Hengel, P. Wang, and A. Dick. Image captioning and visual question answering based on attributes and their related external knowledge. arXiv:1603.02814, 2016.
  • [54] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, 2016.
  • [55] C. Xiao, M. Dymetman, and C. Gardent. Sequence-based structured prediction for semantic parsing. In ACL, 2016.
  • [56] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016.
  • [57] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016.
  • [58] X. Yao and B. Van Durme. Information extraction over structured data: Question answering with Freebase. In ACL, 2014.
  • [59] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
  • [60] L. Yu, E. Park, A. Berg, and T. Berg. Visual Madlibs: Fill in the blank image generation and question answering. In ICCV, 2015.
  • [61] L. S. Zettlemoyer and M. Collins. Learning context-dependent mappings from sentences to logical form. In ACL, 2005.
  • [62] L. S. Zettlemoyer and M. Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI, 2005.
  • [63] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and yang: Balancing and answering binary visual questions. arXiv:1511.05099, 2015.
  • [64] Y. Zhang, K. Liu, S. He, G. Ji, Z. Liu, H. Wu, and J. Zhao. Question answering over knowledge base with neural attention combining global knowledge information. arXiv:1606.00979, 2016.
  • [65] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. arXiv:1512.02167, 2015.
  • [66] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In CVPR, 2016.
  • [67] Y. Zhu, C. Zhang, C. Ré, and L. Fei-Fei. Building a large-scale multimodal knowledge base for visual question answering. CoRR, 2015.
  • [68] C. L. Zitnick, A. Agrawal, S. Antol, M. Mitchell, D. Batra, and D. Parikh. Measuring machine intelligence through visual question answering. AI Magazine, 2016.