Iterative Visual Reasoning Beyond Convolutions

CVPR, pp. 7239-7248, 2018.

DOI: https://doi.org/10.1109/cvpr.2018.00756

Abstract:

We present a novel framework for iterative visual reasoning. Our framework goes beyond current recognition systems that lack the capability to reason beyond a stack of convolutions. The framework consists of two core modules: a local module that uses spatial memory [4] to store previous beliefs with parallel updates, and a global graph-reasoning module that reasons directly between regions and classes represented as nodes in a graph.
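For intuition, below is a minimal NumPy sketch of the iterative two-module loop described in the abstract. It is not the authors' implementation: the update rules, the random weight matrices, and the sigmoid attention gate are simplified placeholders for the paper's spatial-memory and graph-reasoning modules.

```python
import numpy as np

def iterative_reasoning(region_feats, adjacency, num_iters=2, seed=0):
    """Cross-feed a local estimate and a global graph-reasoning estimate
    over several iterations, then combine them with a per-region
    attention weight. All weights here are random placeholders."""
    rng = np.random.default_rng(seed)
    n, d = region_feats.shape
    w_local = rng.normal(scale=0.1, size=(d, d))   # stand-in for the spatial-memory update
    w_global = rng.normal(scale=0.1, size=(d, d))  # stand-in for the graph-reasoning update
    local = region_feats.copy()
    global_ = region_feats.copy()
    for _ in range(num_iters):
        # Local module: refresh beliefs using both streams' previous estimates.
        local = np.tanh((local + global_) @ w_local)
        # Global module: propagate local beliefs over the (row-normalized) graph.
        deg = adjacency.sum(axis=1, keepdims=True).clip(min=1.0)
        global_ = np.tanh((adjacency / deg) @ local @ w_global)
    # Attention: sigmoid gate deciding how much each region trusts each stream.
    gate = 1.0 / (1.0 + np.exp(-(local - global_).mean(axis=1, keepdims=True)))
    return gate * local + (1.0 - gate) * global_

feats = np.random.default_rng(1).normal(size=(5, 8))          # 5 regions, 8-d features
adj = (np.random.default_rng(2).random((5, 5)) > 0.5) * 1.0   # toy region graph
print(iterative_reasoning(feats, adj).shape)                  # -> (5, 8)
```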

Introduction
  • In recent years, the field has made significant advances in standard recognition tasks such as image classification [16], detection [37], and segmentation [3].
  • Most of these gains are a result of using feed-forward end-to-end learned ConvNet models.
  • An example of spatial-semantic reasoning: recognizing a “car” on the road should help in recognizing the “person” inside who is “driving” the “car”
Highlights
  • In recent years, we have made significant advances in standard recognition tasks such as image classification [16], detection [37] or segmentation [3]
  • Unlike humans, for whom visual reasoning about space and semantics is crucial [1], our current visual systems lack any context reasoning beyond convolutions with large receptive fields
  • We present our key contribution – a global module that reasons directly between regions and classes represented as nodes in a graph (Sec. 3.2); a toy construction of these graphs is sketched after this list
  • Satisfying all these constraints, ADE [55] and Visual Genome (VG) [25], where regions are densely labeled with an open vocabulary, are the main picks of our study
  • Quantitative results on ADE test-1k and Visual Genome test are shown in Tab. 1
  • We presented a novel framework for iterative visual reasoning
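As referenced in the highlights above, a toy construction of the three graphs (a knowledge graph over classes, a region graph with spatial edges, and a region-to-class assignment) might look as follows. The boxes, class links, and the IoU-overlap edge rule are illustrative assumptions, not the paper's exact edge types.

```python
import numpy as np

# Hypothetical toy inputs: 4 region boxes (x1, y1, x2, y2) and 3 classes.
boxes = np.array([[0, 0, 10, 10], [8, 0, 18, 10],
                  [30, 30, 40, 40], [0, 0, 40, 40]], dtype=float)
num_classes = 3
semantic_edges = [(0, 1), (1, 2)]  # class-class links from a knowledge base

# Knowledge graph: classes as nodes, semantic relationships as (symmetric) edges.
kg = np.zeros((num_classes, num_classes))
for a, b in semantic_edges:
    kg[a, b] = kg[b, a] = 1.0

# Region graph: connect regions whose boxes overlap (one simple spatial relation).
def iou(b1, b2):
    x1, y1 = np.maximum(b1[:2], b2[:2])
    x2, y2 = np.minimum(b1[2:], b2[2:])
    inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(b1) + area(b2) - inter)

n = len(boxes)
rg = np.array([[float(i != j and iou(boxes[i], boxes[j]) > 0)
                for j in range(n)] for i in range(n)])

# Assignment graph: soft region-to-class links; uniform placeholder here,
# in practice these would come from the current classification scores.
assign = np.full((n, num_classes), 1.0 / num_classes)
print(kg.shape, rg.shape, assign.shape)  # -> (3, 3) (4, 4) (4, 3)
```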
Methods
  • The authors evaluate the effectiveness of the framework, beginning with the experimental setup: the datasets used (Sec. 4.1), the task evaluated on (Sec. 4.2), and details of the implementation (Sec. 4.3).
  • One benefit of using a knowledge graph is transfer across classes, so a dataset with a long-tail distribution is an ideal test-bed.
  • Satisfying all these constraints, ADE [55] and Visual Genome (VG) [25], where regions are densely labeled with an open vocabulary, are the main picks of the study
Results
  • Quantitative results on ADE test-1k and VG test are shown in Tab. 1.
  • To check whether the performance gain is merely a result of more parameters, the authors include a model ensemble as a third baseline, where the predictions of two separately trained baseline models are averaged (a minimal sketch of this averaging follows this list).
  • The authors' reasoning modules perform much better than all the baselines on ADE.
  • The local module alone increases per-class AP by 7.8 absolute points.
  • The global module alone is not as effective (4.4%).
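As referenced above, the ensemble baseline simply averages the class predictions of two separately trained models. A minimal sketch with hypothetical scores (not the paper's models):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical class scores from two separately trained baseline models,
# for 4 regions over 5 classes.
rng = np.random.default_rng(0)
logits_a = rng.normal(size=(4, 5))
logits_b = rng.normal(size=(4, 5))

# Ensemble baseline: average the two models' per-region predictions.
ensemble = (softmax(logits_a) + softmax(logits_b)) / 2.0
print(ensemble.argmax(axis=1))  # ensembled class decision per region
```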
Conclusion
  • The authors presented a novel framework for iterative visual reasoning.
  • It uses a graph to encode spatial and semantic relationships between regions and classes and passes messages on the graph (a generic one-step propagation sketch follows this list).
  • Analysis shows that the reasoning framework is resilient to missing regions caused by current region proposal approaches.
  • XC would like to thank Shengyang Dai and the Google Cloud AI team for their support during his internship
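A generic one-step sketch of message passing on such a graph, in the spirit of graph convolution [23]; the mean aggregation and random weight matrix are illustrative assumptions, not the authors' exact propagation rule.

```python
import numpy as np

def propagate(node_feats, adjacency, seed=0):
    """One message-passing step: each node averages its neighbors'
    features (with a self-loop) and applies a transform. The weight
    matrix is a random placeholder, not a learned parameter."""
    a = adjacency + np.eye(len(adjacency))   # add self-loops
    a = a / a.sum(axis=1, keepdims=True)     # row-normalize (mean aggregation)
    d = node_feats.shape[1]
    w = np.random.default_rng(seed).normal(scale=0.1, size=(d, d))
    return np.tanh(a @ node_feats @ w)       # aggregate, transform, squash

# Nodes: regions and classes stacked together; edges mix spatial
# (region-region), semantic (class-class) and assignment (region-class) links.
feats = np.random.default_rng(1).normal(size=(6, 4))
adj = (np.random.default_rng(2).random((6, 6)) > 0.6) * 1.0
print(propagate(feats, adj).shape)  # -> (6, 4)
```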
Tables
  • Table 1: Main results on ADE test-1k and VG test. AP is average precision, AC is classification accuracy. Superscripts show the improvement ∆ over the baseline
  • Table 2: Ablative analysis on ADE test-1k. In the first row of each block we repeat the Local, Global and Final results from Tab. 1
  • Table 3: Results with missing regions when region proposals are used. COCO minival is used since it is more detection oriented. “pre” filters regions before inference, and “post” filters after inference
Related work
  • Visual Knowledge Base. Whereas the past five years in computer vision will probably be remembered as the successful resurgence of neural networks, acquiring visual knowledge at a large scale – the simplest form being labeled instances of objects [39, 30], scenes [55], relationships [25], etc. – deserves at least half the credit, since ConvNets hinge on large datasets [44]. Apart from providing labels via crowd-sourcing, attempts have also been made to accumulate structured knowledge (e.g. relationships [5], n-grams [10]) automatically from the web. However, these works fixate on building knowledge bases rather than using knowledge for reasoning. Our framework, while being more general, is along the line of research that applies a visual knowledge base to end tasks, such as affordances [56], image classification [32], or question answering [49].
  • Context Modeling. Modeling context, or the interplay between scenes, objects and parts, is one of the central problems in computer vision. While various previous works (e.g. scene-level reasoning [46], attributes [13, 36], structured prediction [24, 9, 47], relationship graphs [21, 31, 52]) have approached this problem from different angles, the breakthrough came from the idea of feature learning with ConvNets [16]. On the surface, such models hardly use any explicit context module for reasoning, but it is generally accepted that ConvNets are extremely effective at aggregating local pixel-level context through their ever-growing receptive fields [54]. Even the most recent developments, such as top-down modules [50, 29, 43], pairwise modules [40], iterative feedback [48, 34, 2], attention [53], and memory [51, 4], are motivated to leverage such power and depend on variants of convolutions for reasoning. Our work takes an important next step beyond those approaches in that it also incorporates learning from structured visual knowledge bases directly, to reason with spatial and semantic relationships.
  • Relational Reasoning. The earliest form of reasoning in artificial intelligence dates back to symbolic approaches [33], where relations between abstract symbols are defined by the language of mathematics and logic, and reasoning takes place by deduction, abduction [18], etc. However, symbols need to be grounded [15] before such systems are practically useful. Modern approaches, such as the path ranking algorithm [26], rely on statistical learning to extract useful patterns to perform relational reasoning on structured knowledge bases. As an active research area, recent works also apply neural networks to graph-structured data [42, 17, 27, 23, 35, 7, 32], or attempt to regularize the output of networks with relationships [8] and knowledge bases [20]. However, we believe that for visual data, reasoning should be both local and global: discarding the two-dimensional image structure is neither efficient nor effective for tasks that involve regions.
Funding
  • Acknowledgements: This work was supported in part by ONR MURI N000141612007
Reference
  • I. Biederman, R. J. Mezzanotte, and J. C. Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 14(2):143–177, 1982. 1
  • J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In CVPR, 2016. 2
  • L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016. 1, 2, 5
  • X. Chen and A. Gupta. Spatial memory for context reasoning in object detection. arXiv preprint arXiv:1704.04224, 2017. 1, 2, 3, 6
  • X. Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from web data. In ICCV, 2013. 2, 4
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014. 3
  • R. Das, A. Neelakantan, D. Belanger, and A. McCallum. Chains of reasoning over entities, relations, and text using recurrent neural networks. arXiv preprint arXiv:1607.01426, 2016. 2
  • J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In ECCV, 2014. 2
  • C. Desai, D. Ramanan, and C. C. Fowlkes. Discriminative models for multi-class object layout. IJCV, 95(1):1–12, 2011. 2
  • S. K. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014. 2
  • S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert. An empirical study of context in object detection. In CVPR, 2009. 2
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010. 4, 6
  • A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009. 2
  • L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. TPAMI, 28(4):594–611, 2006. 2
  • S. Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990. 2
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 1, 2, 4, 6
  • M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015. 2
  • J. R. Hobbs, M. Stickel, P. Martin, and D. Edwards. Interpretation as abduction. In ACL, 1988. 2
  • D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In ECCV, 2012. 6
  • Z. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing. Deep neural networks with massive learned knowledge. In EMNLP, 2016. 2
  • J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015. 2
  • A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016. 4, 6
  • T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. 2
  • P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011. 2
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016. 2, 5
  • N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, 2011. 2
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015. 2
  • D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the Association for Information Science and Technology, 58(7):1019–1031, 2007. 6
  • T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144, 2016. 2, 7
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 2, 6
  • C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016. 2
  • K. Marino, R. Salakhutdinov, and A. Gupta. The more you know: Using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844, 2016. 1, 2, 4
  • A. Newell. Physical symbol systems. Cognitive Science, 4(2):135–183, 1980. 2
  • A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016. 2
  • M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In ICML, 2016. 2, 4
  • D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011. 2
  • S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015. 1, 6, 8
  • O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 1
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015. 2, 4, 5, 6
  • A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. arXiv preprint arXiv:1706.01427, 2017. 2
  • B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, 2001. 6
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. TNN, 20(1):61–80, 2009. 2, 4
  • A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016. 2
  • C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017. 2
  • A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, 2011. 5
  • A. Torralba, K. P. Murphy, W. T. Freeman, M. A. Rubin, et al. Context-based vision system for place and object recognition. In ICCV, 2003. 2
  • Z. Tu and X. Bai. Auto-context and its application to high-level vision tasks and 3D brain image segmentation. TPAMI, 32(10):1744–1757, 2010. 2
  • S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016. 1, 2
  • Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, 2016. 2
  • S. Xie, X. Huang, and Z. Tu. Top-down learning for structured labeling with convolutional pseudoprior. In ECCV, 2016. 2
  • C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. arXiv preprint arXiv:1603.01417, 2016. 1, 2
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. arXiv preprint arXiv:1701.02426, 2017. 2
  • Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016. 2
  • M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014. 2
  • B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442, 2016. 1, 2, 5
  • Y. Zhu, C. Zhang, C. Ré, and L. Fei-Fei. Building a large-scale multimodal knowledge base system for answering visual queries. arXiv preprint arXiv:1507.05670, 2015. 2