Iterative Visual Reasoning Beyond Convolutions
CVPR, pp. 7239-7248, 2018.
Abstract:
We present a novel framework for iterative visual reasoning. Our framework goes beyond current recognition systems that lack the capability to reason beyond a stack of convolutions. The framework consists of two core modules: a local module that uses spatial memory [4] to store previous beliefs with parallel updates; and a global graph-reasoning module that reasons directly between regions and classes represented as nodes in a graph.
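To make the described roll-out concrete, below is a minimal toy sketch of the two-module loop, not the paper's implementation: all shapes, the stand-in update functions, the fixed attention weights, and the toy region graph are hypothetical, whereas the real modules are learned convolutional and graph operations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (all hypothetical): R regions, C classes, D feature dims.
R, C, D = 5, 10, 16
feats = rng.normal(size=(R, D))            # region features from the backbone
beliefs = rng.normal(size=(R, C))          # initial per-region class beliefs
W_local = rng.normal(size=(D, C)) * 0.1    # stand-in weights, not learned here
adj = np.full((R, R), 1.0 / R)             # toy row-normalized region graph

def local_update(b):
    # Stand-in for the spatial-memory module: refine each region's beliefs
    # from its own features (the real module writes beliefs into a 2-D
    # memory and convolves over it).
    return b + feats @ W_local

def global_update(b):
    # Stand-in for the graph module: average beliefs over neighboring
    # regions; the real module also reasons over class nodes.
    return adj @ b

# Roll out both modules iteratively, cross-feeding predictions to each other.
local_b, global_b = beliefs.copy(), beliefs.copy()
for _ in range(3):
    shared = 0.5 * (local_b + global_b)    # cross-feed the current estimates
    local_b, global_b = local_update(shared), global_update(shared)

# Combine the two streams with attention weights (learned in the real model).
attn = np.exp([1.0, 0.5])
attn /= attn.sum()
final = attn[0] * local_b + attn[1] * global_b
print(final.shape)                         # (5, 10): refined scores per region
```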
Introduction
- In recent years, the vision community has made significant advances in standard recognition tasks such as image classification [16], detection [37] or segmentation [3].
- Most of these gains are a result of using feed-forward end-to-end learned ConvNet models.
- An example of spatial-semantic reasoning: recognizing a “car” on the road should help in recognizing the “person” inside who is “driving” the “car”
Highlights
- In recent years, we have made significant advances in standard recognition tasks such as image classification [16], detection [37] or segmentation [3]
- Unlike humans, for whom visual reasoning about space and semantics is crucial [1], our current visual systems lack any context reasoning beyond convolutions with large receptive fields
- We present our key contribution: a global module that reasons directly between regions and classes represented as nodes in a graph (Sec. 3.2); a minimal sketch of this idea follows this list
- Satisfying all these constraints, ADE [55] and Visual Genome (VG) [25], where regions are densely labeled with an open vocabulary, are the main picks of our study
- Quantitative results on ADE test-1k and Visual Genome test are shown in Tab. 1
- We presented a novel framework for iterative visual reasoning
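As a rough illustration of the global module's graph reasoning (referenced in the highlights above), the sketch below passes messages from region nodes to class nodes over soft assignment edges, among class nodes over knowledge-graph edges, and back to regions. The shapes, random graphs, and single linear round are toy assumptions, not the paper's learned operators.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: R region nodes, C class nodes, D-dim node states.
R, C, D = 4, 6, 8
region_h = rng.normal(size=(R, D))   # region node states
class_h = rng.normal(size=(C, D))    # class node states (e.g. word embeddings)

# Assignment edges: soft weights linking each region to candidate classes,
# here taken from softmaxed classifier scores (a stand-in for the paper's
# assignment graph).
scores = rng.normal(size=(R, C))
assign = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Knowledge-graph edges between classes (e.g. semantic relationships),
# row-normalized; a random toy graph here.
kg = rng.random((C, C))
kg /= kg.sum(axis=1, keepdims=True)

# One round of message passing:
class_h = class_h + assign.T @ region_h   # regions -> classes (assignment edges)
class_h = kg @ class_h                    # class -> class (semantic edges)
region_h = region_h + assign @ class_h    # classes -> regions (back to the image)
print(region_h.shape)                     # (4, 8): updated region states
```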
Methods
- The authors evaluate the effectiveness of the framework, beginning with the experimental setup: the datasets (Sec. 4.1), the evaluation task (Sec. 4.2), and implementation details (Sec. 4.3).
- One benefit of using a knowledge graph is transfer across classes, so a dataset with a long-tail distribution is an ideal test-bed.
- Satisfying all these constraints, ADE [55] and Visual Genome (VG) [25], where regions are densely labeled with an open vocabulary, are the main picks of the study.
Results
- Quantitative results on ADE test-1k and VG test are shown in Tab. 1.
- To check whether the performance gain is merely a result of more parameters, the authors include a model ensemble as the third baseline, where the predictions of two separate baseline models are averaged (see the sketch after this list).
- The authors' reasoning modules perform much better than all the baselines on ADE.
- The local module alone can increase per-class AP by 7.8 absolute points.
- The global module alone is not as effective (a 4.4% absolute gain).
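To make the evaluation concrete, here is a minimal sketch of the ensemble baseline and a per-class AP computation in the spirit of Tab. 1. The data is random, the shapes are hypothetical, and the metric call uses scikit-learn's average_precision_score; this is not the paper's evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(2)

# Toy evaluation: N regions, C classes; labels and two models' scores
# (all values hypothetical).
N, C = 200, 5
y_true = rng.integers(0, 2, size=(N, C))   # binary per-class region labels
scores_a = rng.random((N, C))              # baseline model A
scores_b = rng.random((N, C))              # baseline model B

def per_class_ap(y, s):
    """Mean of AP computed independently for each class column."""
    return np.mean([average_precision_score(y[:, c], s[:, c]) for c in range(C)])

# The ensemble baseline simply averages the two models' predictions.
ensemble = 0.5 * (scores_a + scores_b)
print(per_class_ap(y_true, scores_a), per_class_ap(y_true, ensemble))
```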
Conclusion
- The authors presented a novel framework for iterative visual reasoning.
- It uses a graph to encode spatial and semantic relationships between regions and classes and passes messages on the graph.
- Analysis shows that the reasoning framework is resilient to missing regions caused by current region proposal approaches.
- XC would like to thank Shengyang Dai and the Google Cloud AI team for support during the internship.
Tables
- Table 1: Main results on ADE test-1k and VG test. AP is average precision, AC is classification accuracy. Superscripts show the improvement Δ over the baseline
- Table 2: Ablative analysis on ADE test-1k. In the first row of each block we repeat the Local, Global and Final results from Tab. 1
- Table 3: Results with missing regions when region proposals are used. COCO minival is used since it is more detection oriented. pre filters regions before inference, and post filters them after inference
Related work
- Visual Knowledge Base. Whereas the past five years in computer vision will probably be remembered as the successful resurgence of neural networks, acquiring visual knowledge at a large scale, the simplest form being labeled instances of objects [39, 30], scenes [55], relationships [25], etc., deserves at least half the credit, since ConvNets hinge on large datasets [44]. Apart from providing labels using crowd-sourcing, attempts have also been made to accumulate structured knowledge (e.g. relationships [5], n-grams [10]) automatically from the web. However, these works fixate on building knowledge bases rather than using knowledge for reasoning. Our framework, while being more general, is along the line of research that applies a visual knowledge base to end tasks, such as affordances [56], image classification [32], or question answering [49].
- Context Modeling. Modeling context, or the interplay between scenes, objects and parts, is one of the central problems in computer vision. While various previous works (e.g. scene-level reasoning [46], attributes [13, 36], structured prediction [24, 9, 47], relationship graphs [21, 31, 52]) have approached this problem from different angles, the breakthrough came from the idea of feature learning with ConvNets [16]. On the surface, such models hardly use any explicit context module for reasoning, but it is generally accepted that ConvNets are extremely effective in aggregating local, pixel-level context through their ever-growing receptive fields [54]. Even the most recent developments, such as top-down modules [50, 29, 43], pairwise modules [40], iterative feedback [48, 34, 2], attention [53], and memory [51, 4], are motivated to leverage such power and depend on variants of convolutions for reasoning. Our work takes an important next step beyond those approaches in that it also incorporates learning from structured visual knowledge bases directly, to reason with spatial and semantic relationships.
- Relational Reasoning. The earliest form of reasoning in artificial intelligence dates back to symbolic approaches [33], where relations between abstract symbols are defined by the language of mathematics and logic, and reasoning takes place by deduction, abduction [18], etc. However, symbols need to be grounded [15] before such systems are practically useful. Modern approaches, such as the path ranking algorithm [26], rely on statistical learning to extract useful patterns and perform relational reasoning on structured knowledge bases. As an active research area, recent works have also applied neural networks to graph-structured data [42, 17, 27, 23, 35, 7, 32] (a minimal sketch of one such layer follows this section), or attempted to regularize the output of networks with relationships [8] and knowledge bases [20]. However, we believe that for visual data, reasoning should be both local and global: discarding the two-dimensional image structure is neither efficient nor effective for tasks that involve regions.
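For concreteness, the sketch below implements a single graph-convolution layer in the spirit of Kipf and Welling [23], one representative of the graph-structured networks cited above. The toy adjacency, feature sizes, and weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# One graph-convolution layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
N, D_in, D_out = 6, 8, 4
A = (rng.random((N, N)) > 0.5).astype(float)  # toy adjacency matrix
A = np.maximum(A, A.T)                        # make it symmetric
A_hat = A + np.eye(N)                         # add self-loops
deg = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt      # symmetric normalization

H = rng.normal(size=(N, D_in))                # node features
W = rng.normal(size=(D_in, D_out)) * 0.1      # layer weights

H_next = np.maximum(A_norm @ H @ W, 0.0)      # propagate neighbors + ReLU
print(H_next.shape)                           # (6, 4): updated node features
```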
Funding
- Acknowledgements: This work was supported in part by ONR MURI N000141612007
References
- I. Biederman, R. J. Mezzanotte, and J. C. Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive psychology, 14(2):143–177, 1982. 1
- J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In CVPR, 2016. 2
- L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016. 1, 2, 5
- X. Chen and A. Gupta. Spatial memory for context reasoning in object detection. arXiv preprint arXiv:1704.04224, 2017. 1, 2, 3, 6
- X. Chen, A. Shrivastava, and A. Gupta. Neil: Extracting visual knowledge from web data. In ICCV, 2013. 2, 4
- J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014. 3
- R. Das, A. Neelakantan, D. Belanger, and A. McCallum. Chains of reasoning over entities, relations, and text using recurrent neural networks. arXiv preprint arXiv:1607.01426, 2016. 2
- J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In ECCV, 2014. 2
- C. Desai, D. Ramanan, and C. C. Fowlkes. Discriminative models for multi-class object layout. IJCV, 95(1):1–12, 2011. 2
- S. K. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014. 2
- S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert. An empirical study of context in object detection. In CVPR, 2009. 2
- M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010. 4, 6
- A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009. 2
- L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. TPAMI, 28(4):594–611, 2006. 2
- S. Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990. 2
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 1, 2, 4, 6
- M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015. 2
- J. R. Hobbs, M. Stickel, P. Martin, and D. Edwards. Interpretation as abduction. In ACL, 1988. 2
- D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. ECCV, 2012. 6
- Z. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing. Deep neural networks with massive learned knowledge. In EMNLP, 2016. 2
- J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015. 2
- A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, and T. Mikolov. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016. 4, 6
- T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. 2
- P. Krahenbuhl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011. 2
- R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv:1602.07332, 2016. 2, 5
- N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, 2011. 2
- Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015. 2
- D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the Association for Information Science and Technology, 58(7):1019–1031, 2007. 6
- T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv:1612.03144, 2016. 2, 7
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 2, 6
- C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016. 2
- K. Marino, R. Salakhutdinov, and A. Gupta. The more you know: Using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844, 2016. 1, 2, 4
- A. Newell. Physical symbol systems. Cognitive science, 4(2):135–183, 1980. 2
- A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016. 2
- M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In ICML, 2016. 2, 4
- D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011. 2
- S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv:1506.01497, 2015. 1, 6, 8
- O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 1
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015. 2, 4, 5, 6
- A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. arXiv preprint arXiv:1706.01427, 2017. 2
- B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Itembased collaborative filtering recommendation algorithms. In WWW, 2001. 6
- F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. TNN, 20(1):61–80, 2009. 2, 4
- A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond Skip Connections: Top-Down Modulation for Object Detection. arXiv:1612.06851, 2016. 2
- C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In ICCV, 2017. 2
- A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, 2011. 5
- A. Torralba, K. P. Murphy, W. T. Freeman, M. A. Rubin, et al. Context-based vision system for place and object recognition. In ICCV, 2003. 2
- Z. Tu and X. Bai. Auto-context and its application to highlevel vision tasks and 3d brain image segmentation. TPAMI, 32(10):1744–1757, 2010. 2
- S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016. 1, 2
- Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, 2016. 2
- S. Xie, X. Huang, and Z. Tu. Top-down learning for structured labeling with convolutional pseudoprior. In ECCV, 2016. 2
- C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. arXiv preprint arXiv:1603.01417, 2016. 1, 2
- D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. arXiv preprint arXiv:1701.02426, 2017. 2
- Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016. 2
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014. 2
- B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ade20k dataset. arXiv preprint arXiv:1608.05442, 2016. 1, 2, 5
- Y. Zhu, C. Zhang, C. Re, and L. Fei-Fei. Building a largescale multimodal knowledge base system for answering visual queries. arXiv:1507.05670, 2015. 2