Visual Concept Reasoning Networks

AAAI, pp.8172-8180, (2021)

Abstract

A split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks. It approximates sparsely connected networks by explicitly defining multiple branches to simultaneously learn representations with different visual concepts or properties. Dependencies or in...

Introduction
  • Convolutional neural networks have shown notable success in visual recognition tasks by learning hierarchical representations.
  • It is known that the effective receptive field covers only a fraction of its theoretical size [17].
  • This limits the ability of convolutional neural networks to capture the global context based on long-range dependencies.
  • Most convolutional neural networks are characterized by dense and local operations that take advantage of the weight-sharing property.
  • They typically lack an internal mechanism for high-level reasoning over abstract semantic concepts, such as those humans manipulate with natural language, as motivated by modern theories of consciousness [3].
  • Such reasoning is related to system 2 cognitive abilities, such as reasoning, planning, and imagination, which are assumed to capture the global context from interactions between a few abstract factors and to give feedback to the local descriptors for decision-making.
Highlights
  • Convolutional neural networks have shown notable success in visual recognition tasks by learning hierarchical representations.
  • They typically lack an internal mechanism for high-level reasoning over abstract semantic concepts, such as those humans manipulate with natural language, as motivated by modern theories of consciousness [3].
  • We propose Visual Concept Reasoning Networks (VCRNet) that efficiently capture the global context by reasoning over high-level visual concepts.
  • Interestingly, the results in Table 1 show that our models consistently perform better than the other baseline networks, except for the network with the global reasoning module (GloRe), regardless of the type of concept sampler.
  • Our proposed model fits naturally into a modularized multi-branch architecture with split-transform-attend-interact-modulate-merge stages.
  • The experimental results show that it consistently outperforms other baseline models on multiple visual recognition tasks while increasing the number of parameters by less than 1%.
Methods
  • The authors introduce the proposed model, Visual Concept Reasoning Networks (VCRNet), and describe the overall architecture and its components in detail.
  • The proposed model is designed to reason over high-level visual concepts and to modulate the feature maps based on the reasoning result.
  • The authors take advantage of the ResNeXt [28] residual block, which operates via grouped convolutions.
  • This block follows a split-transform-merge strategy and has a highly modularized multi-branch architecture.
  • It has an additional dimension, "cardinality", that defines the number of branches used in the block.
  • The authors use this block by regarding each branch as a network learning the representation of a specific visual concept, and refer to the cardinality as the number of visual concepts C (a minimal sketch of such a block follows this list).
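
The page does not reproduce the block in detail, so the following is only a minimal PyTorch sketch of what a split-transform-attend-interact-modulate-merge block could look like. The split/transform/merge parts follow the ResNeXt grouped-convolution pattern described above; the attend, interact, and modulate parts (attention pooling per branch, a fully connected interaction across concept states, and channel-wise gating), along with the names `VCRBlock` and `concept_dim`, are hypothetical placeholders rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VCRBlock(nn.Module):
    """Hypothetical split-transform-attend-interact-modulate-merge block (sketch).

    Split/transform/merge follow the ResNeXt residual block (grouped 3x3
    convolution with cardinality C = number of visual concepts); the
    attend/interact/modulate stages are illustrative stand-ins for VCRNet's
    concept reasoning, not the published design.
    """

    def __init__(self, channels, cardinality=32, concept_dim=16):
        super().__init__()
        assert channels % cardinality == 0
        self.C = cardinality
        self.d = channels // cardinality             # channels per concept branch
        self.reduce = nn.Conv2d(channels, channels, 1, bias=False)
        self.transform = nn.Conv2d(channels, channels, 3, padding=1,
                                   groups=cardinality, bias=False)  # split + transform
        self.expand = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)
        self.bn3 = nn.BatchNorm2d(channels)
        self.attend = nn.Conv2d(self.d, 1, 1)        # shared per-branch spatial attention
        self.to_state = nn.Linear(self.d, concept_dim)
        self.interact = nn.Linear(cardinality * concept_dim,
                                  cardinality * concept_dim)        # concept interaction
        self.to_gate = nn.Linear(concept_dim, self.d)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.transform(out)))            # (B, C*d, H, W)
        B, _, H, W = out.shape
        branches = out.view(B * self.C, self.d, H, W)          # one tensor per concept

        # Attend: pool each branch with a learned spatial attention map.
        attn = torch.softmax(self.attend(branches).flatten(1), dim=1)      # (B*C, H*W)
        state = torch.bmm(branches.flatten(2), attn.unsqueeze(2)).squeeze(2)
        state = self.to_state(state).view(B, self.C, -1)                   # (B, C, concept_dim)

        # Interact: let the C concept states exchange information.
        state = self.interact(state.flatten(1)).view(B, self.C, -1)

        # Modulate: re-scale each branch's channels from its concept state.
        gate = torch.sigmoid(self.to_gate(state)).view(B * self.C, self.d, 1, 1)
        out = (branches * gate).view(B, self.C * self.d, H, W)

        # Merge: project back and add the residual connection.
        return F.relu(x + self.bn3(self.expand(out)))
```

With `channels=256` and `cardinality=32`, each concept branch holds 8 channels, and `VCRBlock(256)(torch.randn(2, 256, 14, 14))` returns a tensor of the same shape as its input.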
Results
  • Extensive experiments on visual recognition tasks such as image classification, semantic segmentation, object detection, scene recognition, and action recognition show that the proposed model, VCRNet, consistently improves performance while increasing the number of parameters by less than 1%.
  • The authors show that the proposed model improves performance more than other models, while increasing the number of parameters by less than 1%, on multiple visual recognition tasks.
  • The experimental results show that it consistently outperforms other baseline models on multiple visual recognition tasks while increasing the number of parameters by less than 1%.
Conclusion
  • The authors propose Visual Concept Reasoning Networks (VCRNet) that efficiently capture the global context by reasoning over high-level visual concepts.
  • The authors' proposed model fits naturally into a modularized multi-branch architecture with split-transform-attend-interact-modulate-merge stages.
  • The experimental results show that it consistently outperforms other baseline models on multiple visual recognition tasks while increasing the number of parameters by less than 1%.
  • The authors strongly believe that research in this direction will provide notable improvements on more difficult visual recognition tasks in the future.
  • The authors plan to remove dense interactions between branches as much as possible to encourage more specialized concept-wise representation learning and to improve the reasoning process.
  • The authors expect to have consistent visual concepts that are shared and updated over all stages in the network.
Tables
  • Table1: Results of image classification on ImageNet validation set
  • Table2: Results of object detection and instance segmentation on COCO 2017 validation set
  • Table3: Results of scene recognition on Places-365 validation set
  • Table4: Results of action recognition on Kinetics-400 validation set
  • Table5: Ablation study on VCRNet. (a) compares the performance of the concept-sampler variants on the ImageNet image classification task. The attention-based approach with dynamic queries (dynamic attn) outperforms the others, presumably because it has more adaptive power. Furthermore, the results show that our models consistently perform better than the other baseline networks in Table 1, except for the network with GloRe, regardless of the type of concept sampler.
  • Table6: The effectiveness of BN on image classification
  • Table7: The effectiveness of BN on object detection and instance segmentation
Related work
  • Multi-branch architectures are carefully designed with multiple branches characterized by different dense operations, and split-transform-merge stages are used as the building blocks. The Inception models [21] are among the successful multi-branch architectures; they define branches at different scales to handle multi-scale structure. ResNeXt [28] is a variant of ResNet [10] whose residual blocks contain multiple branches with the same topology, implemented efficiently by grouped convolutions. In this work, we utilize this residual block and associate each of its branches with a visual concept.
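
As a concrete note on the grouped-convolution implementation mentioned above, a grouped convolution with C groups computes the same output as C independent branches with the same topology whose results are concatenated. The following small PyTorch check (with arbitrary sizes) illustrates this equivalence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C, d_in, d_out = 4, 8, 16          # 4 branches, each mapping 8 -> 16 channels
x = torch.randn(1, C * d_in, 14, 14)

# One grouped convolution over all branches at once.
grouped = nn.Conv2d(C * d_in, C * d_out, 3, padding=1, groups=C, bias=False)

# C separate branches with the same topology, copying the grouped conv's weights.
branches = [nn.Conv2d(d_in, d_out, 3, padding=1, bias=False) for _ in range(C)]
with torch.no_grad():
    for i, conv in enumerate(branches):
        conv.weight.copy_(grouped.weight[i * d_out:(i + 1) * d_out])

# Split the input per branch, transform, and merge by concatenation.
split_merge = torch.cat(
    [conv(xi) for conv, xi in zip(branches, x.split(d_in, dim=1))], dim=1)

print(torch.allclose(grouped(x), split_merge, atol=1e-6))  # expected: True
```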

    There have been several works that adaptively modulate the feature maps based on the external context or the global context of the input data. Squeeze-and-Excitation networks (SE) [12] use a gating mechanism to perform channel-wise re-scaling according to channel dependencies derived from the global context. Gather-Excite networks (GE) [11] additionally re-scale locally, so that the global context can be finely redistributed to the local descriptors. The convolutional block attention module (CBAM) [26] applies channel-wise and spatial-wise gating networks independently and sequentially to modulate the feature maps. All these approaches extract the global context by global average pooling, which attends to all local positions equally. Dynamic layer normalization (DLN) [14] and Feature-wise Linear Modulation (FiLM) [18] present methods of feature modulation on normalization layers by conditioning on the global context and the external context, respectively.
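
To make the gating mechanism concrete, a minimal PyTorch sketch of SE-style [12] channel re-scaling is given below; the module name `SEGate` and the reduction ratio of 16 are illustrative choices, not details taken from this page.

```python
import torch
import torch.nn as nn


class SEGate(nn.Module):
    """Minimal squeeze-and-excitation style channel gating (illustrative sketch)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        # Squeeze: global average pooling attends to all positions equally.
        context = x.mean(dim=(2, 3))                          # (B, C)
        # Excite: channel-wise gates conditioned on the global context.
        gate = self.fc(context).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return x * gate                                       # re-scale the feature map
```

GE [11] and CBAM [26] refine this idea by redistributing the context spatially, while DLN [14] and FiLM [18] instead modulate features on normalization layers, conditioned on the global or external context.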
Funding
  • Extensive experiments on visual recognition tasks such as image classification, semantic segmentation, object detection, scene recognition, and action recognition show that our proposed model, VCRNet, consistently improves performance while increasing the number of parameters by less than 1%.
  • We show that our proposed model improves performance more than other models, while increasing the number of parameters by less than 1%, on multiple visual recognition tasks.
  • We insert Dropout [20] layers in residual blocks with p = 0.02 to mitigate over-fitting.
  • The experimental results show that it consistently outperforms other baseline models on multiple visual recognition tasks while increasing the number of parameters by less than 1%.
References
  • [1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [2] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le. Attention augmented convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [3] Y. Bengio. The consciousness prior. CoRR, abs/1709.08568, 2017.
  • [4] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 1971–1980, 2019.
  • [5] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017.
  • [6] D. M. Chan, R. Rao, F. Huang, and J. F. Canny. GPU accelerated t-distributed stochastic neighbor embedding. Journal of Parallel and Distributed Computing, 131:1–13, 2019.
  • [7] Y. Chen, M. Rohrbach, Z. Yan, Y. Shuicheng, J. Feng, and Y. Kalantidis. Graph-based global reasoning networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 433–442, 2019.
  • [8] C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast networks for video recognition. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6201–6210, 2019.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [11] J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9401–9411. Curran Associates, Inc., 2018.
  • [12] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [13] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The Kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  • [14] T. Kim, I. Song, and Y. Bengio. Dynamic layer normalization for adaptive neural acoustic modeling in speech recognition. In F. Lacerda, editor, Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, pages 2411–2415. ISCA, 2017.
  • [15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, European Conference on Computer Vision (ECCV), 2014.
  • [16] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.
  • [17] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective receptive field in deep convolutional neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4898–4906. Curran Associates, Inc., 2016.
  • [18] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.
  • [19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 2015.
  • [20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.
  • [21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
  • [22] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [23] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
  • [25] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [26] S. Woo, J. Park, J. Lee, and I. S. Kweon. CBAM: Convolutional block attention module. In European Conference on Computer Vision (ECCV), 2018.
  • [27] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
  • [28] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [29] S. Zhang, X. He, and S. Yan. LatentGNN: Learning efficient non-local relations for visual recognition. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7374–7383, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • [30] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
  • [31] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.