Learning Deep Features for Discriminative Localization
In this work, we revisit the global average pooling layer proposed in , and shed light on how it explicitly enables the convolutional neural network (CNN) to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation.
- Recent work by Zhou et al. has shown that the convolutional units of various layers of convolutional neural networks (CNNs) behave as object detectors, even though no supervision on the location of the objects was provided.
- Although the convolutional layers have this remarkable ability to localize objects, it is lost when fully-connected layers are used for classification.
- This tweaking allows identifying the discriminative image regions in a single forward pass.
- We find that in most cases there is a small performance drop of 1–2% when removing the additional layers from the various networks.
- In this work, we propose a general technique called Class Activation Mapping (CAM) for convolutional neural networks with global average pooling.
- Class activation maps allow us to visualize the predicted class scores on any given image, highlighting the discriminative object parts detected by the convolutional neural network.
- We evaluate our approach on weakly-supervised object localization on the ILSVRC benchmark, demonstrating that our global average pooling convolutional neural network can perform accurate object localization.
- Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014 without training on any bounding box annotations. We demonstrate in a variety of experiments that our network is able to localize the discriminative image regions despite being trained only to solve a classification task.
- We demonstrate that the class activation maps localization technique generalizes to other visual recognition tasks, i.e., our technique produces generic localizable deep features that can aid other researchers in understanding the basis of discrimination used by convolutional neural networks for their tasks.
- Methods compared on the CUB200 fine-grained classification task: GoogLeNet-GAP on full image, GoogLeNet-GAP on crop, GoogLeNet-GAP on BBox, Alignments, DPD, DeCAF+DPD, PANDA R-CNN.
- Given a set of images containing a common concept, the authors want to identify which regions the network recognizes as important and whether these correspond to the input pattern.
- The authors follow a similar approach as before: they train a linear SVM on fc7 features from AlexNet, ave pool features from GoogLeNet, and gap features from GoogLeNet-GAP.
- The authors first report results on object classification to demonstrate that the approach does not significantly hurt classification performance.
- The authors add two convolutional layers just before GAP resulting in the AlexNet*-GAP network.
- Note that it is important for the networks to perform well on classification in order to achieve high localization performance, as localization involves identifying both the object category and the bounding box location accurately.
- In this work the authors propose a general technique called Class Activation Mapping (CAM) for CNNs with global average pooling.
- This enables classification-trained CNNs to learn to perform object localization, without using any bounding box annotations.
- The authors demonstrate that the CAM localization technique generalizes to other visual recognition tasks, i.e., the technique produces generic localizable deep features that can aid other researchers in understanding the basis of discrimination used by CNNs for their tasks.
- Table 1: Classification error on the ILSVRC validation set.
- Table 2: Localization error on the ILSVRC validation set. Backprop refers to using [23] for localization instead of CAM.
- Table 3: Localization error on the ILSVRC test set for various weakly- and fully-supervised methods.
- Table 4: Fine-grained classification performance on the CUB200 dataset. GoogLeNet-GAP can successfully localize important image crops, boosting classification performance.
- Table 5: Classification accuracy on representative scene and object datasets for different deep features.
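The CAM computation behind these results is a simple weighted sum: the map for a class is obtained by weighting each feature map of the final convolutional layer by the corresponding classifier weight learned on top of global average pooling. A minimal NumPy sketch (array names and shapes here are illustrative, not taken from the authors' code):

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Compute a class activation map as the weighted sum of the last
    convolutional feature maps, using the linear-classifier weights
    for the chosen class.

    feature_maps: (K, H, W) activations of the final conv layer.
    fc_weights:   (num_classes, K) weights of the layer after GAP.
    class_idx:    index of the class to visualize.
    """
    w = fc_weights[class_idx]                    # (K,)
    cam = np.tensordot(w, feature_maps, axes=1)  # contract over K -> (H, W)
    return cam

# Toy example: 3 feature maps of size 4x4, 2 classes.
rng = np.random.default_rng(0)
F = rng.random((3, 4, 4))
W = rng.random((2, 3))
cam = class_activation_map(F, W, class_idx=0)
assert cam.shape == (4, 4)
# Matches the explicit per-channel sum from the paper's formulation.
assert np.allclose(cam, sum(W[0, k] * F[k] for k in range(3)))
```

Upsampling the resulting low-resolution map to the input image size yields the heatmaps visualized in the paper.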
- Convolutional Neural Networks (CNNs) have led to impressive performance on a variety of visual recognition tasks [10, 35, 8]. Recent work has shown that despite being trained on image-level labels, CNNs have the remarkable ability to localize objects [1, 16, 2, 15, 18]. In this work, we show that, using an appropriate architecture, we can generalize this ability beyond just localizing objects, to start identifying exactly which regions of an image are being used for discrimination. Here, we discuss the two lines of work most related to this paper: weakly-supervised object localization and visualizing the internal representation of CNNs.
Weakly-supervised object localization: There have been a number of recent works exploring weakly-supervised object localization using CNNs [1, 16, 2, 15]. Bergamo et al. propose a technique for self-taught object localization that masks out image regions to identify those causing the maximal activations, in order to localize objects. Cinbis et al. and Pinheiro et al. combine multiple-instance learning with CNN features to localize objects. Oquab et al. propose a method for transferring mid-level image representations and show that some object localization can be achieved by evaluating the output of CNNs on multiple overlapping patches; however, they do not actually evaluate the localization ability. While these approaches yield promising results, they are not trained end-to-end and require multiple forward passes of a network to localize objects, making them difficult to scale to real-world datasets. Our approach is trained end-to-end and can localize objects in a single forward pass.
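To produce the bounding boxes used in the localization evaluation, the paper thresholds the CAM at 20% of its maximum value and takes the box around the largest connected component. A simplified sketch that boxes all above-threshold pixels (skipping the connected-component step, so it is an approximation of the paper's procedure):

```python
import numpy as np

def cam_to_bbox(cam, frac=0.2):
    """Threshold a class activation map at `frac` of its maximum and
    return the tight bounding box (x1, y1, x2, y2) around the surviving
    region. Simplified: boxes all above-threshold pixels rather than the
    largest connected component used in the paper."""
    mask = cam >= frac * cam.max()
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy CAM with a single bright 2x2 blob at rows 2-3, cols 3-4.
cam = np.zeros((6, 6))
cam[2:4, 3:5] = 1.0
assert cam_to_bbox(cam) == (3, 2, 4, 3)
```

In practice the CAM is first upsampled to the input image resolution so the box coordinates live in image space.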
- This work was supported by NSF grant IIS-1524817, and by a Google faculty research award to A.T.
- [1] A. Bergamo, L. Bazzani, D. Anguelov, and L. Torresani. Self-taught object localization with deep networks. arXiv preprint arXiv:1409.3964, 2014.
- [2] R. G. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2015.
- [3] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. International Conference on Machine Learning, 2014.
- [4] A. Dosovitskiy and T. Brox. Inverting convolutional networks with convolutional networks. arXiv preprint arXiv:1506.02753, 2015.
- [5] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 2008.
- [6] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 2007.
- [7] E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars. Local alignments for fine-grained categorization. Int'l Journal of Computer Vision, 2014.
- [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Proc. CVPR, 2014.
- [9] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
- [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
- [11] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. Proc. CVPR, 2006.
- [12] L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. Proc. ICCV, 2007.
- [13] M. Lin, Q. Chen, and S. Yan. Network in network. International Conference on Learning Representations, 2014.
- [14] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. Proc. CVPR, 2015.
- [15] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. Proc. CVPR, 2014.
- [16] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. Proc. CVPR, 2015.
- [17] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. Proc. CVPR, 2012.
- [18] P. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. Proc. CVPR, 2015.
- [19] A. Quattoni and A. Torralba. Recognizing indoor scenes. Proc. CVPR, 2009.
- [20] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.
- [21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. Int'l Journal of Computer Vision, 2015.
- [22] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
- [23] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. International Conference on Learning Representations Workshop, 2014.
- [24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
- [25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
- [26] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. Proc. ICCV, 2011.
- [27] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical report, California Institute of Technology, 2010.
- [28] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. Proc. CVPR, 2010.
- [29] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei. Human action recognition by learning bases of action attributes and parts. Proc. ICCV, 2011.
- [30] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. Proc. ECCV, 2014.
- [31] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. Proc. ECCV, 2014.
- [32] N. Zhang, R. Farrell, F. Iandola, and T. Darrell. Deformable part descriptors for fine-grained recognition and attribute prediction. Proc. ICCV, 2013.
- [33] B. Zhou, V. Jagadeesh, and R. Piramuthu. ConceptLearner: Discovering visual concepts from weakly labeled image collections. Proc. CVPR, 2015.
- [34] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene CNNs. International Conference on Learning Representations, 2015.
- [35] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, 2014.
- [36] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.