The Application Of Two-Level Attention Models In Deep Convolutional Neural Network For Fine-Grained Image Classification
2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), (2015): 842-850
- Fine-grained classification is the task of recognizing subordinate-level categories under some basic-level category, e.g., classifying different bird types, dog breeds, flower species, aircraft models, etc.
- Figure panels Arctic_Tern, Caspian_Tern, and Common_Tern illustrate the difficulty of fine-grained classification: large intra-class variance and small inter-class variance.
- Even in the ILSVRC2012 1K categories, there are 118 and 59 categories under the dog and bird class, respectively.
- We begin with a demonstration of the performance advantage of learning deep features based on object-level attention
- We compare against two baseline feature extractors: one is hand-crafted kernel descriptors (KDES), which were widely used in fine-grained classification before Convolutional Neural Net (CNN) features; the other is a CNN feature extractor pre-trained on all the data in ILSVRC2012
- The DomainNet-based feature extractor achieves the best results on both pipelines, which further demonstrates that using object-level attention to filter relevant patches is an important condition for a CNN to learn good features
- This leads to better CNN features for fine-grained classification, as the network is driven by domain-relevant patches rich in shift/scale variance
- Our attention-based methods achieve significant improvement; the two-level attention delivers even better results than using human-labelled bounding boxes (69.7% vs. 68.4%), and is comparable to DPD (70.5%)
- One important advantage of our method is that the attention is derived from a CNN trained on the classification task, so it can be conducted under the weakest supervision setting, where only the class label is provided
- The authors' design is based on a very simple intuition: performing fine-grained classification requires first "seeing" the object and then its most discriminative parts.
- Finding a Chihuahua in an image entails the process of first seeing a dog, and focusing on its important features that tell it apart from other breeds of dog.
- For this to work, the classifier should operate not on the raw image but on its constituent patches.
- Such patches should retain the objectness most relevant to the recognition steps.
- The objectness of the first step is at the level of the dog class, and that of the second step is at the parts that differentiate a Chihuahua from other breeds.
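The object-level attention step described above can be sketched as a patch filter: a CNN trained on the basic-level classes scores candidate patches, and only patches whose summed probability over the relevant parent classes (e.g., all dog categories) is high enough are kept. This is a minimal NumPy sketch; the function name `filter_patches` and the fixed threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def filter_patches(patch_logits, parent_class_ids, threshold=0.5):
    """Return indices of candidate patches whose summed softmax
    probability over the basic-level parent classes (e.g., all dog
    categories) exceeds `threshold` -- a proxy for objectness.

    patch_logits: (num_patches, num_classes) raw scores from a CNN
    trained on the basic-level classification task (an assumption
    about the interface, not the paper's exact network).
    """
    probs = softmax(patch_logits)                     # (P, C)
    parent_score = probs[:, parent_class_ids].sum(1)  # objectness proxy
    return np.flatnonzero(parent_score > threshold)
```

The surviving patches would then be used to train the domain-specific network, as the summary describes.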
- Results: in this task, only image-level class labels are available.
- Softmax outputs of 10 fixed views are averaged as the final prediction
- In this baseline method, no specific attention is used when patches are selected.
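The 10-view averaging mentioned above is the standard test-time scheme: the four corner crops plus the center crop, each with its horizontal flip, with the softmax outputs averaged. A minimal sketch, assuming `predict` is any function mapping a cropped view to a probability vector:

```python
import numpy as np

def ten_view_average(image, predict, crop=224):
    """Average predictions over 10 fixed views: 4 corner crops plus
    the center crop, each together with its horizontal flip.

    image:   (H, W, C) array with H, W >= crop
    predict: function mapping a (crop, crop, C) view to a
             probability vector (hypothetical interface)
    """
    h, w = image.shape[:2]
    tops = [0, 0, h - crop, h - crop, (h - crop) // 2]
    lefts = [0, w - crop, 0, w - crop, (w - crop) // 2]
    probs = []
    for t, l in zip(tops, lefts):
        view = image[t:t + crop, l:l + crop]
        probs.append(predict(view))
        probs.append(predict(view[:, ::-1]))  # horizontal flip
    return np.mean(probs, axis=0)
```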
- Advantage on learning deep features: the authors have shown that the bird DomainNet trained with object-level attention delivers superior classification performance on ILSVRC2012 Bird.
- The authors propose a fine-grained classification pipeline combining bottom-up and two top-down attentions.
- The object-level attention feeds the network with patches relevant to the task domain, at different views and scales
- The part-level attention focuses on local discriminative patterns and achieves pose normalization.
- This is in sharp contrast with other state-of-the-art methods that require object bounding boxes or part landmarks to train or test.
- To the best of their knowledge, the authors achieve the best accuracy on the CUB200-2011 dataset under the weakest supervision setting
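The final step of the pipeline summarized above merges the object-level and part-level classifier outputs into one prediction. A minimal sketch of such a merge as a weighted average of the two softmax score vectors; the mixing weight `alpha` is an illustrative assumption, not the paper's exact combination rule:

```python
import numpy as np

def merge_two_level(object_probs, part_probs, alpha=0.5):
    """Combine object-level and part-level classifier outputs by a
    weighted average and return the predicted class index.

    alpha is a hypothetical mixing weight: alpha=1.0 trusts only the
    object-level classifier, alpha=0.0 only the part-level one.
    """
    combined = alpha * object_probs + (1.0 - alpha) * part_probs
    return int(np.argmax(combined))
```

With equal weighting this simply averages the two score vectors, so a confident part-level prediction can override a weak object-level one and vice versa.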
- Table 1: Top-1 error rate on the ILSVRC2012 Dog/Bird validation set
- Table 2: Accuracy and annotations used across methods
- Fine-grained classification has been extensively studied recently [21, 22, 11, 3, 5, 24, 27, 2, 4]. Previous works have aimed at boosting recognition accuracy from three main aspects: 1. object and part localization, which can also be treated as object/part-level attention; 2. feature representation for detected objects or parts; 3. human in the loop. Since our goal is automatic fine-grained classification, we focus on related work in the first two.
4.1. Object/Part Level Attention
In fine-grained classification tasks, discriminative features are mainly localized on the foreground object, and often on object parts, which makes object- and part-level attention the first important step. Since fine-grained classification datasets often provide detailed annotations of bounding boxes and part landmarks, most methods rely on some of these annotations to achieve object- or part-level attention.
The strongest supervised setting uses bounding boxes and part landmarks in both the training and testing phases, and is often used to test the performance upper bound. To verify CNN features on fine-grained tasks, bounding boxes are assumed given in both training and testing [7, 16]. Using the provided bounding box, several methods propose to learn part detectors in an unsupervised or latent manner [23, 5]. To further improve performance, part-level annotations are also used in the training phase to learn strongly-supervised deformable part-based models [1, 27] or directly to fine-tune a pre-trained CNN.
- This work was supported by National Natural Science Foundation of China under Grant 61371128, National Hi-Tech Research and Development Program of China (863 Program) under Grant 2014AA015102, and Ph.D. Programs Foundation of Ministry of Education of China under Grant 20120001110097.
- H. Azizpour and I. Laptev. Object detection using strongly-supervised deformable part models. In ECCV, 2012.
- T. Berg and P. N. Belhumeur. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, 2013.
- L. Bo, X. Ren, and D. Fox. Kernel descriptors for visual recognition. In NIPS, 2010.
- S. Branson, G. Van Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952, 2014.
- Y. Chai, V. Lempitsky, and A. Zisserman. Symbiotic segmentation and part localization for fine-grained categorization. In ICCV, 2013.
- M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014.
- J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. Technical report, 2013.
- P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010.
- E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars. Fine-grained categorization by alignments. In ICCV, 2013.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li. Dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, CVPR, 2011.
- A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.
- X. Li and C. G. M. Snoek. Classifying tag relevance with relevant positive and negative examples. In Proceedings of the ACM International Conference on Multimedia, Barcelona, Spain, October 2013.
- S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
- M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008.
- A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN Features off-the-shelf: an Astounding Baseline for Recognition. arXiv preprint arXiv:1403.6382, 2014.
- M. Simon, E. Rodner, and J. Denzler. Part detector discovery in deep convolutional neural networks. arXiv preprint arXiv:1411.3159, 2014.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
- C. Wah, S. Branson, P. Perona, and S. Belongie. Multiclass recognition and part localization with humans in the loop. In ICCV, 2011.
- C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
- P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD birds 200. 2010.
- S. Yang, L. Bo, J. Wang, and L. G. Shapiro. Unsupervised template learning for fine-grained object recognition. In NIPS, pages 3122–3130, 2012.
- B. Yao, G. Bradski, and L. Fei-Fei. A codebook-free and annotation-free approach for fine-grained image categorization. In CVPR, 2012.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.
- N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV, 2014.
- N. Zhang, R. Farrell, F. Iandola, and T. Darrell. Deformable part descriptors for fine-grained recognition and attribute prediction. In ICCV, 2013.