The Application Of Two-Level Attention Models In Deep Convolutional Neural Network For Fine-Grained Image Classification

2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), (2015): 842-850

Cited: 756
Abstract

Fine-grained classification is challenging because categories can only be discriminated by subtle and local differences. Variances in the pose, scale or rotation usually make the problem more difficult. Most fine-grained classification systems follow the pipeline of finding foreground object or object parts (where) to extract discriminative features …

Introduction
  • Fine-grained classification is to recognize subordinate-level categories under some basic-level category, e.g., classifying different bird types [22], dog breeds [11], flower species [15], aircraft models [14], etc.
  • This is an important task; the figure labels Arctic_Tern, Caspian_Tern and Common_Tern illustrate the difficulty of fine-grained classification: large intra-class variance and small inter-class variance.
  • Even in the ILSVRC2012 1K categories, there are 118 and 59 categories under the dog and bird class, respectively.
Highlights
  • Fine-grained classification is to recognize subordinate-level categories under some basic-level category, e.g., classifying different bird types [22], dog breeds [11], flower species [15], aircraft models [14], etc.
  • We begin with a demonstration of the performance advantage of learning deep features based on object-level attention
  • We compare against two baseline feature extractors: hand-crafted kernel descriptors [3] (KDES), which were widely used in fine-grained classification before Convolutional Neural Net (CNN) features, and a CNN feature extractor pre-trained on all the data in ILSVRC2012 [16]
  • The DomainNet-based feature extractor achieves the best results on both pipelines. This further demonstrates that using object-level attention to filter relevant patches is an important condition for a CNN to learn good features
  • This leads to better CNN features for fine-grained classification, as the network is driven by domain-relevant patches that are rich in shift/scale variances
  • Our attention-based methods achieved a significant improvement, and the two-level attention delivers even better results than using a human-labelled bounding box (69.7% vs. 68.4%), and is comparable to DPD (70.5%)
  • One important advantage of our method is that, because the attention is derived from a CNN trained on the classification task, it can be applied under the weakest supervision setting, where only the class label is provided
Methods
  • The authors' design is based on a very simple intuition: performing fine-grained classification requires first to “see” the object and the most discriminative parts of it.
  • Finding a Chihuahua in an image entails the process of first seeing a dog, and focusing on its important features that tell it apart from other breeds of dog.
  • For this to work, the classifier should operate not on the raw image but on its constituent patches.
  • Such patches should retain the objectness most relevant to the recognition steps.
  • The objectness of the first step is at the level of dog class, and that of the second step is at the parts that would differentiate Chihuahua from other breeds.
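The two-step intuition above (first attend to the object, then classify its discriminative patches) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `objectness_score` and `domain_softmax` are hypothetical stand-ins for the paper's object-level attention network and fine-grained classifier, here replaced by toy placeholder functions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CLASSES = 200  # e.g. the CUB200-2011 bird categories

def objectness_score(patch):
    # Hypothetical stand-in for the object-level attention network:
    # score how likely the patch shows the basic-level object (e.g. a
    # dog) rather than background. Placeholder heuristic, not a real CNN.
    return float(patch.mean())

def domain_softmax(patch):
    # Hypothetical stand-in for the fine-grained classifier ("DomainNet"):
    # return a probability distribution over subordinate categories.
    logits = rng.normal(size=N_CLASSES)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def classify(patches, keep=5):
    # Step 1 (object-level attention): keep the patches most likely to
    # contain the object, discarding background proposals.
    relevant = sorted(patches, key=objectness_score, reverse=True)[:keep]
    # Step 2: classify each retained patch and average the softmax outputs.
    probs = np.mean([domain_softmax(p) for p in relevant], axis=0)
    return int(probs.argmax())

patches = [rng.random((64, 64)) for _ in range(20)]  # candidate patches
prediction = classify(patches)
```

In the paper, the candidate patches themselves come from a bottom-up proposal method; the sketch only shows how the two attention stages compose.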
Results
  • Results on ILSVRC2012 Dog/Bird: in this task, only image-level class labels are available.
  • Softmax outputs of 10 fixed views are averaged as the final prediction
  • In this baseline method, no specific attention is used when selecting patches
  • For this task, the authors begin with a demonstration of the performance advantage of learning deep features based on object-level attention.
  • Advantage on learning deep features: the authors have shown that the bird DomainNet trained with object-level attention delivers superior classification performance on ILSVRC2012 Bird.
  • The DomainNet-based feature extractor achieves the best results on both pipelines
  • This further demonstrates that using object-level attention to filter relevant patches is an important condition for a CNN to learn good features
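The multi-view prediction mentioned above (averaging the softmax outputs of 10 fixed views) can be sketched as follows. This is an illustrative toy, assuming random logits in place of real network outputs; the view count and class count are taken from the text.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax along the last axis.
    shifted = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return shifted / shifted.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n_views, n_classes = 10, 200                         # 10 fixed views per test image
view_logits = rng.normal(size=(n_views, n_classes))  # toy network outputs

# Average the per-view softmax distributions, then take the argmax.
avg_probs = softmax(view_logits).mean(axis=0)
prediction = int(avg_probs.argmax())
```

Averaging probabilities (rather than logits) keeps the final prediction a valid distribution over classes and smooths out view-specific noise.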
Conclusion
  • The authors propose a fine-grained classification pipeline combining bottom-up and two top-down attentions.
  • The object-level attention feeds the network with patches relevant to the task domain with different views and scales
  • This leads to better CNN features for fine-grained classification, as the network is driven by domain-relevant patches that are rich in shift/scale variances.
  • The part-level attention focuses on local discriminative patterns and achieves pose normalization.
  • One important advantage of the method is that, because the attention is derived from a CNN trained on the classification task, it can be applied under the weakest supervision setting, where only the class label is provided.
  • This is in sharp contrast with other state-of-the-art methods that require an object bounding box or part landmarks to train or test.
  • To the best of their knowledge, the authors obtain the best accuracy on the CUB200-2011 dataset under the weakest supervision setting
Tables
  • Table1: Top-1 error rate on ILSVRC2012 Dog/Bird validation set
  • Table2: Accuracy and Annotation used between methods
Related work
  • Fine-grained classification has been extensively studied recently [21, 22, 11, 3, 5, 24, 27, 2, 4]. Previous works have aimed at boosting the recognition accuracy from three main aspects: 1. object and part localization, which can also be treated as object/part level attention; 2. feature representation for detected objects or parts; 3. human in the loop [20]. Since our goal is automatic fine-grained classification, we focus on the related work of the first two.

    4.1. Object/Part Level Attention

    In fine-grained classification tasks, discriminative features are mainly localized on the foreground object and even on object parts, which makes object- and part-level attention the first important step. As fine-grained classification datasets often provide detailed annotations of bounding boxes and part landmarks, most methods rely on some of these annotations to achieve object- or part-level attention.

    The strongest supervised setting uses the bounding box and part landmarks in both the training and testing phases, which is often used to test the performance upper bound [2]. To verify CNN features on fine-grained tasks, bounding boxes are assumed given in both the training and testing phases [7, 16]. Using the provided bounding box, several methods have been proposed to learn part detectors in an unsupervised or latent manner [23, 5]. To further improve performance, part-level annotation is also used in the training phase to learn a strongly-supervised deformable part-based model [1, 27] or is directly used to fine-tune the pre-trained CNN [4].
Funding
  • This work was supported by National Natural Science Foundation of China under Grant 61371128, National Hi-Tech Research and Development Program of China (863 Program) under Grant 2014AA015102, and Ph.D. Programs Foundation of Ministry of Education of China under Grant 20120001110097.
Reference
  • [1] H. Azizpour and I. Laptev. Object detection using strongly-supervised deformable part models. In ECCV, 2012.
  • [2] T. Berg and P. N. Belhumeur. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, 2013.
  • [3] L. Bo, X. Ren, and D. Fox. Kernel descriptors for visual recognition. In NIPS, 2010.
  • [4] S. Branson, G. Van Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952, 2014.
  • [5] Y. Chai, V. Lempitsky, and A. Zisserman. Symbiotic segmentation and part localization for fine-grained categorization. In ICCV, 2013.
  • [6] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014.
  • [7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. Technical report, 2013.
  • [8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010.
  • [9] E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars. Fine-grained categorization by alignments. In ICCV, 2013.
  • [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [11] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li. Dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, CVPR, 2011.
  • [12] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [13] X. Li and C. G. M. Snoek. Classifying tag relevance with relevant positive and negative examples. In Proceedings of the ACM International Conference on Multimedia, Barcelona, Spain, October 2013.
  • [14] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
  • [15] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008.
  • [16] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.
  • [17] M. Simon, E. Rodner, and J. Denzler. Part detector discovery in deep convolutional neural networks. arXiv preprint arXiv:1411.3159, 2014.
  • [18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [19] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
  • [20] C. Wah, S. Branson, P. Perona, and S. Belongie. Multiclass recognition and part localization with humans in the loop. In ICCV, 2011.
  • [21] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
  • [22] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. 2010.
  • [23] S. Yang, L. Bo, J. Wang, and L. G. Shapiro. Unsupervised template learning for fine-grained object recognition. In NIPS, pages 3122–3130, 2012.
  • [24] B. Yao, G. Bradski, and L. Fei-Fei. A codebook-free and annotation-free approach for fine-grained image categorization. In CVPR, 2012.
  • [25] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.
  • [26] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV, 2014.
  • [27] N. Zhang, R. Farrell, F. Iandola, and T. Darrell. Deformable part descriptors for fine-grained recognition and attribute prediction. In ICCV, 2013.