
Latent Embeddings For Zero-Shot Classification

2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 69-77


Abstract

We present a novel latent embedding model for learning a compatibility function between image and class embeddings, in the context of zero-shot classification. The proposed method augments the state-of-the-art bilinear compatibility model by incorporating latent variables. Instead of learning a single bilinear map, it learns a collection …

Introduction
  • Zero-shot classification [14, 20, 21, 30, 41] is a challenging problem. The task is generally set as follows: training images are provided for certain visual classes and the classifier is expected to predict the presence or absence of novel classes at test time.
  • In fine-grained image collections, images that belong to different classes are visually similar to each other, e.g. different bird species.
  • Image labeling for such collections is a costly process, as it requires either expert opinion or a large number of attributes.
  • The class embeddings reflect the common and distinguishing properties of different classes using side information that is extracted independently of images.
  • Using these embeddings, the compatibility can be computed even for unseen classes that have no corresponding images in the training set.
  • The compatibility function takes a simple bilinear form, F(x, y) = θ(x)^T W φ(y), where θ(x) is the image embedding and φ(y) the class embedding; LatEm replaces the single map W with a collection of maps W_1, ..., W_K and takes the maximum over them.
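
To make the model concrete, here is a minimal sketch (with made-up dimensions, class names and variable names; not the authors' code) of the bilinear score, LatEm's piecewise-linear variant F(x, y) = max_i θ(x)^T W_i φ(y), and the zero-shot prediction rule that picks the unseen class with the highest compatibility:

    import numpy as np

    def bilinear_score(x, y, W):
        # Bilinear compatibility F(x, y) = x^T W y (single map, SJE-style).
        return float(x @ W @ y)

    def latem_score(x, y, Ws):
        # Piecewise-linear compatibility: max over K latent maps W_1..W_K.
        return max(float(x @ W @ y) for W in Ws)

    def predict(x, class_embeddings, Ws):
        # Zero-shot prediction: the unseen class with the highest compatibility.
        return max(class_embeddings, key=lambda c: latem_score(x, class_embeddings[c], Ws))

    # Toy usage with random data; the dimensions are illustrative only.
    rng = np.random.default_rng(0)
    d_img, d_cls, K = 1024, 300, 4
    Ws = [rng.normal(scale=0.01, size=(d_img, d_cls)) for _ in range(K)]
    classes = {c: rng.normal(size=d_cls) for c in ("albatross", "cormorant")}
    x = rng.normal(size=d_img)  # image embedding theta(x)
    print(predict(x, classes, Ws))
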
Highlights
  • Zero-shot classification [14, 20, 21, 30, 41] is a challenging problem
  • Interpretability of latent embeddings: in Sec. 5.1, we demonstrated that our novel latent embedding method improves the state of the art of zero-shot classification on two fine-grained datasets of birds and dogs, i.e. Caltech-UCSD Birds (CUB) and Stanford Dogs (Dogs), and one dataset of animals, i.e. Animals With Attributes (AWA).
  • We presented a novel latent variable based model, Latent Embeddings (LatEm), for learning a nonlinear compatibility function for the task of zero-shot classification
  • We proposed a ranking-based objective to learn the model with an efficient and scalable stochastic gradient descent (SGD) solver; a sketch of one such update follows this list.
  • We improved the state of the art for zero-shot learning using unsupervised class embeddings: up to 66.2% on AWA, and on two fine-grained datasets, 34.9% accuracy on CUB and 36.3% on Dogs with word2vec.
  • K is initially set to 16, and at every fifth epoch during training we prune all matrices that support less than 5% of the data points.
  • On AWA, we improve the accuracy obtained with supervised class embeddings, obtaining 76.1%
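
As a rough illustration of the ranking-based SGD training mentioned above (the margin, the sampling of a single wrong class, and the exact update rule are simplifying assumptions on our part, not a transcription of the paper's Algorithm 1), one step could look like this: if a sampled wrong class scores within a margin of the true class, the latent matrices selected by the max are pushed apart.

    import numpy as np

    def sgd_ranking_step(x, y_true, y_wrong, Ws, lr=1e-3, margin=1.0):
        # Hinge-style ranking loss [margin + F(x, y_wrong) - F(x, y_true)]_+
        # with F(x, y) = max_i x^T W_i y; only the argmax matrices receive a
        # subgradient update.
        i_true = int(np.argmax([x @ W @ y_true for W in Ws]))
        i_wrong = int(np.argmax([x @ W @ y_wrong for W in Ws]))
        loss = margin + x @ Ws[i_wrong] @ y_wrong - x @ Ws[i_true] @ y_true
        if loss > 0:
            Ws[i_wrong] -= lr * np.outer(x, y_wrong)  # push the wrong class down
            Ws[i_true] += lr * np.outer(x, y_true)    # pull the true class up
        return max(float(loss), 0.0), i_true  # i_true feeds the matrix-usage histogram

Returning the index of the selected matrix is what allows the matrix-usage histogram, and hence the pruning described above, to be built during training.
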
Methods
  • The authors evaluate the proposed model on three challenging publicly available datasets of Birds, Dogs and Animals.
  • Caltech-UCSD Birds (CUB) and Stanford Dogs (Dogs) are standard benchmarks of fine-grained recognition [7, 5, 37, 18], and Animals With Attributes (AWA) is another popular and challenging benchmark dataset [20].
  • All these three datasets have been used for zero-shot learning [2, 30, 17, 41].
Results
  • The authors empirically demonstrate that the model improves the state-of-the-art for various class embeddings consistently on three challenging publicly available datasets for the zero-shot setting.
  • Each sampled training example chooses one of the matrices for scoring; the authors track this information and build a histogram over the matrices, counting how many times each matrix was chosen by any training example. This is done by incrementing the counter for Wj∗ by 1 after step 6 of Algorithm 1 (see the pruning sketch after this list).
  • On AWA, the authors improve the accuracy obtained with supervised class embeddings, obtaining 76.1%
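
A minimal sketch of the pruning step described above (function and variable names are ours): given the histogram of how often each matrix was selected, drop the matrices supporting less than 5% of the training examples.

    def prune(Ws, counts, threshold=0.05):
        # Keep only the latent matrices chosen by at least `threshold` of the
        # training examples, per the histogram built during training.
        total = sum(counts)
        keep = [i for i, c in enumerate(counts) if c >= threshold * total]
        return [Ws[i] for i in keep]

    # Illustrative usage: with counts = [500, 450, 20, 30], the 5% cutoff is 50,
    # so prune(Ws, counts) keeps only Ws[0] and Ws[1].
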
Conclusion
  • LatEm builds on the idea of Structured Joint Embeddings (SJE) [2]. The authors discuss the differences between LatEm and SJE and emphasize the technical contributions: LatEm learns a piecewise-linear compatibility function through multiple Wi matrices, whereas SJE [2] is linear.
  • The task of discriminating between different bird species would be handled only by the second matrix, which would arguably be easier. The authors presented a novel latent variable based model, Latent Embeddings (LatEm), for learning a nonlinear compatibility function for the task of zero-shot classification.
  • On AWA, the authors improve the accuracy obtained with supervised class embeddings, obtaining 76.1%
  • This demonstrates quantitatively that the method learns a latent structure in the embedding space through multiple matrices.
  • Such a pruning-based method speeds up training and leads to models with space-time complexity competitive with the cross-validation based method.
Tables
  • Table 1: Statistics of the three datasets used. CUB and Dogs are fine-grained datasets, whereas AWA is a more general concept dataset.
  • Table 2: Comparison of the Latent Embeddings (LatEm) method with the state-of-the-art SJE [2] method. We report average per-class top-1 accuracy on unseen classes. We use the same data partitioning, image features and class embeddings as SJE [2]. We cross-validate K for LatEm.
  • Table 3: Combining embeddings, either including or not including supervision in the combination. w: the combination includes attributes; w/o: the combination does not include attributes.
  • Table 4: Average per-class top-1 accuracy on unseen classes (results are averaged over five folds). SJE: [2]; LatEm: latent embedding model (K is cross-validated).
  • Table 5: (Left) Number of matrices selected (on the original split); (right) average per-class top-1 accuracy on unseen classes (averaged over five splits). PR: proposed model learnt with pruning; CV: with cross-validation.
Related work
  • We are interested in the problem of zero-shot learning, where the test classes are disjoint from the training classes [14, 20, 21, 30, 41, 42]. As visual information from such test classes is not available during training, zero-shot learning requires secondary information sources to make up for the missing visual information. While secondary information can come from different sources, it is usually derived from large, unrestricted but freely available text corpora, e.g. word2vec [23] and GloVe [29], from structured textual sources, e.g. WordNet hierarchies [24], or from costly human annotations, e.g. manually specified attributes [7, 8, 11, 17, 20, 28, 27]. Attributes, such as ‘furry’ or ‘has four legs’ for animals, capture characteristics of objects (visual classes) that help associate some classes and differentiate others. They are typically collected through costly human annotation [7, 17, 28] and have shown promising results [1, 3, 6, 20, 22, 33, 34, 40] in various computer vision problems.

    The image classification problem, with a secondary stream of information, can be solved either through related sub-problems, e.g. attribute prediction [20, 30, 31], or by a direct approach, e.g. compatibility learning between embeddings [1, 12, 38]. One such factorization builds intermediate attribute classifiers and then makes a class prediction using a probabilistic weight of each attribute for each sample [20]. However, these methods, based on attribute classifiers, have been shown to be suboptimal [1], due to their reliance on binary mappings (obtained by thresholding attribute scores) between attributes and images, which causes a loss of information. On the other hand, solving the problem directly, by learning a direct mapping between images and their classes (represented as numerical vectors), has been shown to be better suited. Such label embedding methods [1, 2, 12, 13, 25, 26, 32, 35] aim to find a mapping between two embedding spaces, one for each of the two streams of information, e.g. visual and textual. Among these methods, CCA [13] maximizes the correlation between the two embedding spaces, [26] learns a linear compatibility between an fMRI-based image space and the semantic space, [35] learns a deep non-linear mapping between images and tags, ConSE [25] uses the probabilities of a softmax output layer to weight the vectors of all the classes, and SJE [2] and ALE [1] learn a bilinear compatibility function using a multiclass loss [4] and a weighted approximate ranking loss [16], respectively. DeViSE [12] does the same, but with an efficient ranking formulation. Most recently, [32] proposed to learn this mapping by optimizing a simple objective function that has a closed-form solution.
Findings
  • Our algorithm works well in practice and achieves state-of-the-art results, as we show empirically in Sec. 5.
  • As training proceeds, we track which matrix each sampled training example chooses for scoring (incrementing the counter for Wj∗ after step 6 of Algorithm 1); after every five passes over the training data, we prune out the matrices chosen by less than 5% of the training examples so far, starting from K = 16.
  • Using text embeddings obtained through human annotation, i.e. attributes (att), LatEm improves significantly over SJE on AWA (71.9% vs. 66.7%).
  • LatEm also improves the accuracy obtained with supervised class embeddings on AWA (76.1% vs. 73.9%).
  • With glo, performance increases until K = 10, where the final accuracy is ≈5% higher than with K = 1.
  • The variation in performance with K seems to depend on the embeddings used; in the zero-shot setting, depending on the data distribution, the results may vary by up to 5%.
  • We improved the state of the art for zero-shot learning using unsupervised class embeddings on AWA up to 66.2% (vs. 60.1%), and on two fine-grained datasets, achieving 34.9% accuracy (vs. 29.9%) on CUB and 36.3% (vs. 35.1%) on Dogs with word2vec.
Study subjects and analysis
datasets: 3
The model is trained with a ranking-based objective that penalizes incorrect rankings of the true class for a given image; this ranking loss is better suited to the piecewise-linear model than alternatives. The model is evaluated on three challenging, publicly available datasets: Caltech-UCSD Birds (CUB) and Stanford Dogs (Dogs), standard benchmarks of fine-grained recognition [7, 5, 37, 18], and Animals With Attributes (AWA), a popular and challenging benchmark of a more general concept [20]. All three datasets have been used for zero-shot learning [2, 30, 17, 41]; Tab. 1 gives their statistics. Beyond accuracy, the method leads to visually highly interpretable results, with clear clusters of different fine-grained object properties corresponding to different latent variable maps.

For fine-grained datasets such as CUB and Dogs, objects are visually very similar to each other, so a large number of attributes is needed: CUB contains 312 attributes and AWA contains 85, while Dogs has no attribute annotations. The attribute class embedding is a per-class vector measuring the strength of each attribute based on human judgment.

cases: 5
In terms of performance, pruning and cross-validation are competitive: pruning outperforms cross-validation in five cases and is outperformed in the remaining six. The performance gaps are usually within 1-2% absolute, with the exception of the AWA dataset with att and w2v (72.5% vs. 70.7% and 52.3% vs. 49.3%, respectively, for cross-validation and pruning).

Reference
  • [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE TPAMI, 2015.
  • [2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.
  • [3] H. Chen, A. Gallagher, and B. Girod. What's in a name? First names as facial attributes. In CVPR, 2012.
  • [4] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. ML, 2002.
  • [5] J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In CVPR, 2013.
  • [6] M. Douze, A. Ramisa, and C. Schmid. Combining attributes and Fisher vectors for efficient image retrieval. In CVPR, 2011.
  • [7] K. Duan, D. Parikh, D. J. Crandall, and K. Grauman. Discovering localized attributes for fine-grained recognition. In CVPR, 2012.
  • [8] A. Farhadi, I. Endres, and D. Hoiem. Attribute-centric recognition for cross-category generalization. In CVPR, 2010.
  • [9] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
  • [10] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE TPAMI, 32(9):1627–1645, 2010.
  • [11] V. Ferrari and A. Zisserman. Learning visual attributes. In NIPS, 2007.
  • [12] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
  • [13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning (2nd ed.). Springer Series in Statistics. Springer, 2008.
  • [14] S. Huang, M. Elhoseiny, A. M. Elgammal, and D. Yang. Learning hypergraph-regularized attribute predictors. In CVPR, 2015.
  • [15] S. Hussain and B. Triggs. Feature sets and dimensionality reduction for visual object detection. In BMVC, 2010.
  • [16] T. Joachims. Training linear SVMs in linear time. In ACM SIGKDD, 2006.
  • [17] P. Kankuekul, A. Kawewong, S. Tangruamsub, and O. Hasegawa. Online incremental attribute-based zero-shot learning. In CVPR, 2012.
  • [18] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Stanford Dogs dataset. http://vision.stanford.edu/aditya86/ImageNetDogs/
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [20] C. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE TPAMI, 2013.
  • [21] H. Larochelle, D. Erhan, and Y. Bengio. Zero-data learning of new tasks. In AAAI, 2008.
  • [22] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, 2011.
  • [23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • [24] G. A. Miller. WordNet: a lexical database for English. CACM, 38:39–41, 1995.
  • [25] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv:1312.5650, 2013.
  • [26] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, 2009.
  • [27] D. Papadopoulos, A. Clarke, F. Keller, and V. Ferrari. Training object class detectors from eye tracking data. In ECCV, 2014.
  • [28] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
  • [29] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
  • [30] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, 2011.
  • [31] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps here – and why? Semantic relatedness for knowledge transfer. In CVPR, 2010.
  • [32] B. Romera-Paredes and P. H. S. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.
  • [33] W. J. Scheirer, N. Kumar, P. N. Belhumeur, and T. E. Boult. Multi-attribute spaces: Calibration for attribute fusion and similarity search. In CVPR, 2012.
  • [34] B. Siddiquie, R. Feris, and L. Davis. Image ranking and retrieval based on multi-attribute queries. In CVPR, 2011.
  • [35] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
  • [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [37] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, Caltech, 2010.
  • [38] J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
  • [39] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, 2011.
  • [40] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. J. Guibas, and F.-F. Li. Human action recognition by learning bases of action attributes and parts. In ICCV, 2011.
  • [41] X. Yu and Y. Aloimonos. Attribute-based transfer learning for object categorization with zero or one training example. In ECCV, 2010.
  • [42] Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding. In CVPR, 2016.
  • [43] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, 2012.