Latent Embeddings For Zero-Shot Classification
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), no. 1 (2016): 69-77
- Zero-shot classification [14, 20, 21, 30, 41] is a challenging problem. The task is generally set as follows: training images are provided for certain visual classes and the classifier is expected to predict the presence or absence of novel classes at test time.
- In fine-grained image collections, images that belong to different classes are visually similar to each other, e.g. different bird species.
- Image labeling for such collections is a costly process, as it requires either expert opinion or a large number of attributes.
- The class embeddings reflect the common and distinguishing properties of different classes using side information that is extracted independently of images.
- Using these embeddings, the compatibility can be computed even for unseen classes that have no corresponding images in the training set.
- The compatibility function takes a simple form.
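The form referred to above is LatEm's piecewise bilinear compatibility F(x, y) = max_i xᵀ W_i φ(y), the maximum over K bilinear maps. A minimal NumPy sketch, not the authors' code; dimensions, class names and the random features are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_cls, K = 1024, 300, 4              # image dim, class-embedding dim, number of W_i
W = [0.01 * rng.standard_normal((d_img, d_cls)) for _ in range(K)]

def compatibility(x, phi, W):
    """F(x, y): maximum over the K bilinear scores x^T W_i phi(y)."""
    return max(float(x @ Wi @ phi) for Wi in W)

def predict(x, class_embeddings, W):
    """Zero-shot prediction: the unseen class with the highest compatibility."""
    return max(class_embeddings, key=lambda c: compatibility(x, class_embeddings[c], W))

x = rng.standard_normal(d_img)              # stand-in for a CNN image feature
unseen = {c: rng.standard_normal(d_cls) for c in ("cardinal", "warbler", "finch")}
pred = predict(x, unseen, W)                # class name with the largest F(x, y)
```

Because φ(y) only needs side information (attributes, word2vec, etc.), `predict` works for classes with no training images, which is exactly the zero-shot setting described above.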
- Interpretability of latent embeddings: In Sec. 5.1, we have demonstrated that our novel latent embedding method improves the state of the art of zero-shot classification on two fine-grained datasets, Caltech-UCSD Birds (CUB) and Stanford Dogs (Dogs), and on one dataset of animals, Animals With Attributes (AWA).
- We presented a novel latent variable based model, Latent Embeddings (LatEm), for learning a nonlinear compatibility function for the task of zero-shot classification
- We proposed a ranking based objective to learn the model using an efficient and scalable stochastic gradient descent (SGD) based solver
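The ranking-based SGD solver mentioned above can be sketched as follows. This is a hedged reconstruction, not the paper's Algorithm 1: I assume a unit margin and, as a simplification, that only the matrix attaining the max for each class embedding receives a rank-one update when the margin is violated.

```python
import numpy as np

rng = np.random.default_rng(1)
d_img, d_cls, K, lr = 8, 5, 3, 0.1          # toy sizes and learning rate (assumed)
W = [0.01 * rng.standard_normal((d_img, d_cls)) for _ in range(K)]

def F(x, phi, W):
    """Piecewise bilinear compatibility: max_i x^T W_i phi."""
    return max(float(x @ Wi @ phi) for Wi in W)

def best_index(x, phi, W):
    """Index of the matrix attaining the max for this (x, phi) pair."""
    return int(np.argmax([x @ Wi @ phi for Wi in W]))

def sgd_step(x, phi_true, phi_wrong, W, lr):
    """If a sampled wrong class violates the unit margin, push its score
    down and pull the correct class's score up via rank-one updates."""
    if 1.0 + F(x, phi_wrong, W) - F(x, phi_true, W) > 0:
        W[best_index(x, phi_wrong, W)] -= lr * np.outer(x, phi_wrong)
        W[best_index(x, phi_true, W)] += lr * np.outer(x, phi_true)

# Toy pair with orthogonal class embeddings so the effect is easy to verify.
x = rng.standard_normal(d_img)
phi_true, phi_wrong = np.eye(d_cls)[0], np.eye(d_cls)[1]
margin_before = F(x, phi_true, W) - F(x, phi_wrong, W)
sgd_step(x, phi_true, phi_wrong, W, lr)
margin_after = F(x, phi_true, W) - F(x, phi_wrong, W)
```

Each update touches a single W_i, which is what makes the solver scalable: the cost per sampled pair is one rank-one update, independent of the number of classes seen so far.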
- We improved the state of the art for zero-shot learning using unsupervised class embeddings on AWA, up to 66.2%, and on two fine-grained datasets, achieving 34.9% accuracy on CUB and 36.3% accuracy on Dogs with word2vec.
- K is initially set to be 16, and at every fifth epoch during training, we prune all those matrices that support less than 5% of the data points
- On AWA, we improve the accuracy obtained with supervised class embeddings, obtaining 76.1%
- The authors evaluate the proposed model on three challenging publicly available datasets of Birds, Dogs and Animals.
- Caltech-UCSD Birds (CUB) and Stanford Dogs (Dogs) are standard benchmarks of fine-grained recognition [7, 5, 37, 18], and Animals With Attributes (AWA) is another popular and challenging benchmark dataset.
- All these three datasets have been used for zero-shot learning [2, 30, 17, 41].
- The authors empirically demonstrate that the model improves the state-of-the-art for various class embeddings consistently on three challenging publicly available datasets for the zero-shot setting.
- Each sampled training example chooses one of the matrices for scoring; the authors keep track of this choice and build a histogram counting how many times each matrix was chosen by any training example.
- This is done by increasing the counter for Wj∗ by 1 after step 6 of Algorithm 1.
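The bookkeeping described in the two bullets above, together with the 5% pruning rule mentioned elsewhere in this summary, can be sketched as below. The toy usage distribution is made up purely for illustration:

```python
import numpy as np

K = 16                                        # initial number of matrices
counts = np.zeros(K, dtype=int)               # histogram over matrix choices

def record_choice(j_star):
    """Mirror of 'increase the counter for Wj* by 1' after Algorithm 1, step 6."""
    counts[j_star] += 1

def prune(W, counts, threshold=0.05):
    """Drop matrices chosen by fewer than `threshold` of the training examples."""
    keep = counts / counts.sum() >= threshold
    return [Wi for Wi, k in zip(W, keep) if k]

W = [np.zeros((4, 3)) for _ in range(K)]      # placeholder matrices
rng = np.random.default_rng(2)
probs = np.array([0.24] * 4 + [0.04 / 12] * 12)  # 4 heavily used, 12 rarely used
probs /= probs.sum()                          # normalize exactly for rng.choice
for j in rng.choice(K, size=1000, p=probs):   # 1000 simulated matrix choices
    record_choice(j)
W_pruned = prune(W, counts)                   # only the 4 well-supported matrices survive
```

The histogram doubles as the interpretability tool discussed in the paper: matrices that survive pruning are those that capture a latent cluster of the data.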
- LatEm builds on the idea of Structured Joint Embeddings (SJE). The authors discuss below the differences between LatEm and SJE and emphasize the technical contributions.
LatEm learns a piecewise linear compatibility function through multiple Wi matrices, whereas SJE learns a single linear one.
- The task of discriminating among different bird species would be handled only by the second one, which would arguably be easier.
- This demonstrates quantitatively that the method learns a latent structure in the embedding space through multiple matrices.
- Such a pruning-based method speeds up training and leads to models whose space-time complexity is competitive with that of the cross-validation-based method.
- Table 1: Statistics of the three datasets used. CUB and Dogs are fine-grained datasets, whereas AWA is a more general concept dataset.
- Table 2: Comparison of the Latent Embeddings (LatEm) method with the state-of-the-art SJE [2] method. We report average per-class top-1 accuracy on unseen classes. We use the same data partitioning, image features and class embeddings as SJE [2]. We cross-validate K for LatEm.
- Table 3: Combining embeddings either including or not including supervision in the combination. w: the combination includes attributes, w/o: the combination does not include attributes.
- Table 4: Average per-class top-1 accuracy on unseen classes (results averaged over five folds). SJE: [2], LatEm: latent embedding model (K is cross-validated).
- Table 5: (Left) Number of matrices selected (on the original split) and (right) average per-class top-1 accuracy on unseen classes (averaged over five splits). PR: proposed model learnt with pruning, CV: with cross-validation.
- We are interested in the problem of zero-shot learning, where the test classes are disjoint from the training classes [14, 20, 21, 30, 41, 42]. As visual information from such test classes is not available during training, zero-shot learning requires secondary information sources to make up for the missing visual information. While secondary information can come from different sources, it is usually derived either from large, unrestricted, but freely available text corpora, e.g. word2vec or GloVe, from structured textual sources, e.g. WordNet hierarchies, or from costly human annotation, e.g. manually specified attributes [7, 8, 11, 17, 20, 28, 27]. Attributes, such as ‘furry’ or ‘has four legs’ for animals, capture characteristics of objects (visual classes) that help associate some classes and differentiate others. They are typically collected through costly human annotation [7, 17, 28] and have shown promising results [1, 3, 6, 20, 22, 33, 34, 40] in various computer vision problems.
The image classification problem with a secondary stream of information can be solved either via related sub-problems, e.g. attribute prediction [20, 30, 31], or by a direct approach, e.g. compatibility learning between embeddings [1, 12, 38]. One such factorization builds intermediate attribute classifiers and then makes a class prediction using a probabilistic weight of each attribute for each sample. However, these methods based on attribute classifiers have been shown to be suboptimal, due to their reliance on binary mappings (obtained by thresholding attribute scores) between attributes and images, which causes a loss of information. On the other hand, solving the problem directly, by learning a direct mapping between images and their classes (represented as numerical vectors), has been shown to be better suited. Such label embedding methods [1, 2, 12, 13, 25, 26, 32, 35] aim to find a mapping between two embedding spaces, one for each of the two streams of information, e.g. visual and textual. Among these methods, CCA maximizes the correlation between the two embedding spaces; one approach learns a linear compatibility between an fMRI-based image space and the semantic space; another learns a deep non-linear mapping between images and tags; ConSE uses the probabilities of a softmax output layer to weight the vectors of all the classes; SJE and ALE learn a bilinear compatibility function using a multiclass loss and a weighted approximate ranking loss, respectively; DeViSE does the same, however with a more efficient ranking formulation. Most recently, one method proposes to learn this mapping by optimizing a simple objective function that has a closed-form solution.
- It turns out that our algorithm works well in practice and achieves state-of-the-art results, as we empirically show in Sec. 5.
- Using the class embeddings obtained through human annotation, i.e. attributes (att), LatEm improves significantly over SJE on AWA (71.9% vs. 66.7%).
- LatEm improves the results over SJE significantly on AWA (76.1% vs 73.9%)
- The performance gaps are usually within 1–2% absolute, with the exception of AWA, where att gives 72.5% vs. 70.7% and w2v gives 52.3% vs. 49.3%, for cross-validation and pruning respectively.
- With glo, the performance increases until K = 10, where the final accuracy is ≈ 5% higher than with K = 1.
- The variation in performance with K seems to depend on the embeddings used; however, in the zero-shot setting, results may vary by up to 5% depending on the data distribution.
- We improved the state of the art for zero-shot learning using unsupervised class embeddings on AWA, up to 66.2% (vs. 60.1%), and on two fine-grained datasets, achieving 34.9% accuracy (vs. 29.9%) on CUB and 36.3% accuracy (vs. 35.1%) on Dogs with word2vec.
- On AWA, we also improve the accuracy obtained with supervised class embeddings, obtaining 76.1% (vs. 73.9%).
- Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE TPAMI, 2015.
- Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.
- H. Chen, A. Gallagher, and B. Girod. What’s in a name? First names as facial attributes. In CVPR, 2012.
- K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 2002.
- J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In CVPR, 2013.
- M. Douze, A. Ramisa, and C. Schmid. Combining attributes and Fisher vectors for efficient image retrieval. In CVPR, 2011.
- K. Duan, D. Parikh, D. J. Crandall, and K. Grauman. Discovering localized attributes for fine-grained recognition. In CVPR, 2012.
- A. Farhadi, I. Endres, and D. Hoiem. Attribute-centric recognition for cross-category generalization. In CVPR, 2010.
- A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
- P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE TPAMI, 32(9):1627–1645, 2010.
- V. Ferrari and A. Zisserman. Learning visual attributes. In NIPS, 2007.
- A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning (2nd ed.). Springer Series in Statistics. Springer, 2008.
- S. Huang, M. Elhoseiny, A. M. Elgammal, and D. Yang. Learning hypergraph-regularized attribute predictors. In CVPR, 2015.
- S. Hussain and B. Triggs. Feature sets and dimensionality reduction for visual object detection. In BMVC, 2010.
- T. Joachims. Training linear SVMs in linear time. In ACM SIGKDD, 2006.
- P. Kankuekul, A. Kawewong, S. Tangruamsub, and O. Hasegawa. Online incremental attribute-based zero-shot learning. In CVPR, 2012.
- A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Stanford Dogs dataset. http://vision.stanford.edu/aditya86/ImageNetDogs/
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- C. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE TPAMI, 2013.
- H. Larochelle, D. Erhan, and Y. Bengio. Zero-data learning of new tasks. In AAAI, 2008.
- J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, 2011.
- T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
- G. A. Miller. WordNet: a lexical database for English. CACM, 38:39–41, 1995.
- M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv:1312.5650, 2013.
- M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, 2009.
- D. Papadopoulos, A. Clarke, F. Keller, and V. Ferrari. Training object class detectors from eye tracking data. In ECCV, 2014.
- D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
- J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
- M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, 2011.
- M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps here – and why? Semantic relatedness for knowledge transfer. In CVPR, 2010.
- B. Romera-Paredes and P. H. S. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.
- W. J. Scheirer, N. Kumar, P. N. Belhumeur, and T. E. Boult. Multi-attribute spaces: Calibration for attribute fusion and similarity search. In CVPR, 2012.
- B. Siddiquie, R. Feris, and L. Davis. Image ranking and retrieval based on multi-attribute queries. In CVPR, 2011.
- R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, Caltech, 2010.
- J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
- Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, 2011.
- B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. J. Guibas, and F.-F. Li. Human action recognition by learning bases of action attributes and parts. In ICCV, 2011.
- X. Yu and Y. Aloimonos. Attribute-based transfer learning for object categorization with zero or one training example. In ECCV, 2010.
- Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding. In CVPR, 2016.
- X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, 2012.