Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination

Computer Vision and Pattern Recognition (CVPR), 2018.


Abstract:

Neural net classifiers trained on data with annotated class labels can also capture apparent visual similarity among categories without being directed to do so. We study whether this observation can be extended beyond the conventional domain of supervised learning: Can we learn a good feature representation that captures apparent similarity among instances, instead of classes, by merely asking the feature to be discriminative of individual instances?

Introduction
  • The rise of deep neural networks, especially convolutional neural networks (CNN), has led to several breakthroughs in computer vision benchmarks.
  • Fig. 1 shows that an image from the class leopard is rated much higher by the class jaguar than by the class bookcase [11].
  • Such observations reveal that a typical discriminative learning method can automatically discover apparent similarity among semantic categories, without being explicitly guided to do so.
  • Apparent similarity is learned not from semantic annotations, but from the visual data themselves.
Highlights
  • The rise of deep neural networks, especially convolutional neural networks (CNN), has led to several breakthroughs in computer vision benchmarks.
  • Our novel approach to unsupervised learning stems from a few observations on the results of supervised learning for object recognition.
  • We report and compare experimental results with both Support Vector Machine (SVM) and k-nearest neighbor (kNN) accuracies.
  • With ResNet-50, our method achieves a mean average precision (mAP) of 65.4%, surpassing all existing unsupervised learning approaches.
  • We present an unsupervised feature learning approach by maximizing distinction between instances via a novel non-parametric softmax formulation (stated formally after this list).
  • Our experimental results demonstrate that, under unsupervised learning settings, our method surpasses the state-of-the-art on image classification by a large margin, with top-1 accuracy of 42.5% on ImageNet 1K [1] and 38.7% on Places 205 [49].
  • Our experimental results show that our method outperforms the state-of-the-art on image classification on ImageNet and Places, with a compact 128-dimensional representation that scales well with more data and deeper networks.
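
For reference, the non-parametric softmax at the heart of the method treats each of the n training images as its own class. With L2-normalized feature vectors and a temperature parameter τ, the probability that a feature v is recognized as the i-th instance is

    P(i \mid \mathbf{v}) = \frac{\exp(\mathbf{v}_i^{\top} \mathbf{v} / \tau)}{\sum_{j=1}^{n} \exp(\mathbf{v}_j^{\top} \mathbf{v} / \tau)}

where v_j denotes the stored feature of the j-th instance. Because the denominator sums over all n instances, the paper approximates it with noise-contrastive estimation (NCE); Table 1 below quantifies how close this approximation gets as the number of noise samples m grows.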
Methods
  • Object detection mAP on PASCAL VOC 2007 (see Table 6); † marks supervised pretraining with labels:

        AlexNet           mAP    VGG                mAP    ResNet        mAP
        Labels†           56.8   Labels†            67.3   Labels†       76.2
        Gaussian          43.4   Gaussian           39.7   Ours ResNet   65.4
        Data-Init [16]    45.6   Video [44]         60.2
        Context [2]       51.1   Context [2]        61.5
        Adversarial [4]   46.9   Transitivity [45]  63.2
        Color [47]        46.9   Ours VGG           60.5
        Video [44]        47.4
        Ours AlexNet      48.1
  • For fine-tuning ResNet-50, the authors freeze the weights below the 3rd type of residual blocks, only updating the layers above and freezing all batch normalization layers (see the sketch after this list).
  • The authors compare three settings: 1) directly training from scratch, 2) pretraining on ImageNet in a supervised way, and 3) pretraining on ImageNet or other data using various unsupervised methods.
  • With AlexNet and VGG16, the method achieves mAPs of 48.1% and 60.5% respectively, on par with state-of-the-art unsupervised methods.
  • With ResNet-50, the method achieves an mAP of 65.4%, surpassing all existing unsupervised learning approaches.
  • There remains a significant gap of about 11% to the 76.2% mAP obtained with supervised pretraining.
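
As a rough illustration of the freezing scheme described above, the following PyTorch sketch fixes the lower stages of a ResNet-50 and all batch-normalization layers. This is a sketch, not the authors' released code: the stage names (conv1, layer1, layer2) follow torchvision's ResNet, and reading "below the 3rd type of residual blocks" as everything up to and including layer2 is an assumption here.

    import torch.nn as nn
    import torchvision.models as models

    model = models.resnet50()

    # Freeze the stem and the first two residual stages (assumed to be the
    # layers "below the 3rd type of residual blocks").
    for stage in [model.conv1, model.bn1, model.layer1, model.layer2]:
        for p in stage.parameters():
            p.requires_grad = False

    # Freeze all batch-normalization layers: fix the running statistics and
    # stop gradient updates to their affine parameters.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()
            for p in m.parameters():
                p.requires_grad = False

One practical caveat: calling model.train() later flips the batch-norm modules back to training mode, so the m.eval() calls have to be reapplied after every switch to training mode.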
Results
  • [Figure: evaluation of semi-supervised learning — top-5 accuracy versus the amount of labeled data (1%, 2%, 4%, 10%, 20%), comparing Ours-ResNet, Scratch-ResNet, Color-ResNet-152, Ours-AlexNet, Scratch-AlexNet, and SplitBrain-AlexNet.]
  • The authors randomly choose a subset of ImageNet as labeled and treat the others as unlabeled.
  • In order to compare with [19], the authors report the top-5 accuracy here.
  • The authors compare the method with three baselines: (1) Scratch, i.e. fully supervised training on the small labeled subsets, (2) Split-brain [48] for pre-training, and (3) Colorization [19] for pre-training.
  • Fine-tuning on the labeled subset takes 70 epochs, with an initial learning rate of 0.01 decayed by a factor of 10 every 30 epochs (sketched below).
  • The authors vary the proportion of the labeled subset from 1% to 20% of the entire dataset.
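
The schedule in the bullets above maps onto a standard step-decay recipe. A minimal sketch follows; only the 70 epochs, the 0.01 initial learning rate, and the decay by a factor of 10 every 30 epochs come from the text, while the SGD momentum, the ResNet-18 backbone, and the dummy data standing in for the labeled subset are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torchvision.models as models
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder standing in for the 1%-20% labeled ImageNet subset.
    labeled_loader = DataLoader(
        TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))),
        batch_size=4)

    model = models.resnet18(num_classes=1000)      # assumed backbone
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # Divide the learning rate by 10 every 30 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(70):                        # 70 fine-tuning epochs
        for images, labels in labeled_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()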
Conclusion
  • The authors present an unsupervised feature learning approach by maximizing distinction between instances via a novel nonparametric softmax formulation.
  • It is motivated by the observation that supervised learning results in apparent image similarity.
  • The authors' experimental results show that the method outperforms the state-of-the-art on image classification on ImageNet and Places, with a compact 128-dimensional representation that scales well with more data and deeper networks.
  • It delivers competitive generalization results on semi-supervised learning and object detection tasks.
Tables
  • Table 1: Top-1 accuracies on CIFAR10, obtained by applying linear SVM or kNN classifiers to the learned features. Our non-parametric softmax outperforms the parametric softmax, and NCE provides a close approximation as m increases (a sketch of the kNN protocol follows this list)
  • Table 2: Top-1 classification accuracies on ImageNet
  • Table 3: Top-1 classification accuracies on Places, based directly on features learned on ImageNet, without any fine-tuning
  • Table 4: Classification performance on ImageNet with ResNet-18 for different embedding feature sizes
  • Table 5: Classification performance of ResNet-18 trained on different amounts of training data
  • Table 6: Object detection performance on PASCAL VOC 2007 test, in terms of mean average precision (mAP), for supervised pretraining methods (marked by †), existing unsupervised methods, and our method
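
To make the kNN protocol behind Table 1 concrete, here is an illustrative reimplementation of a weighted kNN vote over L2-normalized features: each of the top-k training neighbors votes for its class with weight exp(s_i/τ), where s_i is the cosine similarity to the query. The values k = 200 and τ = 0.07 are the ones the paper uses; the function itself is a sketch, not the authors' code.

    import torch

    def weighted_knn_predict(query, bank, bank_labels, num_classes,
                             k=200, tau=0.07):
        """Classify one query by a weighted vote of its k nearest neighbors.

        query:       (d,) L2-normalized feature of the test image
        bank:        (n, d) L2-normalized features of the training set
        bank_labels: (n,) integer class labels of the training images
        """
        sims = bank @ query                  # cosine similarities, shape (n,)
        top_sims, top_idx = sims.topk(k)     # k most similar training images
        weights = (top_sims / tau).exp()     # exp(s_i / tau) vote weights
        votes = torch.zeros(num_classes)
        votes.index_add_(0, bank_labels[top_idx], weights)
        return votes.argmax().item()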
Related work
  • There has been growing interest in unsupervised learning without human-provided labels. Previous works mainly fall into two categories: 1) generative models and 2) self-supervised approaches.

    Generative Models. The primary objective of generative models is to reconstruct the distribution of data as faithfully as possible. Classical generative models include Restricted Boltzmann Machines (RBMs) [12, 39, 21] and autoencoders [40, 20]. The latent features produced by generative models can also help object recognition. Recent approaches such as generative adversarial networks [8, 4] and the variational autoencoder [14] improve both generative quality and feature learning.
Funding
  • This work was supported in part by Berkeley Deep Drive, a Big Data Collaboration Research grant from SenseTime Group (CUHK Agreement No. TS1610626), and the General Research Fund (GRF) of Hong Kong (No. 14236516).
References
  • [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [2] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
  • [3] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. arXiv preprint arXiv:1708.07860, 2017.
  • [4] J. Donahue, P. Krahenbuhl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
  • [5] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, 2014.
  • [6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
  • [7] R. Girshick. Fast R-CNN. In ICCV, 2015.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [9] M. Gutmann and A. Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  • [11] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [12] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
  • [13] D. Jayaraman and K. Grauman. Learning image representations tied to egomotion from unlabeled video. IJCV, 2017.
  • [14] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [15] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012.
  • [16] P. Krahenbuhl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
  • [17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [19] G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, 2017.
  • [20] Q. V. Le. Building high-level features using large scale unsupervised learning. In ICASSP, 2013.
  • [21] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
  • [22] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
  • [23] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, 2011.
  • [24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • [25] A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS, 2013.
  • [26] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In AISTATS, 2005.
  • [27] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
  • [28] M. Noroozi, H. Pirsiavash, and P. Favaro. Representation learning by learning to count. arXiv preprint arXiv:1708.06734, 2017.
  • [29] N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 2014.
  • [30] D. Pathak, R. Girshick, P. Dollar, T. Darrell, and B. Hariharan. Learning features by watching objects move. arXiv preprint arXiv:1612.06370, 2016.
  • [31] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • [32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [33] S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood component analysis. In NIPS, 2004.
  • [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
  • [35] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
  • [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [37] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.
  • [38] K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, 2016.
  • [39] Y. Tang, R. Salakhutdinov, and G. Hinton. Robust Boltzmann machines for recognition and denoising. In CVPR, 2012.
  • [40] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
  • [41] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
  • [42] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV, 2016.
  • [43] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. NormFace: L2 hypersphere embedding for face verification. arXiv preprint arXiv:1704.06369, 2017.
  • [44] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
  • [45] X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. arXiv preprint arXiv:1708.02901, 2017.
  • [46] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. Joint detection and identification feature learning for person search. In CVPR, 2017.
  • [47] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
  • [48] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2017.
  • [49] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.
  • [50] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. arXiv preprint arXiv:1704.07813, 2017.