Parametric Instance Classification for Unsupervised Visual Feature Learning

NeurIPS 2020.

Abstract:

This paper presents parametric instance classification (PIC) for unsupervised visual feature learning. Unlike the state-of-the-art approaches which do instance discrimination in a dual-branch non-parametric fashion, PIC directly performs a one-branch parametric instance classification, revealing a simple framework similar to supervised classification…

Introduction
  • Visual feature learning has long been dominated by supervised image classification tasks, e.g. ImageNet-1K classification.
  • Unsupervised visual feature learning has started to demonstrate transfer performance on par with or superior to supervised approaches on several downstream tasks [15, 5, 22].
  • This is encouraging, as unsupervised visual feature learning could utilize nearly unlimited data without annotations.
Highlights
  • Visual feature learning has long been dominated by supervised image classification tasks, e.g. ImageNet-1K classification
  • For dual-branch approaches, special designs are usually required to address the information leakage issue, e.g. specialized networks [1], specialized BatchNorm layers [15, 5], momentum encoder [15], and limited negative pairs [5]. Unlike these dual-branch non-parametric approaches, this paper presents a framework which solves instance discrimination by direct parametric instance classification (PIC)
  • While directly applying the usual component settings of supervised category classification to PIC results in poor transfer performance, as shown in Table 1, we show that there is no intrinsic limitation in the PIC framework, in contrast to the belief held in previous works [32]
  • We present a simple and effective framework, parametric instance classification (PIC), for unsupervised feature learning
  • By employing several component settings used in other state-of-the-art frameworks, a novel sliding window scheduler to address the extremely infrequent instance visiting issue, and a negative sampling and weight update correction approach to reduce training time and GPU memory consumption, the proposed PIC framework is demonstrated to perform as effectively as state-of-the-art approaches and to be practical for almost unlimited training images (a minimal sketch of the core training step follows this list)
  • While the gain of the cosine soft-max loss over the standard soft-max loss is insignificant in supervised classification, in PIC the cosine soft-max loss performs significantly better than the standard soft-max loss
  • We hope that the PIC framework will serve as a simple baseline to facilitate future study
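As a concrete illustration of the one-branch design highlighted above, the following is a minimal PyTorch-style sketch of a PIC training step with a cosine soft-max instance classifier. The function name, temperature value, and the encoder/head/weights objects are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def pic_step(encoder, head, weights, images, labels, tau=0.2):
    """One PIC training step: a minimal sketch, not the authors' exact code.

    encoder : backbone network (e.g. a ResNet-50 trunk)
    head    : 2-layer MLP projection head
    weights : (N, D) learnable matrix, one classifier weight per training image
    images  : a batch with ONE augmented view per image (single branch)
    labels  : (B,) dataset index of each image, used as its class label
    tau     : temperature of the cosine soft-max (value assumed)
    """
    z = F.normalize(head(encoder(images)), dim=1)  # (B, D) unit-norm embeddings
    w = F.normalize(weights, dim=1)                # (N, D) unit-norm class weights
    logits = z @ w.t() / tau                       # cosine similarities / temperature
    return F.cross_entropy(logits, labels)         # instance classification loss
```

The contrast with dual-branch methods is that only one augmented view per image is encoded per iteration, and negatives come from the classifier weights rather than from a second encoder branch, so no information-leakage fix is needed.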
Methods
  • 2.1 Parametric Instance Classification (PIC) Framework
  • By replacing several usual component settings with the ones used in recent unsupervised frameworks [5, 6], including a cosine soft-max loss, stronger data augmentation and a 2-layer MLP projection head, the transfer performance of the learnt features in the PIC framework is significantly improved.
  • The cosine soft-max loss is commonly used in metric learning approaches [29, 30] and in recent state-of-the-art unsupervised learning frameworks [15, 5].
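A standard formulation of the cosine soft-max loss over N instance classes, in notation assumed here (z_i the projected embedding of image i, w_j the classifier weight of instance j, y_i the instance label of image i, τ a temperature, B the mini-batch), is:

```latex
\mathcal{L}_{\mathrm{PIC}} = -\frac{1}{|B|} \sum_{i \in B}
  \log \frac{\exp\bigl(\cos(\mathbf{w}_{y_i}, \mathbf{z}_i)/\tau\bigr)}
            {\sum_{j=1}^{N} \exp\bigl(\cos(\mathbf{w}_{j}, \mathbf{z}_i)/\tau\bigr)}
```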
Results
  • While the gain of the cosine soft-max loss over the standard soft-max loss is insignificant in supervised classification, in PIC the cosine soft-max loss performs significantly better than the standard soft-max loss.
  • The authors propose two approaches, negative instance sampling and classification weight correction, to significantly reduce training time and GPU memory consumption, making them nearly constant with increasing data size.
  • With 50-epoch pre-training, the sliding-window-based scheduler outperforms the previous epoch-based scheduler by 7.1% (60.4% vs. 53.3%), indicating that the proposed scheduler significantly benefits optimization (a minimal sketch of the scheduler follows this list)
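To make the scheduler concrete, below is a minimal sketch of a sliding-window index scheduler, under the assumption that training consumes overlapping windows of a fixed random instance order, so each instance is revisited roughly window_size / stride times within a short span. The window size, stride, and shuffling rules here are illustrative, not the paper's settings.

```python
import random

def sliding_window_indices(num_instances, window_size, stride):
    """Yield training indices window by window (hypothetical sketch).

    Consecutive windows overlap by (window_size - stride) instances, so each
    instance is visited several times within a short period, unlike an
    epoch-based scheduler that visits it only once per pass over the data.
    """
    order = list(range(num_instances))
    random.shuffle(order)                      # fix one random instance order
    start = 0
    while start < num_instances:
        window = order[start:start + window_size]
        random.shuffle(window)                 # shuffle within the window
        yield from window
        start += stride                        # slide forward by the stride
```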
Conclusion
  • The authors present a simple and effective framework, parametric instance classification (PIC), for unsupervised feature learning.
  • Since this work is about unsupervised pre-training, the learnt representations could be directly adopted in downstream tasks.
  • If the system fails, a randomly initialized model serves as the lower bound of this unsupervised pre-trained model.
  • The pre-trained model may inherit biases from the dataset used for pre-training, but the biases of an unsupervised pre-trained model may be smaller than those of a supervised pre-trained model trained with manual annotations.
Tables
  • Table1: Applying common component settings from other frameworks into PIC
  • Table2: Ablation study on negative instance sampling and classification weight correction (top-1 accuracy, %). A hedged sketch of the sampling-and-correction idea follows this list.

    # neg instances   2^9    2^10   2^12   2^14   2^16   2^18   full
    w/o correction    56.8   57.6   61.0   61.6   62.3   65.6   66.2
    w/ correction     65.5   65.8   66.0   66.1   66.2   66.2   66.2
  • Table3: Ablation study on the hyper-parameters of sliding window scheduler. ∗ denotes the default setting
  • Table4: Ablation study on sliding window w.r.t. different training epochs on ImageNet
  • Table5: Comparison of # augmentations per iteration × # epochs
  • Table6: System-level comparison of linear evaluation protocol with ResNet-50
  • Table7: System-level comparison of semi-supervised classification with ResNet-50 on ImageNet
  • Table8: Comparison on transfer learning with ResNet-50
  • Table9: Top-1 and Top-5 linear classification accuracy of 5 trials
  • Table10: Ablation on sliding window on ImageNet-11K dataset
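The negative instance sampling and classification weight correction ablated in Table 2 can be illustrated with the hedged sketch below. It assumes (i) that the soft-max is computed only over the batch's own labels plus a sampled set of negative instance ids, and (ii) that a classifier row which missed weight-decay updates while unsampled receives a catch-up decay when next touched; both are assumptions for illustration, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def sampled_pic_loss(z, weights, labels, negative_ids, tau=0.2):
    """PIC loss over a sampled subset of the N instance classes (sketch).

    z            : (B, D) L2-normalized embeddings of the current batch
    weights      : (N, D) full instance classifier weight matrix
    labels       : (B,) dataset indices of the batch images (the positives)
    negative_ids : (K,) sampled negative instance indices, e.g. drawn from
                   recently visited instances instead of all N
    """
    ids = torch.unique(torch.cat([labels, negative_ids]))  # sorted unique ids
    w = F.normalize(weights[ids], dim=1)                   # touch only these rows
    logits = z @ w.t() / tau
    targets = torch.searchsorted(ids, labels)              # positions of positives
    return F.cross_entropy(logits, targets)

def catch_up_decay(w_row, steps_skipped, lr, wd):
    """Apply the weight decay a row missed while unsampled (assumed form)."""
    return w_row * (1.0 - lr * wd) ** steps_skipped
```

With correction, Table 2 shows accuracy staying near the full-soft-max 66.2% even with as few as 2^9 sampled negatives, which is what makes the per-iteration cost nearly independent of dataset size.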
Related work
  • Pretext tasks Unsupervised visual feature learning is usually centered on selecting proper self-supervised pretext tasks, where the task targets are automatically generated without human labeling. Researchers have tried various pretext tasks, including context prediction [10], colorization of grayscale images [34], solving jigsaw puzzles [23], the split-brain approach [35], learning to count objects [24], rotation prediction [14], learning to cluster [3], and predicting missing parts [18]. Recently, the attention of this field has mainly shifted to the specific pretext task of instance discrimination, which treats every image as a distinct class [12, 32, 36, 15, 22, 5] and demonstrates performance superior to other pretext tasks. The PIC framework also follows this line by utilizing instance discrimination as its pretext task.

    Non-parametric instance discrimination approaches The state-of-the-art instance discrimination approaches, i.e. SimCLR [5] and MoCo v2 [15, 6], both follow a dual-branch structure in training, where two augmentation views of each image are sampled at every optimization iteration. Instance discrimination is achieved by encouraging agreement between the two views of the same image and dispersing the views of different images (see the formula below).
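The agreement-and-dispersion objective described above is typically instantiated as a contrastive (InfoNCE-style) loss. In notation assumed here, with z_i and z'_i the embeddings of the two augmented views of image i, z_k the embeddings of views of other images (the negatives), and τ a temperature, one common form is:

```latex
\mathcal{L}_{\mathrm{contrastive}} = -\frac{1}{|B|} \sum_{i \in B}
  \log \frac{\exp\bigl(\cos(\mathbf{z}_i, \mathbf{z}'_i)/\tau\bigr)}
            {\exp\bigl(\cos(\mathbf{z}_i, \mathbf{z}'_i)/\tau\bigr)
             + \sum_{k} \exp\bigl(\cos(\mathbf{z}_i, \mathbf{z}_k)/\tau\bigr)}
```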
References
  • Bachman, P., Hjelm, R. D., and Buchwalter, W. (2019). Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pages 15509–15519.
  • Bradley, P. S., Bennett, K. P., and Demiriz, A. (2000). Constrained k-means clustering. Microsoft Research, Redmond, 20(0):0.
  • Caron, M., Bojanowski, P., Joulin, A., and Douze, M. (2018). Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149.
  • Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848.
  • Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020a). A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
  • Chen, X., Fan, H., Girshick, R., and He, K. (2020b). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223.
  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE.
  • Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019). Arcface: Additive angular margin loss for deep face recognition. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Doersch, C., Gupta, A., and Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430.
  • Donahue, J. and Simonyan, K. (2019). Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pages 10541–10551.
  • Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems, pages 766–774.
  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338.
  • Gidaris, S., Singh, P., and Komodakis, N. (2018). Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
  • He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2019a). Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722.
  • He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  • He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., and Li, M. (2019b). Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 558–567.
  • Hénaff, O. J., Razavi, A., Doersch, C., Eslami, S., and Oord, A. v. d. (2019). Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272.
  • Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get m for free.
  • Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. (2019). Big transfer (BiT): General visual representation learning.
  • Long, J., Shelhamer, E., and Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440.
  • Misra, I. and van der Maaten, L. (2019). Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991.
  • Noroozi, M. and Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer.
  • Noroozi, M., Pirsiavash, H., and Favaro, P. (2017). Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision, pages 5898–5906.
  • Qiao, S., Wang, H., Liu, C., Shen, W., and Yuille, A. (2019). Weight standardization.
  • Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99.
  • Tian, Y., Krishnan, D., and Isola, P. (2019). Contrastive multiview coding. arXiv preprint arXiv:1906.05849.
  • Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., and Belongie, S. (2018). The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778.
  • Wang, F., Xiang, X., Cheng, J., and Yuille, A. L. (2017). Normface. Proceedings of the 2017 ACM on Multimedia Conference - MM ’17.
  • Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., and Liu, W. (2018). Cosface: Large margin cosine loss for deep face recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., and Girshick, R. (2019). Detectron2. https://github.com/facebookresearch/detectron2.
  • Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. (2018). Unsupervised feature learning via nonparametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742.
  • Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Zhang, Z., Lin, H., Sun, Y., He, T., Mueller, J., Manmatha, R., Li, M., and Smola, A. (2020). Resnest: Split-attention networks.
  • Zhang, R., Isola, P., and Efros, A. A. (2016). Colorful image colorization. In European conference on computer vision, pages 649–666. Springer.
  • Zhang, R., Isola, P., and Efros, A. A. (2017). Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1058–1067.
  • Zhuang, C., Zhai, A. L., and Yamins, D. (2019). Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 6002–6012.