Prototypical Contrastive Learning of Unsupervised Representations


Abstract:

This paper presents Prototypical Contrastive Learning (PCL), an unsupervised representation learning method that addresses the fundamental limitations of the popular instance-wise contrastive learning. PCL implicitly encodes semantic structures of the data into the learned embedding space, and prevents the network from solely relying on low-level features for solving the instance discrimination task.

Introduction
  • Unsupervised visual representation learning aims to learn image representations from pixels themselves without relying on semantic annotations, and recent advances are largely driven by instance discrimination tasks [1, 2, 3, 4, 5, 6, 7].
  • Instance-wise contrastive learning leads to an embedding space where all instances are well-separated and each instance is locally smooth.
  • Despite their improved performance, instance discrimination based methods share a common fundamental weakness: the semantic structure of the data is not encoded by the learned representations.
  • This problem arises because instance-wise contrastive learning treats two samples as a negative pair as long as they come from different instances, regardless of their semantic similarity.
  • The problem is magnified by the fact that thousands of negative samples are generated to form the contrastive loss, leading to many negative pairs that share similar semantics but are undesirably pushed apart in the embedding space (a minimal sketch of this loss follows below).
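To make the weakness above concrete, here is a minimal sketch of an instance-wise contrastive (InfoNCE-style) loss for a single query in PyTorch. It is illustrative only: the tensor shapes, temperature value, and function name are assumptions rather than the paper's implementation, but it shows how every other instance is scored as a negative regardless of semantic similarity.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, temperature=0.07):
    """Instance-wise contrastive (InfoNCE) loss for one query.

    query:     (D,)   embedding of an augmented view of an image
    positive:  (D,)   embedding of another view of the SAME image
    negatives: (N, D) embeddings of OTHER images, treated as negatives
                      regardless of their semantic similarity to the query
    """
    query = F.normalize(query, dim=0)
    positive = F.normalize(positive, dim=0)
    negatives = F.normalize(negatives, dim=1)

    l_pos = torch.dot(query, positive) / temperature       # scalar
    l_neg = negatives @ query / temperature                 # (N,)
    logits = torch.cat([l_pos.unsqueeze(0), l_neg])         # (1+N,)
    # the positive pair sits at index 0; all other instances are pushed apart
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```

Because the `negatives` tensor is drawn from other images at random, two images of the same class can end up treated as a negative pair, which is exactly the failure mode PCL targets.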
Highlights
  • Unsupervised visual representation learning aims to learn image representations from pixels themselves without relying on semantic annotations, and recent advances are largely driven by instance discrimination tasks [1, 2, 3, 4, 5, 6, 7]
  • We propose prototypical contrastive learning (PCL), a new framework for unsupervised representation learning that implicitly encodes the semantic structure of data into the embedding space
  • This paper proposes Prototypical Contrastive Learning, a generic unsupervised representation learning framework that finds network parameters to maximize the log-likelihood of the observed data
  • Prototypical Contrastive Learning learns an embedding space that encodes the semantic structure of data by training on the proposed ProtoNCE loss (see the sketch after this list)
  • Our extensive experiments on multiple benchmarks demonstrate the state-of-the-art performance of Prototypical Contrastive Learning for unsupervised representation learning
  • Our research advances unsupervised representation learning, especially for computer vision, and alleviates the need for expensive human annotation when training deep neural network models
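The highlights describe an EM-style procedure: an E-step that clusters features into prototypes and an M-step that trains with a prototype-level contrastive term. The sketch below is a simplified, assumed reading of that idea (spherical k-means for the E-step, a cross-entropy over prototype similarities as a stand-in for the prototype term); the paper's ProtoNCE loss also retains an instance-wise term and estimates a per-prototype concentration, which this sketch collapses into a single temperature.

```python
import torch
import torch.nn.functional as F

def spherical_kmeans(feats, k, iters=20):
    """E-step sketch: cluster L2-normalized features into k prototypes."""
    feats = F.normalize(feats, dim=1)
    centroids = feats[torch.randperm(feats.size(0))[:k]].clone()
    for _ in range(iters):
        assign = (feats @ centroids.t()).argmax(dim=1)   # nearest prototype by cosine similarity
        for c in range(k):
            members = feats[assign == c]
            if len(members) > 0:
                centroids[c] = F.normalize(members.mean(dim=0), dim=0)
    return centroids, assign

def prototype_contrastive_term(feats, prototypes, assignments, temperature=0.1):
    """M-step sketch: pull each sample towards its assigned prototype and away
    from all other prototypes (simplified stand-in for the ProtoNCE prototype term)."""
    feats = F.normalize(feats, dim=1)
    logits = feats @ prototypes.t() / temperature        # (B, K)
    return F.cross_entropy(logits, assignments)
```

A plausible usage pattern: periodically recompute `prototypes, assignments = spherical_kmeans(all_features, k)` from slowly evolving features, then add `prototype_contrastive_term(...)` to the usual instance-wise loss during training.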
Methods
  • Methods compared: random initialization, supervised ResNet-50, Jigsaw [24, 36], MoCo [3], and PCL.
  • The authors perform semi-supervised learning experiments to evaluate whether the learned representation can provide a good basis for fine-tuning.
  • Following the setup from [1, 4], the authors randomly select a subset (1% or 10%) of the ImageNet training data and fine-tune the self-supervised pretrained model on these subsets (see the sketch after this list).
  • The authors' method sets a new state-of-the-art under 200 training epochs, outperforming both self-supervised learning methods and semi-supervised learning methods.
  • Baselines compared: supervised training [36], Jigsaw [24, 36], MoCo [3], and PCL.
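A rough sketch of the semi-supervised fine-tuning protocol referenced above, assuming a torchvision-style ResNet-50 whose classification head is the `fc` attribute; the subset fraction, optimizer settings, and epoch count are illustrative placeholders rather than the paper's exact recipe.

```python
import random
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset

def finetune_on_subset(model, dataset, fraction=0.01, num_classes=1000, epochs=20):
    """Fine-tune a self-supervised pretrained encoder on a small labeled subset."""
    # keep a random fraction (e.g. 1% or 10%) of the labeled training images
    indices = random.sample(range(len(dataset)), int(fraction * len(dataset)))
    loader = DataLoader(Subset(dataset, indices), batch_size=256, shuffle=True)

    # replace the projection/classification head with a fresh classifier
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```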
Results
  • The authors' method significantly outperforms previous methods while requiring fewer neighbors (20 compared to 200 in [1, 10]); a sketch of this kNN evaluation follows below.
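A minimal sketch of a kNN classifier over frozen features, matching the small neighborhood mentioned above (k = 20). The majority-vote rule and cosine similarity here are assumptions for illustration; the actual evaluation may weight votes differently.

```python
import torch
import torch.nn.functional as F

def knn_predict(query_feats, bank_feats, bank_labels, k=20, num_classes=1000):
    """Classify query features by majority vote over their k nearest neighbors."""
    query_feats = F.normalize(query_feats, dim=1)   # (Q, D) features to classify
    bank_feats = F.normalize(bank_feats, dim=1)     # (N, D) frozen training features
    sims = query_feats @ bank_feats.t()             # cosine similarities (Q, N)
    _, idx = sims.topk(k, dim=1)                    # indices of the k nearest neighbors
    neighbor_labels = bank_labels[idx]              # (Q, k) labels of those neighbors
    votes = torch.zeros(query_feats.size(0), num_classes)
    votes.scatter_add_(1, neighbor_labels,
                       torch.ones_like(neighbor_labels, dtype=torch.float))
    return votes.argmax(dim=1)                      # predicted class per query
```

In practice the bank features would be the frozen training-set embeddings and `query_feats` the validation embeddings; top-1 accuracy is then the fraction of predictions matching the true labels.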
Conclusion
  • This paper proposed Prototypical Contrastive Learning, a generic unsupervised representation learning framework that finds network parameters to maximize the log-likelihood of the observed data.
  • The authors' extensive experiments on multiple benchmarks demonstrate the state-of-the-art performance of PCL for unsupervised representation learning.
  • The authors' research advances unsupervised representation learning, especially for computer vision, and alleviates the need for expensive human annotation when training deep neural network models.
  • Unsupervised representation learning places heavy demands on computational resources during the pretraining stage, which can be costly both financially and environmentally.
  • As part of these efforts, the authors will release the pretrained models to facilitate future research in downstream applications without expensive retraining.
Tables
  • Table 1: Low-shot image classification on both VOC07 and Places205 datasets using linear SVMs trained on fixed representations. All methods were pretrained on the ImageNet-1M dataset (except for Jigsaw [24, 36], which was trained on ImageNet-14M). We vary the number of labeled examples k and report the mAP (for VOC) and accuracy (for Places) across 5 runs. Results for Jigsaw are taken from [36]. We use the released pretrained model for MoCo and re-implement SimCLR. MoCo, SimCLR, and PCL are trained for the same number of epochs (200). A toy sketch of this low-shot SVM protocol follows the table list.
  • Table 2: Semi-supervised learning on ImageNet. We report top-5 accuracy on the ImageNet validation set for self-supervised models fine-tuned on 1% or 10% of the labeled data. We use the released pretrained model for MoCo and re-implement SimCLR; all other numbers are adopted from the corresponding papers.
  • Table 3: Image classification with linear models. We report top-1 accuracy. Numbers with ∗ are from released pretrained models; all other numbers are adopted from the corresponding papers. †: LocalAgg and SelfLabel use 10-crop evaluation. CMC and AMDIM use FastAutoAugment [43], which is supervised by ImageNet labels. SimCLR requires a large batch size of 4096 allocated on 128 TPUs.
  • Table 4: Image classification with kNN classifiers using ResNet-50 features on ImageNet. We report top-1 accuracy. Results for [1, 10] are taken from the corresponding papers. The result for MoCo is from the released model.
  • Table 5: Object detection with a frozen conv body on VOC using Faster R-CNN. We report the average mAP@0.5 on the VOC07 test set across three runs.
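Assuming features have already been extracted with the frozen pretrained backbone, a toy version of the low-shot linear-SVM protocol described for Table 1 could look like the following; scikit-learn's LinearSVC stands in for the SVM solver, and the per-class count k, cost value, and seed are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def low_shot_svm(train_feats, train_labels, test_feats, k=16, cost=0.5, seed=0):
    """Train a linear SVM on k labeled examples per class over frozen features."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(train_labels):
        idx = np.flatnonzero(train_labels == c)
        keep.extend(rng.choice(idx, size=min(k, len(idx)), replace=False))
    keep = np.array(keep)

    clf = LinearSVC(C=cost, max_iter=2000)
    clf.fit(train_feats[keep], train_labels[keep])   # fit only on the low-shot subset
    return clf.predict(test_feats)                   # predictions on held-out features
```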
Related work
  • Our work is closely related to two main branches of studies in unsupervised/self-supervised learning: instance-wise contrastive learning and deep unsupervised clustering. Instance-wise Contrastive Learning. At the core of state-of-the-art unsupervised representation learning algorithms [1, 2, 3, 4, 10, 5, 6, 7, 8], instance-wise contrastive learning aims to learn an embedding space where samples (e.g. crops) from the same instance (e.g. an image) are pulled closer and samples from different instances are pushed apart. To construct the contrastive loss for a mini-batch of samples, positive and negative instance features are generated for each sample. Contrastive learning methods differ in their strategy for generating these instance features. The memory bank approach [1] stores the features of all samples calculated in the previous step and selects from the memory bank to form positive and negative pairs. The end-to-end approach [2, 7, 8] generates instance features using all samples within the current mini-batch and applies the same encoder to both the original samples and their augmented versions. Recently, the momentum encoder (MoCo) approach [3] was proposed, which encodes samples on-the-fly with a momentum-updated encoder and maintains a queue of instance features (a sketch of this mechanism follows below).
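For concreteness, here is a compact sketch of the momentum-encoder-and-queue mechanism summarized above. The momentum coefficient, feature dimension, and queue size are illustrative values, and `encoder_q`/`encoder_k` are assumed to be two networks with identical architecture; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Slowly move the key encoder towards the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

class FeatureQueue:
    """Fixed-size FIFO queue of key features used as negatives."""
    def __init__(self, dim=128, size=65536):
        self.queue = F.normalize(torch.randn(size, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):
        """keys: (B, dim), already L2-normalized; overwrite the oldest entries."""
        b = keys.size(0)
        idx = (self.ptr + torch.arange(b)) % self.queue.size(0)
        self.queue[idx] = keys
        self.ptr = (self.ptr + b) % self.queue.size(0)
```

During training, keys produced by `encoder_k` for the current batch are enqueued as future negatives, while `momentum_update` keeps `encoder_k` a slowly moving average of `encoder_q`.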
Reference
  • [1] Wu, Z., Y. Xiong, S. X. Yu, et al. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pages 3733–3742, 2018.
  • [2] Ye, M., X. Zhang, P. C. Yuen, et al. Unsupervised embedding learning via invariant and spreading instance feature. In CVPR, pages 6210–6219, 2019.
  • [3] He, K., H. Fan, Y. Wu, et al. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  • [4] Misra, I., L. van der Maaten. Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991, 2019.
  • [5] Hjelm, R. D., A. Fedorov, S. Lavoie-Marchildon, et al. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
  • [6] Oord, A. v. d., Y. Li, O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • [7] Tian, Y., D. Krishnan, P. Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
  • [8] Chen, T., S. Kornblith, M. Norouzi, et al. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  • [9] Gutmann, M., A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, pages 297–304, 2010.
  • [10] Zhuang, C., A. L. Zhai, D. Yamins. Local aggregation for unsupervised learning of visual embeddings. In ICCV, pages 6002–6012, 2019.
  • [11] Tschannen, M., J. Djolonga, P. K. Rubenstein, et al. On mutual information maximization for representation learning. In ICLR, 2020.
  • [12] Saunshi, N., O. Plevrakis, S. Arora, et al. A theoretical analysis of contrastive unsupervised representation learning. In ICML, pages 5628–5637, 2019.
  • [13] Xie, J., R. B. Girshick, A. Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, pages 478–487, 2016.
  • [14] Yang, J., D. Parikh, D. Batra. Joint unsupervised learning of deep representations and image clusters. In CVPR, pages 5147–5156, 2016.
  • [15] Liao, R., A. G. Schwing, R. S. Zemel, et al. Learning deep parsimonious representations. In NIPS, pages 5076–5084, 2016.
  • [16] Yang, B., X. Fu, N. D. Sidiropoulos, et al. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In ICML, pages 3861–3870, 2017.
  • [17] Chang, J., L. Wang, G. Meng, et al. Deep adaptive image clustering. In ICCV, pages 5880–5888, 2017.
  • [18] Ji, X., J. F. Henriques, A. Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In ICCV, pages 9865–9874, 2019.
  • [19] Caron, M., P. Bojanowski, A. Joulin, et al. Deep clustering for unsupervised learning of visual features. In ECCV, pages 139–156, 2018.
  • [20] Pathak, D., P. Krähenbühl, J. Donahue, et al. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
  • [21] Zhang, R., P. Isola, A. A. Efros. Colorful image colorization. In ECCV, pages 649–666, 2016.
  • [22] Zhang, R., P. Isola, A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, pages 1058–1067, 2017.
  • [23] Doersch, C., A. Gupta, A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.
  • [24] Noroozi, M., P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84, 2016.
  • [25] Dosovitskiy, A., J. T. Springenberg, M. A. Riedmiller, et al. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, pages 766–774, 2014.
  • [26] Gidaris, S., P. Singh, N. Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  • [27] Caron, M., P. Bojanowski, J. Mairal, et al. Unsupervised pre-training of image features on non-curated data. In ICCV, pages 2959–2968, 2019.
  • [28] Zhang, L., G. Qi, L. Wang, et al. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In CVPR, 2019.
  • [29] Deng, J., W. Dong, R. Socher, et al. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [30] Ross, B. C. Mutual information between discrete and continuous data sets. PLoS ONE, 9(2), 2014.
  • [31] Snell, J., K. Swersky, R. S. Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4077–4087, 2017.
  • [32] Chen, X., H. Fan, R. Girshick, et al. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  • [33] Hénaff, O. J., A. Razavi, C. Doersch, et al. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
  • [34] He, K., X. Zhang, S. Ren, et al. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [35] Johnson, J., M. Douze, H. Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.
  • [36] Goyal, P., D. Mahajan, A. Gupta, et al. Scaling and benchmarking self-supervised visual representation learning. In ICCV, pages 6391–6400, 2019.
  • [37] Zhou, B., À. Lapedriza, J. Xiao, et al. Learning deep features for scene recognition using places database. In NIPS, pages 487–495, 2014.
  • [38] Everingham, M., L. V. Gool, C. K. I. Williams, et al. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • [39] Zhai, X., A. Oliver, A. Kolesnikov, et al. S4L: Self-supervised semi-supervised learning. In ICCV, pages 1476–1485, 2019.
  • [40] Miyato, T., S. Maeda, M. Koyama, et al. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell., 41(8):1979–1993, 2019.
  • [41] Donahue, J., K. Simonyan. Large scale adversarial representation learning. In NeurIPS, pages 10541–10551, 2019.
  • [42] Asano, Y. M., C. Rupprecht, A. Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
  • [43] Lim, S., I. Kim, T. Kim, et al. Fast AutoAugment. In NeurIPS, pages 6662–6672, 2019.
  • [44] Ren, S., K. He, R. B. Girshick, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  • [45] He, K., R. B. Girshick, P. Dollár. Rethinking ImageNet pre-training. arXiv preprint arXiv:1811.08883, 2018.
  • [46] Fan, R., K. Chang, C. Hsieh, et al. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
  • [47] Chen, K., J. Wang, J. Pang, et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
  • [48] Nguyen, X. V., J. Epps, J. Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res., 11:2837–2854, 2010.
  • [49] Maaten, L. v. d., G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.