Momentum Contrast for Unsupervised Visual Representation Learning

CVPR, pp. 9729-9738, 2020.

Keywords:
pre-training, pretext task, unsupervised learning, semantic segmentation, unsupervised visual representation

Abstract:

We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning.
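The two ingredients named above can be made concrete with a short sketch in the spirit of the paper's Algorithm 1: a FIFO queue of encoded keys serves as the dictionary, and the key encoder is updated as a momentum (moving) average of the query encoder rather than by back-propagation. The sketch below is illustrative PyTorch, not the authors' released code; f_q, f_k, queue, and moco_step are hypothetical names, while the momentum m=0.999 and temperature τ=0.07 are the defaults the paper reports.

    import torch
    import torch.nn.functional as F

    m, tau = 0.999, 0.07  # momentum and temperature defaults reported in the paper

    def moco_step(f_q, f_k, queue, x_q, x_k):
        """One training step. x_q, x_k: two augmented views of the same images.
        f_q, f_k: query/key encoders of identical architecture; queue: C x K keys."""
        q = F.normalize(f_q(x_q), dim=1)                        # queries: N x C
        with torch.no_grad():                                   # no gradient to the key encoder
            # momentum update: theta_k <- m * theta_k + (1 - m) * theta_q
            for p_k, p_q in zip(f_k.parameters(), f_q.parameters()):
                p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
            k = F.normalize(f_k(x_k), dim=1)                    # keys: N x C
        l_pos = torch.einsum('nc,nc->n', q, k).unsqueeze(-1)    # positive logits: N x 1
        l_neg = torch.einsum('nc,ck->nk', q, queue)             # negatives vs. queue: N x K
        logits = torch.cat([l_pos, l_neg], dim=1) / tau
        labels = torch.zeros(logits.shape[0], dtype=torch.long) # positives are index 0
        loss = F.cross_entropy(logits, labels)
        queue = torch.cat([queue, k.t()], dim=1)[:, k.shape[0]:]  # enqueue new, dequeue oldest
        return loss, queue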

Introduction
  • Unsupervised representation learning is highly successful in natural language processing, e.g., as shown by GPT [50, 51] and BERT [12].
  • Supervised pre-training is still dominant in computer vision, where unsupervised methods generally lag behind.
  • Language tasks have discrete signal spaces for building tokenized dictionaries, on which unsupervised learning can be based.
  • Computer vision, in contrast, further concerns dictionary building [54, 9, 5], as the raw signal is in a continuous, high-dimensional space and is not structured for human communication (e.g., unlike words).
  • Learning is formulated as minimizing a contrastive loss [29]
Highlights
  • Unsupervised representation learning is highly successful in natural language processing, e.g., as shown by GPT [50, 51] and BERT [12]
  • Several recent studies [61, 46, 36, 66, 35, 56, 2] present promising results on unsupervised visual representation learning using approaches related to the contrastive loss [29]
  • Momentum Contrast is on par on Cityscapes instance segmentation, and lags behind on VOC semantic segmentation; we show another comparable case on iNaturalist [57] in the appendix
  • Momentum Contrast has largely closed the gap between unsupervised and supervised representation learning in multiple vision tasks
  • In all these tasks, Momentum Contrast pre-trained on IG-1B is consistently better than Momentum Contrast pre-trained on IN-1M
  • As ImageNet itself is the downstream task here, the case of Momentum Contrast pre-trained on IN-1M does not represent a real scenario
Methods
  • Contrastive Learning as Dictionary Look-up.
  • Contrastive learning [29], and its recent developments, can be thought of as training an encoder for a dictionary look-up task, as described next.
  • A contrastive loss [29] is a function whose value is low when q is similar to its positive key k+ and dissimilar to all other keys.
  • With similarity measured by dot product, a form of contrastive loss function, called InfoNCE [46], is considered in this paper: Lq = −log [ exp(q·k+/τ) / Σ_{i=0}^{K} exp(q·k_i/τ) ], where τ is a temperature hyper-parameter and the sum runs over one positive key k+ and K negative keys
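This loss can be read as the log loss of a (K+1)-way softmax classifier whose correct class is k+. A minimal NumPy sketch (the function and argument names are illustrative, not from the paper):

    import numpy as np

    def info_nce(q, k_pos, k_negs, tau=0.07):
        """InfoNCE: -log( exp(q.k+/tau) / sum_i exp(q.k_i/tau) ), with k_0 = k+."""
        logits = np.concatenate([[q @ k_pos], q @ np.asarray(k_negs).T]) / tau
        logits -= logits.max()                  # shift for numerical stability
        return -np.log(np.exp(logits[0]) / np.exp(logits).sum())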
Results
  • Table 5 shows the results on COCO with the FPN (Table 5a, b) and C4 (Table 5c, d) backbones.
  • With the 1× schedule, all models are heavily under-trained, as indicated by the ∼2-point gaps to the 2× schedule cases.
  • With the 2× schedule, MoCo is better than its ImageNet supervised counterpart in all metrics in both backbones.
  • Table 6 shows more downstream tasks.
  • On these tasks, MoCo performs competitively with its ImageNet supervised pre-training counterpart and clearly outperforms random initialization.
Conclusion
  • Discussion and Conclusion

    The authors' method has shown positive results of unsupervised learning in a variety of computer vision tasks and datasets.
  • MoCo has largely closed the gap between unsupervised and supervised representation learning in multiple vision tasks.
  • In all these tasks, MoCo pre-trained on IG-1B is consistently better than MoCo pre-trained on IN-1M.
  • This shows that MoCo can perform well on this large-scale, relatively uncurated dataset.
Tables
  • Table1: Comparison under the linear classification protocol on ImageNet. The figure visualizes the table. All are reported as unsupervised pre-training on the ImageNet-1M training set, followed by supervised linear classification trained on frozen features, evaluated on the validation set. The parameter counts are those of the feature extractors. We compare with improved reimplementations if available (referenced after the numbers). A minimal sketch of this linear-probe protocol appears after this table list
  • Table2: Object detection fine-tuned on PASCAL VOC trainval07+12. Evaluation is on test2007: AP50 (default VOC metric), AP (COCO-style), and AP75, averaged over 5 trials. All are fine-tuned for 24k iterations (∼23 epochs). In the brackets are the gaps to the ImageNet supervised pre-training counterpart. In green are the gaps of at least +0.5 point
  • Table3: Comparison of three contrastive loss mechanisms on
  • Table4: Comparison with previous methods on object detection fine-tuned on PASCAL VOC trainval2007. Evaluation is on test2007
  • Table5: Object detection and instance segmentation fine-tuned on COCO: bounding-box AP (APbb) and mask AP (APmk) evaluated on val2017. In the brackets are the gaps to the ImageNet supervised pre-training counterpart. In green are the gaps of at least +0.5 point
  • Table6: MoCo vs. ImageNet supervised pre-training, finetuned on various tasks. For each task, the same architecture and schedule are used for all entries (see appendix). In the brackets are the gaps to the ImageNet supervised pre-training counterpart. In green are the gaps of at least +0.5 point. †: this entry is with BN frozen, which improves results; see main text
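As flagged in the Table 1 caption, the linear classification protocol trains only a linear classifier on frozen, pre-trained features. A minimal sketch under stated assumptions: encoder stands in for the MoCo pre-trained feature extractor (here an untrained torchvision ResNet-50 trunk as a placeholder), train_loader is a hypothetical ImageNet loader, and the unusually large initial learning rate of 30 follows what the paper reports for this probe.

    import torch
    import torch.nn as nn
    import torchvision

    # Placeholder trunk; in the protocol these weights come from MoCo pre-training.
    encoder = torchvision.models.resnet50()
    encoder.fc = nn.Identity()                      # expose 2048-d global features
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False                     # features stay frozen

    classifier = nn.Linear(2048, 1000)              # the only trainable layer
    optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0, momentum=0.9)

    for images, labels in train_loader:             # hypothetical ImageNet loader
        with torch.no_grad():
            feats = encoder(images)                 # N x 2048 frozen features
        loss = nn.functional.cross_entropy(classifier(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()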
Related work
  • Unsupervised/self-supervised learning methods generally involve two aspects: pretext tasks and loss functions. The term “pretext” implies that the task being solved is not of genuine interest, but is solved only for the true purpose of learning a good data representation. Loss functions can often be investigated independently of pretext tasks. MoCo focuses on the loss function aspect. Next we discuss related studies with respect to these two aspects.

    Loss functions. A common way of defining a loss function is to measure the difference between a model’s prediction and a fixed target, such as reconstructing the input pixels (e.g., auto-encoders) by L1 or L2 losses, or classifying the input into pre-defined categories (e.g., eight positions [13], color bins [64]) by cross-entropy or margin-based losses. Other alternatives, as described next, are also possible.
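To make the fixed-target notion concrete, here is a minimal sketch of the two cases above, with all tensors hypothetical: an L2 reconstruction loss whose target is the input pixels, and a cross-entropy loss over pre-defined categories (e.g., quantized color bins as in [64]).

    import torch
    import torch.nn.functional as F

    x = torch.rand(8, 3, 32, 32)          # input images (hypothetical batch)
    x_hat = torch.rand(8, 3, 32, 32)      # an auto-encoder's reconstruction of x
    recon_loss = F.mse_loss(x_hat, x)     # L2 loss: the fixed target is the pixels

    logits = torch.randn(8, 313)          # predictions over pre-defined color bins
    bins = torch.randint(0, 313, (8,))    # fixed categorical targets
    cls_loss = F.cross_entropy(logits, bins)

By contrast, the contrastive losses that MoCo builds on measure similarity between pairs of representations, so the target varies on-the-fly during training rather than being fixed in advance.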
Contributions
  • Presents Momentum Contrast for unsupervised visual representation learning
  • Presents Momentum Contrast as a way of building large and consistent dictionaries for unsupervised learning with a contrastive loss
  • Shows that in 7 downstream tasks related to detection or segmentation, MoCo unsupervised pre-training can surpass its ImageNet supervised counterpart, in some cases by nontrivial margins
  • Explores MoCo pre-trained on ImageNet or on a one-billion Instagram image set, demonstrating that MoCo can work well in a more real-world, billion-image-scale, and relatively uncurated scenario
Reference
  • [1] Rıza Alp Guler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In CVPR, 2018.
  • [2] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv:1906.00910, 2019.
  • [3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
  • [4] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.
  • [5] Ken Chatfield, Victor Lempitsky, Andrea Vedaldi, and Andrew Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
  • [6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2017.
  • [7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv:2002.05709, 2020.
  • [8] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv:2003.04297, 2020.
  • [9] Adam Coates and Andrew Ng. The importance of encoding versus training with sparse coding and vector quantization. In ICML, 2011.
  • [10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • [13] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
  • [14] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In ICCV, 2017.
  • [15] Jeff Donahue, Philipp Krahenbuhl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.
  • [16] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. arXiv:1907.02544, 2019.
  • [17] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NeurIPS, 2014.
  • [18] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, 2010.
  • [19] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  • [20] Ross Girshick. Fast R-CNN. In ICCV, 2015.
  • [21] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [22] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollar, and Kaiming He. Detectron, 2018.
  • [23] Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In NeurIPS, 2017.
  • [24] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [25] Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
  • [26] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019.
  • [27] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
  • [28] Michael Gutmann and Aapo Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
  • [29] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
  • [30] Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011.
  • [31] Kaiming He, Ross Girshick, and Piotr Dollar. Rethinking ImageNet pre-training. In ICCV, 2019.
  • [32] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • [33] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • [35] Olivier J Henaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv:1905.09272, 2019. Updated version accessed at https://openreview.net/pdf?id=rJerHlrYwH.
  • [36] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
  • [37] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [38] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, 2019.
  • [39] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  • [40] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast AutoAugment. arXiv:1905.00397, 2019.
  • [41] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [42] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [43] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [44] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
  • [45] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
  • [46] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
  • [47] Deepak Pathak, Ross Girshick, Piotr Dollar, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
  • [48] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • [49] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. MegDet: A large mini-batch object detector. In CVPR, 2018.
  • [50] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
  • [51] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  • [52] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [53] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [54] Josef Sivic and Andrew Zisserman. Video Google: a text retrieval approach to object matching in videos. In ICCV, 2003.
  • [55] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 2016.
  • [56] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv:1906.05849, 2019. Updated version accessed at https://openreview.net/pdf?id=BkgStySKPB.
  • [57] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018.
  • [58] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
  • [59] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
  • [60] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
  • [61] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018. Updated version accessed at https://arxiv.org/abs/1805.01978v1.
  • [62] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • [63] Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In CVPR, 2019.
  • [64] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
  • [65] Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2017.
  • [66] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In ICCV, 2019. Additional results accessed from supplementary materials.