On Mutual Information in Contrastive Learning for Visual Representations

Mike Wu
Milan Mosse

Abstract:

In recent years, several unsupervised, "contrastive" learning algorithms in vision have been shown to learn representations that perform remarkably well on transfer tasks. We show that this family of algorithms maximizes a lower bound on the mutual information between two or more "views" of an image; typical views come from a composition...

Introduction
  • While supervised learning algorithms have given rise to human-level performance in several visual tasks [14, 13, 7], they require exhaustively labelled data, posing a barrier to widespread adoption.
  • The authors have seen the growth of several approaches to unsupervised learning from the vision community [18, 19, 6, 1, 2], where the aim is to uncover vector representations that are “semantically” meaningful, as measured by performance on a variety of downstream visual tasks, e.g., classification or object detection.
  • The core machinery behind these unsupervised algorithms is a basic concept: treat every example as its own label and perform classification as in the usual setting, the intuition being that a good representation should be able to discriminate between different examples (a minimal sketch of this objective follows below).
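The sketch below illustrates this idea with an InfoNCE-style instance-discrimination loss: the embedding of one view of an image must score its own other view above negatives drawn from other images. The function name, temperature, and the way negatives are supplied are illustrative assumptions, not the paper's exact formulation; minimizing this loss maximizes a lower bound on the mutual information between the two views, which is the paper's unifying observation.

```python
import torch
import torch.nn.functional as F

def instance_discrimination_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE-style loss: treat each example as its own class, so the query
    must pick out its own view (index 0) among 1 + K candidates.

    query:     (D,)   embedding of one view of an image
    positive:  (D,)   embedding of another view of the same image
    negatives: (K, D) embeddings of views of other images
    """
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (query @ positive).view(1) / temperature      # (1,)
    neg_logits = (negatives @ query) / temperature             # (K,)
    logits = torch.cat([pos_logit, neg_logits]).unsqueeze(0)   # (1, 1+K)

    target = torch.zeros(1, dtype=torch.long)  # the "correct class" is the example itself
    return F.cross_entropy(logits, target)
```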
Highlights
  • While supervised learning algorithms have given rise to human-level performance in several visual tasks [14, 13, 7], they require exhaustively labelled data, posing a barrier to widespread adoption
  • Later algorithms build on this basic concept either through (1) technical innovations to circumvent numerical instability [18], (2) storage innovations to hold a large number of examples in memory [6], (3) choices of data augmentation [19, 16], or (4) improvements in compute or hyperparameter choices [1]
  • We have presented an interpretation of representation learning based on mutual information between image views
  • This formulation led to a more systematic understanding of a family of existing approaches
  • We uncovered that the choices of views and negative sample distribution strongly influence the performance of contrastive learning (a view-sampling sketch follows this list)
  • We see similar gains on ImageNet, where Contrastive Multiview Coding (CMC) combined with ANN+ (CMC-ANN+) reaches an accuracy of 50.5%, roughly 2% higher than CMC and Local Aggregation (LA)
  • Visual algorithms like Instance Discrimination (IR) and LA no longer look very different from masked language modeling, as both families are unified under mutual information
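The "views" mentioned above are typically produced by composing stochastic image augmentations. Below is a small sketch of such a view-sampling pipeline using torchvision; the particular augmentations and their parameters are illustrative choices, not the exact pipeline evaluated in the paper.

```python
from torchvision import transforms

# A composition of stochastic augmentations; two independent samples of this
# pipeline applied to the same image yield two "views" that the contrastive
# objective treats as a positive pair. The specific transforms are assumptions.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def two_views(pil_image):
    """Return two independently augmented views of the same image."""
    return augment(pil_image), augment(pil_image)
```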
Methods
  • The theory suggests that representation quality increases in the order IR, BALL, ANN, LA (a negative-sampling sketch follows this list).
  • The authors fit each of these algorithms on ImageNet and CIFAR10.
  • Fig. 4 shows the nearest neighbor classification accuracy on a test set throughout training.
  • Table 2 shows transfer classification performance: accuracy of logistic regression trained using the frozen representations learned by each of the unsupervised algorithms.
  • The authors follow the training paradigms in prior works [18, 19, 16, 1] and standardize hyperparameters across models.
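The ordering above reflects increasingly restrictive negative-sample distributions: IR draws negatives uniformly from a memory bank, while the other variants concentrate on harder negatives that lie closer to the current representation. The sketch below only contrasts these two extremes; the function name, the similarity measure, and the size of the "hard" pool are assumptions, and the precise definitions of BALL, ANN, and LA are given in the paper.

```python
import torch
import torch.nn.functional as F

def sample_negative_indices(query, memory_bank, k=4096, mode="uniform", hard_frac=0.1):
    """Draw k negative indices from a memory bank of embeddings.

    mode="uniform": IR-style negatives, drawn uniformly over the whole bank.
    mode="hard":    negatives drawn only from the fraction of the bank most
                    similar to the query, i.e. harder negatives.
    """
    bank = F.normalize(memory_bank, dim=-1)  # (N, D)
    q = F.normalize(query, dim=-1)           # (D,)

    if mode == "uniform":
        return torch.randint(len(bank), (k,))

    sims = bank @ q                                  # (N,) cosine similarities
    pool_size = max(k, int(hard_frac * len(bank)))   # nearest fraction of the bank
    pool = sims.topk(pool_size).indices
    return pool[torch.randint(len(pool), (k,))]
```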
Results
  • On CIFAR10, ANN+ surpasses LA by 3%, while CMC-ANN+ surpasses CMC by over 2%. The authors see similar gains on ImageNet, where CMC-ANN+ reaches an accuracy of 50.5%, roughly 2% higher than CMC and LA.
Conclusion
  • The authors have presented an interpretation of representation learning based on mutual information between image views.
  • This formulation led to a more systematic understanding of a family of existing approaches.
  • By choosing more difficult negative samples, the authors surpassed high-performing algorithms like LA and CMC across several popular visual tasks.
  • This framework suggests several new directions.
Tables
  • Table 1: Looseness of VINCE
  • Table 2: Evaluation of the representations using six visual transfer tasks: object classification on ImageNet and CIFAR10 (a-d); object detection and instance segmentation on COCO (e, f); keypoint detection on COCO (g); object detection on Pascal VOC 2007 (h); and instance segmentation on LVIS (i). In all cases, the backbone network is a frozen pretrained ResNet-18 (R18).
Finally, we note that Lemma 3.1 provides a more critical comparison between IR and CMC: as the two are functionally identical, the only differences are in how each defines its views. We make three observations: First, the view set for CMC is partitioned into two disjoint sets with a one-to-one correspondence between elements of each set (since an image is decomposed into an L and an AB filter); further, as L and AB capture almost disjoint information, CMC imposes a strong information bottleneck between any two views. In fact, Fig. 2e shows the L-ab view set to be at the apex of the curve between MI and accuracy.
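For concreteness, here is a minimal sketch of the L/ab view decomposition described above, assuming skimage is available for the colour-space conversion; CMC's actual preprocessing may differ in details such as normalization.

```python
from skimage.color import rgb2lab

def lab_views(rgb_image):
    """Split an RGB image (H, W, 3), floats in [0, 1], into CMC-style views:
    the L (lightness) channel and the ab (colour) channels. The two views
    carry nearly disjoint information, imposing a strong bottleneck."""
    lab = rgb2lab(rgb_image)   # convert to Lab colour space, (H, W, 3)
    l_view = lab[..., :1]      # (H, W, 1) lightness
    ab_view = lab[..., 1:]     # (H, W, 2) colour
    return l_view, ab_view
```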

Reference
  • Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  • Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  • Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5356–5364, 2019.
  • Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
  • Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
  • Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 2019.
  • Lingpeng Kong, Cyprien de Masson d’Autume, Wang Ling, Lei Yu, Zihang Dai, and Dani Yogatama. A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350, 2019.
  • Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A Alemi, and George Tucker. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922, 2019.
  • Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
  • Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
  • Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via nonparametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
  • Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 6002–6012, 2019.
  • The encoders are 5-layer MLPs with 10 hidden dimensions and ReLU nonlinearities. To build the dataset, we sample 2000 points and optimize the InfoNCE objective with Adam with a learning rate of 0.03, batch size 128, and no weight decay for 100 epochs. Given a percentage for VINCE, we compute distances between all elements in the memory bank and the representation of the current image, and we sample 100 negatives only from the top p percent. We conduct the experiment with 5 different random seeds to estimate the variance.
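A minimal sketch of the synthetic-experiment encoder and optimizer described above; only the depth, hidden width, nonlinearity, and the Adam settings come from the text, while the input and output dimensionalities are placeholder assumptions.

```python
import torch
import torch.nn as nn

def make_encoder(in_dim, hidden_dim=10, out_dim=10, num_layers=5):
    """5-layer MLP with 10 hidden units and ReLU nonlinearities, as in the
    synthetic experiment; in_dim and out_dim are placeholder assumptions."""
    layers, dim = [], in_dim
    for _ in range(num_layers - 1):
        layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
        dim = hidden_dim
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

encoder = make_encoder(in_dim=2)
# Adam with learning rate 0.03 and no weight decay; batch size 128 for 100 epochs (training loop omitted).
optimizer = torch.optim.Adam(encoder.parameters(), lr=0.03, weight_decay=0.0)
```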
  • All hyperparameters are as described in Sec. B.3, with the exception of the particular hyperparameter we are varying for the experiment. To compare InfoNCE and the original IR formulation, we adapted the public PyTorch implementation found at https://github.com/neuroailab/LocalAggregation-Pytorch.
  • We make heavy usage of the Detectron2 code found at https://github.com/facebookresearch/detectron2.
  • In particular, the script https://github.com/