What Makes for Good Views for Contrastive Learning?

NeurIPS, 2020.


Abstract:

Contrastive learning between multiple views of the data has recently achieved state-of-the-art performance in the field of self-supervised representation learning. Despite its success, the influence of different view choices has been less studied. In this paper, we use empirical analysis to better understand the importance of view selection…

Introduction
  • It is common sense that how you look at an object does not change its identity. Jorge Luis Borges imagined the alternative.
  • The curse of Funes is that he has a perfect memory, and every new way he looks at the world reveals a percept minutely distinct from anything he has seen before.
  • He cannot collate the disparate experiences.
  • A popular paradigm is contrastive multiview learning, where two views of the same scene are brought together in representation space, and two views of different scenes are pushed apart
Highlights
  • It is common sense that how you look at an object does not change its identity
  • One color space learned from RGB happens to touch the sweet spot, but in general the I_NCE between views is overly decreased. The reverse-U shape trend holds for both non-volume-preserving (NVP) and volume-preserving (VP) models
  • We conjecture that while reducing mutual information (MI) between views in such an unsupervised manner, the view generator has no knowledge about task-relevant semantics and constructs views that do not share sufficient information about the label y, i.e., the constraint I(v1; y) = I(v2; y) = I(x; y) in Proposition 4.1 is not satisfied (stated schematically after this list)
  • On top of the "RA-CJ-Blur" augmentations shown in Figure 10, we further reduce the mutual information of views by using PIRL [46], i.e., adding JigSaw [49]. This improves the accuracy of the linear classifier from 63.6% to 65.9%
  • We have proposed an InfoMin principle and a view synthesis framework for constructing effective views for contrastive representation learning
  • Viewing data augmentation as information minimization, we achieved a new state-of-the-art result on the ImageNet linear readout benchmark with a ResNet-50
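The InfoMin constraint quoted above can be stated schematically as follows (a restatement of the principle from the highlights, with v1, v2 the two views, x the input, and y the downstream label; no notation beyond the source is introduced):

```latex
(v_1^{*}, v_2^{*}) = \arg\min_{v_1, v_2} I(v_1; v_2)
\quad \text{s.t.} \quad I(v_1; y) = I(v_2; y) = I(x; y)
```

In words: among all pairs of views that each retain the full task-relevant signal, prefer the pair that shares the least information overall.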
Results
  • The authors plot the I_NCE between the learned views and the corresponding linear evaluation performance.
  • One color space learned from RGB happens to touch the sweet spot, but in general the I_NCE between views is overly decreased.
  • Replacing the widely-used linear projection head [73, 65, 26] with a 2-layer MLP [8] increases the accuracy to 67.3% (a minimal sketch of such a head follows this list)
  • When using this nonlinear projection head, the authors found a larger temperature is beneficial for downstream linear readout.
  • All these numbers are obtained with 100 epochs of pre-training.
  • Compared to SimCLR, which requires 128 TPUs for large-batch training, the model can be trained with as few as 4 GPUs on a single machine
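A minimal sketch of the nonlinear projection head and temperature-scaled similarity mentioned above, assuming PyTorch; the dimensions (2048 → 2048 → 128) and the example temperature of 0.2 are illustrative choices, not settings confirmed by this summary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """2-layer MLP head that replaces a single linear projection."""
    def __init__(self, in_dim: int = 2048, hidden_dim: int = 2048, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so dot products between embeddings are cosine similarities,
        # which are then divided by the temperature before the contrastive loss.
        return F.normalize(self.net(x), dim=1)

head = ProjectionHead()
z = head(torch.randn(4, 2048))   # (4, 128) normalized embeddings
logits = z @ z.t() / 0.2         # temperature of 0.2, per the results above
```

The temperature finding reported above (0.2 preferable to 0.08 with this head) enters only through the divisor applied to the similarities.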
Conclusion
  • The authors have proposed an InfoMin principle and a view synthesis framework for constructing effective views for contrastive representation learning.
  • Viewing data augmentation as information minimization, the authors achieved a new state-of-the-art result on the ImageNet linear readout benchmark with a ResNet-50
Summary
  • Introduction:

    It is common sense that how you look at an object does not change its identity. Jorge Luis Borges imagined the alternative.
  • The curse of Funes is that he has a perfect memory, and every new way he looks at the world reveals a percept minutely distinct from anything he has seen before.
  • He cannot collate the disparate experiences.
  • A popular paradigm is contrastive multiview learning, where two views of the same scene are brought together in representation space, and two views of different scenes are pushed apart
  • Objectives:

    While there are many ways to construct views, the goal is to analyze the effect of view choice through a set of reproducible experiments.
  • Results:

    The authors plot the I_NCE between the learned views and the corresponding linear evaluation performance.
  • One color space learned from RGB happens to touch the sweet spot, but in general the I_NCE between views is overly decreased.
  • Replacing the widely-used linear projection head [73, 65, 26] with a 2-layer MLP [8] increases the accuracy to 67.3%
  • When using this nonlinear projection head, the authors found a larger temperature is beneficial for downstream linear readout.
  • All these numbers are obtained with 100 epochs of pre-training.
  • Compared to SimCLR, which requires 128 TPUs for large-batch training, the model can be trained with as few as 4 GPUs on a single machine
  • Conclusion:

    The authors have proposed an InfoMin principle and a view synthesis framework for constructing effective views for contrastive representation learning.
  • Viewing data augmentation as information minimization, the authors achieved a new state-of-the-art result on the ImageNet linear readout benchmark with a ResNet-50
Tables
  • Table1: We study how mutual information shared by views I(v1; v2) would affect the representation quality. We evaluate the learned representation on three downstream tasks: digit classification, background (STL-10) classification, and digit localization
  • Table2: Comparison of different view generators by measuring STL-10 classification accuracy: supervised, unsupervised, and semi-supervised
  • Table3: Switching to larger backbones with views learned by the semi-supervised method
  • Table4: Single-crop ImageNet accuracies (%) of linear classifiers [77] trained on representations learned with different contrastive methods using ResNet-50 [28]. InfoMin Aug. refers to data augmentation using RandomResizedCrop, Color Jittering, Gaussian Blur, RandAugment, Color Dropping, and a JigSaw branch as in PIRL [46] (a hedged sketch of this pipeline follows the table list). * indicates splitting the network into two halves
  • Table5: Results of object detection and instance segmentation fine-tuned on COCO. We adopt Mask R-CNN R50-FPN, and report the bounding box AP and mask AP on val2017. In the brackets are the gaps to the ImageNet supervised pre-training counterpart. For fair comparison, InstDis [73], PIRL [46], MoCo [26], and InfoMin are all pre-trained for 200 epochs
  • Table6: Pascal VOC object detection. All contrastive models are pretrained for 200 epochs on ImageNet for fair comparison. We use the Faster R-CNN R50-C4 architecture for object detection. APs are reported using the average of 5 runs. * we use numbers from [26] since the setting is exactly the same
  • Table7: COCO object detection and instance segmentation. R50-C4. In the brackets are the gaps to the ImageNet supervised pre-training counterpart. In green are gaps of ≥ 0.5 point. * numbers are from [26] since we use exactly the same fine-tuning setting
  • Table8: COCO object detection and instance segmentation. R50-FPN. In the brackets are the gaps to the ImageNet supervised pre-training counterpart. In green are gaps of ≥ 0.5 point
  • Table9: COCO object detection and instance segmentation. R101-C4. In the brackets are the gaps to the ImageNet supervised pre-training counterpart
  • Table10: COCO object detection and instance segmentation. R101-FPN. In the brackets are the gaps to the ImageNet supervised pre-training counterpart
  • Table11: COCO object detection and instance segmentation. Cascade R101-FPN. In the brackets are the gaps to the ImageNet supervised pre-training counterpart
  • Table12: COCO object detection and instance segmentation. X101-FPN. In the brackets are the gaps to the ImageNet supervised pre-training counterpart
  • Table13: COCO object detection and instance segmentation. X152-FPN. In the brackets are the gaps to the ImageNet supervised pre-training counterpart. Supervised model is pre-trained on ImageNet-5K, while InfoMin model is only pre-trained on ImageNet-1K
  • Table14: Single-crop ImageNet accuracies (%) of linear classifiers [77] trained on representations learned with different methods using various architectures
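For reference, a hedged sketch of the "InfoMin Aug." pipeline named in Table 4, written with torchvision transforms (assuming torchvision ≥ 0.11 for RandAugment); the probabilities and magnitudes below are placeholders rather than the authors' settings, and the PIRL-style JigSaw branch [46, 49] is omitted since it is a separate patch-shuffling pathway:

```python
from torchvision import transforms

infomin_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),                                  # RandomResizedCrop
    transforms.RandAugment(),                                           # RandAugment
    transforms.RandomApply(
        [transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),           # color jittering
    transforms.RandomGrayscale(p=0.2),                                  # color dropping
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),  # Gaussian blur
    transforms.ToTensor(),
])
```

Each transform strips some nuisance information shared between the two crops of an image, which is the "data augmentation as information minimization" view taken in the conclusion.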
Related work
  • Learning high-level representations of data that can be used to predict labels of interest is a well-studied problem in machine learning [5]. In recent years, the most competitive methods for learning representations without labels have been self-supervised contrastive representation learning [50, 30, 73, 65, 61, 8]. These methods use neural networks to learn a low-dimensional embedding of data by a "contrastive" loss which pushes apart dissimilar data pairs while pulling together similar pairs, an idea similar to exemplar learning [19]. Models based on contrastive losses have significantly outperformed other approaches based on generative models, smoothness regularization, dense prediction [78, 37, 52, 65], and adversarial losses [18].

    The core idea of contrastive representation learning is to learn a function (modeled by a deep network) that maps semantically nearby points (positive pairs) closer together in the embedding space, while pushing apart points that are dissimilar (negative pairs). One of the major design choices in contrastive learning is how to select the positive and negative pairs. For example, given a dataset of i.i.d. images, how can we synthesize positive and negative pairs? (A minimal sketch of such a contrastive loss appears at the end of this section.)

    The standard approach for generating positive pairs without additional annotations is to create multiple views of each datapoint. For example: splitting an image into luminance and chrominance [65], applying different random crops and data augmentations [73, 8, 4, 26, 75, 62], pasting an object into different backgrounds [79], using different timesteps within a video sequence [50, 80, 57, 25, 24], or using different patches within a single image [32, 50, 30]. Negative pairs can be generated by using views that come from randomly chosen images/patches/videos. In this work, we provide experimental evidence and analysis that can be used to guide the selection and learning of views.
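As a concrete illustration of the contrastive objective described above, here is a minimal InfoNCE-style loss in PyTorch using in-batch negatives; this is a generic sketch (names, batch construction, and the temperature value are assumptions), not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """Pull matching rows of z1/z2 (two views of the same images) together,
    while pushing apart all mismatched pairs within the batch."""
    z1 = F.normalize(z1, dim=1)                           # (N, D) embeddings of view 1
    z2 = F.normalize(z2, dim=1)                           # (N, D) embeddings of view 2
    logits = z1 @ z2.t() / temperature                    # (N, N) scaled cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Example: embeddings of two augmented views of an 8-image batch
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
```

Minimizing this loss maximizes a lower bound on the mutual information between the two views' representations, which is why the choice of views directly controls what the encoder can learn.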
Funding
  • As a by-product, we also achieve a new state-of-the-art accuracy on unsupervised pre-training for ImageNet classification (73% top-1 linear readout with a ResNet-50)
  • Applying our understanding to achieve state-of-the-art accuracy of 73.0% on the ImageNet linear readout benchmark with a ResNet-50
  • Using a temperature of 0.08 can lead to more than a 1% drop in accuracy compared to the optimal 0.2 when a nonlinear projection head is applied
  • On top of the "RA-CJ-Blur" augmentations shown in Figure 10, we further reduce the mutual information (or enhance the invariance) of views by using PIRL [46], i.e., adding JigSaw [49]. This improves the accuracy of the linear classifier from 63.6% to 65.9%
  • Replacing the widely-used linear projection head [73, 65, 26] with a 2-layer MLP [8] increases the accuracy to 67.3%
  • Viewing data augmentation as information minimization, we achieved a new state-of-the-art result on the ImageNet linear readout benchmark with a ResNet-50
Study subjects and analysis
cases: 3
In other words, we should remove task-irrelevant information between views. In this section, we will first discuss a hypothesis for the effect of I(v1; v2) on downstream transfer performance, and then empirically analyze three cases of reducing I(v1; v2) in practice; the I_NCE estimate used for these measurements is recalled below.
5.1 Three Regimes of Information Captured
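The I_NCE quantity measured in these experiments is the standard InfoNCE estimate of the mutual information between views [50]; as a reminder (this is the usual bound, not a new result), with K the number of samples contrasted against (one positive plus K − 1 negatives):

```latex
I_{\mathrm{NCE}}(v_1; v_2) = \log K - \mathcal{L}_{\mathrm{NCE}} \;\le\; I(v_1; v_2)
```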

Reference
  • [1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 126–135, 2017.
  • [2] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
  • [3] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–451, 2018.
  • [4] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019.
  • [5] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
  • [6] Jorge Luis Borges. Funes, the memorious. 1962.
  • [7] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  • [9] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  • [10] Soo-Whan Chung, Joon Son Chung, and Hong-Goo Kang. Perfect match: Improved cross-modal embeddings for audio-visual synchronisation. In ICASSP, 2019.
  • [11] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
  • [12] Thomas M Cover and Joy A Thomas. Entropy, relative entropy and mutual information. Elements of Information Theory, 2:1–55, 1991.
  • [13] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019.
  • [14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [15] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • [16] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
  • [17] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
  • [18] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pages 10541–10551, 2019.
  • [19] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, 2014.
  • [20] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 2010.
  • [21] Ian Fischer. The conditional entropy bottleneck. arXiv preprint arXiv:2002.05379, 2020.
  • [22] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  • [23] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • [24] Daniel Gordon, Kiana Ehsani, Dieter Fox, and Ali Farhadi. Watching the world go by: Representation learning from unlabeled videos. arXiv preprint arXiv:2003.07990, 2020.
  • [25] Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In ICCV Workshop, 2019.
  • [26] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
  • [27] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • [28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [29] Olivier J Henaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
  • [30] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.
  • [31] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
  • [32] Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H. Adelson. Learning visual groups from co-occurrences in space and time. International Conference on Learning Representations (ICLR), Workshop track, 2016.
  • [33] Anil K Jain. Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice Hall, 1989.
  • [34] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.
  • [35] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [36] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, 2018.
  • [37] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [38] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [39] Tianhao Li and Limin Wang. Learning spatiotemporal features via video and text pair discrimination. arXiv preprint arXiv:2001.05691, 2020.
  • [40] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [41] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
  • [42] Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.
  • [43] David McAllester and Karl Stratos. Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251, 2018.
  • [44] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. arXiv preprint arXiv:1912.06430, 2019.
  • [45] Matthias Minderer, Olivier Bachem, Neil Houlsby, and Michael Tschannen. Automatic shortcut removal for self-supervised representation learning. arXiv preprint arXiv:2002.08822, 2020.
  • [46] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991, 2019.
  • [47] Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross-modal agreement. arXiv preprint arXiv:2004.12943, 2020.
  • [48] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  • [49] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
  • [50] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • [51] Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.
  • [52] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
  • [53] Mandela Patrick, Yuki M Asano, Ruth Fong, Joao F Henriques, Geoffrey Zweig, and Andrea Vedaldi. Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298, 2020.
  • [54] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. MegDet: A large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [55] Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A Alemi, and George Tucker. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922, 2019.
  • [56] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
  • [57] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In ICRA, 2018.
  • [58] Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29-30):2696–2711, 2010.
  • [59] Eero P Simoncelli. Statistical modeling of photographic images. Handbook of Video and Image Processing, 2005.
  • [60] Stefano Soatto and Alessandro Chiuso. Visual representations: Defining properties and deep approximations. In ICLR, 2016.
  • [61] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, 2016.
  • [62] Aravind Srinivas, Michael Laskin, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. arXiv preprint arXiv:2004.04136, 2020.
  • [63] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 2015.
  • [64] Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019.
  • [65] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
  • [66] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In ICLR, 2020.
  • [67] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  • [68] Michael Tschannen, Josip Djolonga, Marvin Ritter, Aravindh Mahendran, Neil Houlsby, Sylvain Gelly, and Mario Lucic. Self-supervised learning of video-induced visual invariances. arXiv preprint arXiv:1912.02783, 2019.
  • [69] Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.
  • [70] Mike Wu, Chengxu Zhuang, Daniel Yamins, and Noah Goodman. On the importance of views in unsupervised representation learning. Preprint, 2020.
  • [71] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
  • [72] Zhirong Wu, Alexei A Efros, and Stella X Yu. Improving generalization via scalable neighborhood component analysis. In ECCV, 2018.
  • [73] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
  • [74] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [75] Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [76] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-supervised semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1476–1485, 2019.
  • [77] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
  • [78] Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1058–1067, 2017.
  • [79] Nanxuan Zhao, Zhirong Wu, Rynson WH Lau, and Stephen Lin. Distilling localization for self-supervised representation learning. arXiv preprint arXiv:2004.06638, 2020.
  • [80] Chengxu Zhuang, Alex Andonian, and Daniel Yamins. Unsupervised learning from video with deep neural embeddings. arXiv preprint arXiv:1905.11954, 2019.
  • [81] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. arXiv preprint arXiv:1903.12355, 2019.