Contrastive Multiview Coding

CoRR, 2019.

Other Links: dblp.uni-trier.de|arxiv.org
We show connections to mutual information maximization and extend it to scenarios including more than two views

Abstract:

Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the left ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a "dog" can be seen, heard, and felt). [...]

Introduction
  • A foundational idea in coding theory is to learn compressed representations that can be used to reconstruct the raw data.
  • The contrastive objective in the formulation, as in CPC, can be understood as attempting to maximize the mutual information between the representations of each view.
  • The authors apply contrastive learning to the multiview setting, attempting to maximize mutual information between representations of different views of the same scene.
Highlights
  • A foundational idea in coding theory is to learn compressed representations that can be used to reconstruct the raw data
  • We show connections to mutual information maximization (a bound is sketched after this list) and extend the framework to scenarios with more than two views
  • V1 might be the luminance of a particular image and V2 the chrominance
  • We have found that the (m + 1)-way softmax classification approach performed worse than our Noise-Contrastive Estimation (NCE)-based approximation, given the same number of noise samples
  • We extensively evaluate Contrastive Multiview Coding (CMC) on a number of datasets and tasks
  • On the ImageNet linear readout benchmark, we achieve 68.4% top-1 accuracy
  • We further extend our CMC learning framework to multiview scenarios
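The connection to mutual information can be made concrete: with k noise (negative) samples per positive, the contrastive loss provides a lower bound on the mutual information between the two view representations. The following is a minimal sketch of the relation as it is usually stated in the CPC/CMC line of work; the exact form of the constant is an assumption here and should be checked against Section 2 of the paper.

```latex
% Hedged sketch: with k noise samples per positive, minimizing the contrastive
% loss maximizes a lower bound on the mutual information between the two
% view representations z_1 and z_2.
\begin{equation}
  I(z_1; z_2) \;\geq\; \log(k) \;-\; \mathcal{L}_{\mathrm{contrast}}
\end{equation}
```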
Results
  • Similar to recent setups for contrastive learning (Oord et al., 2018; Gutmann & Hyvarinen, 2010; Mnih & Kavukcuoglu, 2013), the authors train a critic function $h_\theta$ to correctly select the single positive sample $x$ out of a set $S = \{x, y_1, y_2, \ldots, y_k\}$ that contains $k$ negative samples, by minimizing $\mathcal{L}_{\mathrm{contrast}} = -\mathbb{E}_{S}\big[\log \frac{h_\theta(x)}{\sum_{x' \in S} h_\theta(x')}\big]$ (a code sketch of this objective appears after this list).
  • Following Zhang et al (2016), the authors evaluate task generalization of the learned representation by training 1000-way linear classifiers on top of different layers.
  • Though in the unsupervised stage the authors only use 1.3K images from a dataset very different from the target dataset STL-10, the object recognition accuracy is close to that of the supervised method, which uses an end-to-end deep network trained directly on STL-10.
  • Given two views V1 and V2 of the data, the predictive learning approach approximately models p(v2|v1).
  • This paper argues for precisely the opposite idea: that cross-view representation learning is effective because it results in a kind of information minimization, discarding nuisance factors that are not shared between the views.
  • The idea behind CMC is that this can be achieved by doing infomax learning on two views that share signal but have independent noise.
  • These methods tend to learn representations that focus on low-level variations in the data, which are not very useful from the perspective of downstream tasks such as object recognition.
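To make the contrastive objective above concrete, here is a minimal PyTorch sketch of this style of loss. It uses in-batch negatives for simplicity rather than the paper's memory-bank NCE implementation, and the function name `info_nce_loss` and the temperature value are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of a contrastive (InfoNCE-style) loss: select the single
# positive among k negatives via a softmax over similarity scores.
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.07):
    """z1, z2: (batch, dim) L2-normalized embeddings of two views of the same
    scenes. For each anchor in z1, the matching row of z2 is the positive and
    the other batch entries act as the k = batch - 1 negatives."""
    logits = z1 @ z2.t() / temperature       # (batch, batch) similarity scores
    targets = torch.arange(z1.size(0))       # the positive is the diagonal entry
    return F.cross_entropy(logits, targets)  # softmax over 1 positive + k negatives

# Toy usage with random, normalized embeddings.
z1 = F.normalize(torch.randn(8, 128), dim=1)
z2 = F.normalize(torch.randn(8, 128), dim=1)
print(info_nce_loss(z1, z2).item())
```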
Conclusion
  • Time contrastive networks (Sermanet et al, 2017) use a triplet loss framework to learn representations from aligned video sequences of the same scene, taken by different video cameras.
  • The authors' technical method is closely related but differs in the following ways: the objective is extended to the case of more than two views (a sketch of this extension follows this list), and the loss function more closely follows the original method of noise-contrastive estimation (Gutmann & Hyvarinen, 2010) (see details in Section 2.4).
  • While CPC, Deep InfoMax, and the present paper are all very similar at the mathematical level, each explores a different set of view definitions, architectures, and application settings, and each contributes its own empirical investigation of this paradigm of representation learning.
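As a rough illustration of the more-than-two-views extension mentioned above (the "full-graph" style of pairing), the total objective can be built by summing a two-view contrastive loss over every pair of views. This is a sketch under that assumption; `info_nce_loss` refers to the sketch given earlier, and the encoder interfaces are hypothetical rather than the paper's exact networks.

```python
# Hedged sketch of a multiview contrastive objective built from pairwise losses.
from itertools import combinations

def full_graph_loss(view_batches, encoders, pair_loss):
    """view_batches: list of M tensors, one batch per view, aligned by scene.
    encoders: list of M per-view encoder networks.
    pair_loss: a two-view contrastive loss such as info_nce_loss above."""
    z = [enc(v) for enc, v in zip(encoders, view_batches)]
    total = 0.0
    for i, j in combinations(range(len(z)), 2):   # every pair of views (i, j)
        # contrast in both directions, since each view can serve as the anchor
        total = total + pair_loss(z[i], z[j]) + pair_loss(z[j], z[i])
    return total
```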
Tables
  • Table 1: Top-1 classification accuracy on the 1000 classes of ImageNet (Deng et al., 2009) with a single crop. We compare our CMC method with other unsupervised representation learning approaches by training 1000-way logistic regression classifiers on top of the feature maps of each layer, as proposed by Zhang et al. (2016). Methods marked with † have only half the number of parameters compared to the others, because the network is split into two halves (one per view)
  • Table2: Single crop top-1 classification accuracy on ImageNet. We evaluate CMC with ResNet-50, ResNet-101, or ResNet-50 x2 as encoder for each of the two views (L and ab)
  • Table 3: Results on the task of predicting semantic labels from the L-channel representation, which is learned using the patch-based contrastive loss and all 4 views. We compare CMC with Random and Supervised baselines, which serve as lower and upper bounds respectively. The core-view paradigm refers to Fig. 3(a), and the full-graph paradigm to Fig. 3(b)
  • Table4: We compare predictive learning with contrastive learning by evaluating the learned encoder on unseen dataset and task. The contrastive learning framework consistently outperforms predictive learning
  • Table 5: Classification accuracies on STL-10, using a two-layer MLP as the classifier to evaluate the representations learned by a small AlexNet. For all methods we compare against, we include the numbers reported in the DIM paper (Hjelm et al., 2019), except for SplitBrain, which is our reimplementation. Methods marked with † have half the number of parameters because of the split network
  • Table 6: Test accuracy (%) on UCF-101, which evaluates task transferability, and on HMDB-51, which evaluates task and dataset transferability. Most methods use either a single RGB view or an additional optical flow view, while VGAN explores sound as the second view. * indicates a different network architecture
  • Table 7: Performance on the task of using a single view v to predict the semantic labels, where v can be L, ab, depth or surface normal. Our CMC framework improves the quality of unsupervised representations towards that of supervised ones, for all of the views investigated. This uses the full-graph paradigm
  • Table 8: The variant of the AlexNet architecture used in our CMC for STL-10 (only half is shown here due to splitting). X: spatial resolution of the layer; C: number of channels; K: conv or pool kernel size; S: computation stride; P: padding; *: channel size depends on the input source, e.g. 1 for the L channel and 2 for the ab channel
  • Table 9: AlexNet architecture used in CMC for ImageNet (only half is shown here due to splitting). X: spatial resolution of the layer; C: number of channels; K: conv or pool kernel size; S: computation stride; P: padding; *: channel size depends on the input source, e.g. 1 for the L channel and 2 for the ab channel
  • Table 10: Encoder architecture used in our CMC for experimenting with different views on NYU Depth-V2. X: spatial resolution of the layer; C: number of channels; K: conv or pool kernel size; S: computation stride; P: padding; *: channel size depends on the input source, e.g. 1 for L, 2 for ab, 1 for depth, 3 for surface normal, and 1 for segmentation map
Related work
  • Unsupervised representation learning is about learning transformations of the data that make subsequent problem solving easier (Bengio et al., 2013). This field has a long history, starting with classical methods with well-established algorithms, such as principal components analysis (PCA; Jolliffe, 2011) and independent components analysis (ICA; Hyvarinen et al., 2004). These methods tend to learn representations that focus on low-level variations in the data, which are not very useful from the perspective of downstream tasks such as object recognition.

    Representations better suited to such tasks have been learnt using deep neural networks, starting with seminal techniques such as Boltzmann machines (Smolensky, 1986; Salakhutdinov & Hinton, 2009), autoencoders (Hinton & Salakhutdinov, 2006), variational autoencoders (Kingma & Welling, 2013), generative adversarial networks (Goodfellow et al., 2014) and autoregressive models (Oord et al., 2016). Numerous other works exist; for a review see Bengio et al. (2013).

    A powerful family of models for unsupervised representations is collected under the umbrella of "self-supervised" learning (Sa, 2004; Zhang et al., 2017; 2016; Isola et al., 2015; Wang & Gupta, 2015; Pathak et al., 2016; Zhang et al., 2019). In these models, an input X to the model is transformed into an output X̂, which is supposed to be close to another signal Y, which itself is related to X in some meaningful way. Examples of such X/Y pairs are: luminance and chrominance color channels of an image (Zhang et al., 2017), patches from a single image (Oord et al., 2018), modalities such as vision and sound (Owens et al., 2016), or the frames of a video (Wang & Gupta, 2015). Clearly, such examples are numerous in the world and provide us with nearly infinite amounts of training data: this is one of the appeals of this paradigm. Time contrastive networks (Sermanet et al., 2017) use a triplet loss framework to learn representations from aligned video sequences of the same scene, taken by different video cameras.

    Closely related to self-supervised learning is the idea of multi-view learning, a general term covering many different approaches such as co-training (Blum & Mitchell, 1998), multi-kernel learning (Cortes et al., 2009) and metric learning (Bellet et al., 2012; Zhuang et al., 2019); for comprehensive surveys please see Xu et al. (2013) and Li et al. (2018). Nearly all existing works have dealt with one or two views such as video or image/sound. However, in many situations, many more views are available to provide training signals for any representation.
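As an illustration of one such X/Y view pair, the luminance/chrominance split used by CMC can be constructed by converting an image to the Lab color space and treating the L channel and the ab channels as two views. The following is a minimal sketch using scikit-image; the helper name `lab_views` is an assumption for illustration.

```python
# Hedged sketch of building the (L, ab) view pair from an RGB image.
import numpy as np
from skimage.color import rgb2lab

def lab_views(rgb_image):
    """rgb_image: H x W x 3 float array in [0, 1]. Returns the L view
    (H x W x 1) and the ab view (H x W x 2), each fed to its own encoder."""
    lab = rgb2lab(rgb_image)
    return lab[..., :1], lab[..., 1:]

L_view, ab_view = lab_views(np.random.rand(96, 96, 3))
print(L_view.shape, ab_view.shape)   # (96, 96, 1) (96, 96, 2)
```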
Funding
  • On the ImageNet linear readout benchmark, we achieve 68.4% top-1 accuracy
  • On several benchmark tasks, our method achieves state of the art results, compared to other methods for self-supervised representation learning
  • Our approach yields representations that outperform the state of the art in self-supervised learning in head-to-head comparisons
  • In the ImageNet linear readout evaluation, we achieve 68.4% top-1 accuracy, which is slightly above the concurrent state-of-the-art work of Bachman et al. (2019)
Study subjects and analysis
negative pairs: 4096
For the memory-based CMC model, we adopt ideas from Wu et al. (2018) for computing and storing a memory bank of sample embeddings. We retrieve 4096 negative pairs from the memory bank to contrast each positive pair. The training details are presented in Sec.
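The following is a minimal sketch of the memory-bank mechanism referenced above, in the spirit of Wu et al. (2018) but not the authors' actual implementation: embeddings of all samples are cached, 4096 negatives are drawn from the cache for each positive pair, and entries are refreshed with a momentum update. The class and method names are illustrative assumptions.

```python
# Hedged sketch of a memory bank that supplies negatives for contrastive learning.
import torch
import torch.nn.functional as F

class MemoryBank:
    """Cache of one embedding per training sample; negatives are drawn from it."""

    def __init__(self, num_samples, dim, momentum=0.5):
        self.bank = F.normalize(torch.randn(num_samples, dim), dim=1)
        self.momentum = momentum

    def sample_negatives(self, k=4096):
        idx = torch.randint(0, self.bank.size(0), (k,))
        return self.bank[idx]                     # (k, dim) negative embeddings

    def update(self, indices, new_embeddings):
        # exponential moving average of old and freshly computed embeddings
        mixed = self.momentum * self.bank[indices] + (1 - self.momentum) * new_embeddings
        self.bank[indices] = F.normalize(mixed, dim=1)
```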

view pairs: 3
To further understand this, we go beyond chrominance (ab) and ask the same question when geometry or semantic labels are present. We consider three view pairs on the NYU-Depth dataset: (1) L and depth, (2) L and surface normals, and (3) L and segmentation map. For each of them, we train two identical encoders for L: one using contrastive learning and the other using predictive learning.
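For contrast with the contrastive objective sketched earlier, here is a minimal sketch of the predictive-learning baseline described above: the same L encoder feeds a decoder that regresses the other view pixel-wise, approximately modeling p(v2|v1). The choice of an L1 loss and the function name are assumptions, not the paper's exact setup.

```python
# Hedged sketch of a cross-view predictive (regression) baseline.
import torch.nn.functional as F

def predictive_loss(encoder, decoder, v1, v2):
    """v1: batch of L-channel inputs; v2: batch of target-view maps (e.g. depth).
    The decoder regresses v2 from the encoding of v1, approximating p(v2 | v1)."""
    prediction = decoder(encoder(v1))
    return F.l1_loss(prediction, v2)   # pixel-wise regression; loss choice is an assumption
```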

Reference
  • Information Diagram - Wikipedia. https://en.wikipedia.org/wiki/Information_diagram.
  • Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 126–135, 2017.
  • Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
  • Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019.
  • Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
  • Aurelien Bellet, Amaury Habrard, and Marc Sebban. Similarity learning for provably accurate sparse linear classification. arXiv preprint arXiv:1206.6476, 2012.
  • Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
  • Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100. ACM, 1998.
  • Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In Proceedings of the 34th International Conference on Machine Learning, pp. 517–526. JMLR.org, 2017.
  • Uta Buchler, Biagio Brattoli, and Bjorn Ommer. Improving spatiotemporal self-supervision by deep reinforcement learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 770–786, 2018.
  • Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149, 2018.
  • Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223, 2011.
  • Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In Advances in Neural Information Processing Systems, pp. 396–404, 2009.
  • Hanneke EM Den Ouden, Peter Kok, and Floris P De Lange. How prediction errors shape perception, attention, and motivation. Frontiers in Psychology, 3:548, 2012.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009.
  • Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430, 2015.
  • Jeff Donahue, Philipp Krahenbuhl, and Trevor Darrell. Adversarial feature learning. In International Conference on Learning Representations, 2017.
  • David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
  • Chuang Gan, Boqing Gong, Kun Liu, Hao Su, and Leonidas J Guibas. Geometry guided convolutional neural networks for self-supervised video representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5589–5597, 2018.
  • Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  • Melvyn A Goodale and A David Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 1992.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Michael Gutmann and Aapo Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  • Olivier J Henaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
  • Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.
  • Jakob Hohwy. The predictive mind. Oxford University Press, 2013.
  • Aapo Hyvarinen, Juha Karhunen, and Erkki Oja. Independent component analysis, volume 46. John Wiley & Sons, 2004.
  • Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H Adelson. Learning visual groups from co-occurrences in space and time. arXiv preprint arXiv:1511.06811, 2015.
  • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134, 2017.
  • Ian Jolliffe. Principal component analysis. Springer, 2011.
  • Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Philipp Krahenbuhl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
  • Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676, 2017.
  • Yingming Li, Ming Yang, and Zhongfei Mark Zhang. A survey of multi-view representation learning. IEEE Transactions on Knowledge and Data Engineering, 2018.
  • Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast AutoAugment. arXiv preprint arXiv:1905.00397, 2019.
  • Zelun Luo, Boya Peng, De-An Huang, Alexandre Alahi, and Li Fei-Fei. Unsupervised learning of long-term motion dynamics for videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2203–2212, 2017.
  • David McAllester and Karl Statos. Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251, 2018.
  • Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In European Conference on Computer Vision. Springer, 2016.
  • Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pp. 2265–2273, 2013.
  • Hossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 737–744. ACM, 2009.
  • Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  • Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision. Springer, 2016.
  • Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5898–5906, 2017.
  • Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
  • Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T Freeman. Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2405–2413, 2016.
  • Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544, 2016.
  • Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A Alemi, and George Tucker. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922, 2019.
  • Springer, 2015.
  • Virginia Sa. Sensory modality segregation. In Advances in Neural Information Processing Systems, pp. 913–920, 2004.
  • Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pp. 448–455, 2009.
  • Nawid Sayed, Biagio Brattoli, and Bjorn Ommer. Cross and learn: Cross-modal self-supervision. arXiv preprint arXiv:1811.03879, 2018.
  • Gerald E Schneider. Two visual systems. Science, 1969.
  • Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from pixels. 2017.
  • Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp. 568–576, 2014.
  • Linda Smith and Michael Gasser. The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1-2):13–29, 2005.
  • Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.
  • Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pp. 613–621, 2016.
  • Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2015.
  • Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018.
  • Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.
  • Springer, 2007.
  • Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547–2555, 2019.
  • Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision. Springer, 2016.
  • Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1058–1067, 2017.
  • Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. arXiv preprint arXiv:1903.12355, 2019.
  • STL-10 (Coates et al., 2011) is an image recognition dataset designed for developing unsupervised or self-supervised learning algorithms. It consists of 100,000 unlabeled 96 × 96 RGB training images and 500 labeled samples for each of the 10 classes.
  • Comparison. We compare CMC with the state-of-the-art unsupervised methods in Table 5. Three columns are shown: the conv5 and fc7 columns use these layers of AlexNet, respectively, as the encoder (again remembering that we split across channels for the L and ab views). For these two columns we can compare against all methods except CPC, since CPC does not report these numbers in their paper (Hjelm et al., 2019). In the Strided Crop setup, we only compare against the approaches that use contrastive learning, DIM and CPC, since this setup was only used by those works. We note that in Table 5, for all methods except SplitBrain, we report the numbers shown in the original papers. For SplitBrain, we reimplemented their model faithfully and report numbers based on our reimplementation (we verified the accuracy of our SplitBrain code by the fact that our reimplementation gives very similar results to the original paper (Zhang et al., 2017) on the ImageNet experiments; see below).
  • ImageNet (Deng et al., 2009) consists of 1000 image classes and is frequently considered as a testbed for unsupervised representation learning algorithms.