Data Efficient Image Recognition with Contrastive Predictive Coding

Olivier J. Hénaff
Jeffrey De Fauw
Ali Razavi
Carl Doersch
S. M. Ali Eslami
Aaron van den Oord

arXiv preprint arXiv:1905.09272, 2019.


Abstract:

Human observers can learn to recognize new categories of images from a handful of examples, yet doing so with machine perception remains an open challenge. We hypothesize that data-efficient recognition is enabled by representations which make the variability in natural signals more predictable. We therefore revisit and improve Contrastive Predictive Coding, an unsupervised objective for learning such representations.

Introduction
  • Deep neural networks excel at perceptual tasks when labeled data are abundant, yet their performance degrades substantially when provided with limited supervision (Fig. 1, red).
  • Humans and animals can learn about new classes of images from a small number of examples (Landau et al., 1988; Markman, 1989).
  • What accounts for this monumental difference in data-efficiency between biological and machine vision?
  • The authors hypothesize that spatially predictable representations may allow artificial systems to benefit from human-like data-efficiency
Highlights
  • Deep neural networks excel at perceptual tasks when labeled data are abundant, yet their performance degrades substantially when provided with limited supervision (Fig. 1, red)
  • Humans and animals can learn about new classes of images from a small number of examples (Landau et al., 1988; Markman, 1989). What accounts for this monumental difference in data-efficiency between biological and machine vision? While highly structured representations (e.g. as proposed by Lake et al. (2015)) may improve data-efficiency, it remains unclear how to program explicit structures that capture the enormous complexity of real-world visual scenes, such as those present in the ImageNet dataset (Russakovsky et al., 2015)
  • Human perceptual representations have been shown to linearize the temporal transformations found in natural videos, a property lacking from current supervised image recognition models (Henaff et al., 2019), and theories of both spatial and temporal predictability have succeeded in describing properties of early visual areas (Rao & Ballard, 1999; Palmer et al., 2015)
  • The labeled dataset Dl is a random subset of the ImageNet dataset: the authors investigated using 1%, 2%, 5%, 10%, 20%, 50% and 100% of the dataset (a sampling sketch follows this list)
  • In Table 2 and Fig. 1 the authors report the results of this fine-tuned model. This procedure leads to a substantial increase in accuracy, yielding 78.3% Top-5 accuracy with only 1% of the labels, a 34% absolute improvement (77% relative) over purely supervised methods
  • Since most representation learning methods have previously been evaluated using linear classification, the authors use this benchmark to guide a series of modifications to the training protocol and architecture and compare to published results
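
A minimal sketch (not the authors' code) of how such a labeled subset Dl could be drawn with NumPy; the per-class balancing is an assumption for illustration, since the text only specifies a random subset of ImageNet:

```python
import numpy as np

def sample_labeled_subset(labels, fraction, seed=0):
    """Indices of a per-class random subset covering `fraction` of the data."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        class_idx = np.flatnonzero(labels == c)
        n_keep = max(1, int(round(fraction * len(class_idx))))
        keep.append(rng.choice(class_idx, size=n_keep, replace=False))
    return np.concatenate(keep)

# Example: 1% of a toy 1000-class dataset with 100 images per class.
labels = np.repeat(np.arange(1000), 100)
subset = sample_labeled_subset(labels, fraction=0.01)
assert len(subset) == 1000  # one image per class at 1%
```
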
Methods
  • The original CPC model spatially jittered individual patches independently
  • The authors further this logic by adopting the ‘color dropping’ method of Doersch et al. (2015), which randomly drops two of the three color channels in each patch, and find it to deliver systematic gains (+3% accuracy); a sketch of this augmentation follows this list.
  • Since fine-tuned representations yield only marginal gains over fixed ones (e.g. 77.1% vs. 78.3% Top-5 accuracy given 1% of the labels, see Table 3), the authors train an identical ResNet classifier on top of these representations while keeping them fixed.
  • The authors find that the results are on par with or surpass even the strongest such results (Zhai et al., 2019), even though that work combines a variety of techniques with a large architecture whose capacity is similar to their own
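
For illustration, a minimal NumPy sketch of the ‘color dropping’ augmentation described above, under the assumption that “dropping” a channel means zeroing it out; this is an interpretation, not the authors' implementation:

```python
import numpy as np

def color_drop(patches, seed=None):
    """patches: float array [num_patches, height, width, 3] (RGB)."""
    rng = np.random.default_rng(seed)
    out = np.zeros_like(patches)
    # Independently for each patch, keep exactly one random channel,
    # i.e. drop the other two.
    keep = rng.integers(0, 3, size=len(patches))
    for i, c in enumerate(keep):
        out[i, ..., c] = patches[i, ..., c]
    return out

# Example: augment a batch of 49 patches of size 64x64.
patches = np.random.rand(49, 64, 64, 3)
augmented = color_drop(patches, seed=0)
```
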
Results
  • When testing whether CPC enables data-efficient learning, the authors wish to use the best representative of this model class.
  • Since most representation learning methods have previously been evaluated using linear classification, the authors use this benchmark to guide a series of modifications to the training protocol and architecture and compare to published results (a minimal linear-probe sketch follows this list).
  • In section 4.2 the authors turn to the central question of whether CPC enables data-efficient classification.
  • In section 4.3 the authors investigate the generality of the results through transfer learning to PASCAL VOC 2007.
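
A minimal sketch of the linear-classification benchmark mentioned above: a multinomial logistic regression trained by plain gradient descent on features from the frozen encoder. The optimizer and the precomputed `features`/`labels` arrays are stand-ins, not the authors' setup:

```python
import numpy as np

def train_linear_classifier(features, labels, num_classes, lr=0.1, steps=500):
    """Multinomial logistic regression on fixed features, by gradient descent."""
    n, d = features.shape
    w = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ w
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        w -= lr * features.T @ (probs - onehot) / n   # softmax cross-entropy gradient
    return w

# Toy usage with random stand-in "features" from a frozen encoder.
rng = np.random.default_rng(0)
features, labels = rng.normal(size=(256, 32)), rng.integers(0, 10, size=256)
w = train_linear_classifier(features, labels, num_classes=10)
top1 = ((features @ w).argmax(axis=1) == labels).mean()
```
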
Conclusion
  • The authors asked whether CPC could enable data-efficient image recognition, and found that it greatly improves the accuracy of classifiers and object detectors when given small amounts of labeled data.
  • The authors' results show that there is still room for improvement using relatively straightforward changes such as augmentation, optimization, and network architecture.
  • Overall, these results open the door toward research on problems where data is naturally limited, e.g. medical imaging or robotics.
  • Contrastive prediction methods, including the techniques proposed in this paper, are task-agnostic and could serve as a unifying framework for integrating these tasks
Tables
  • Table 1: Linear classification accuracy, and comparison to other self-supervised methods. In all cases the feature extractor is optimized in an unsupervised manner, using one of the methods listed below. A linear classifier is then trained on top using all labels in the ImageNet dataset, and evaluated using a single crop. Prior art reported from [1] Wu et al. (2018), [2] Zhuang et al. (2019), [3] He et al. (2019), [4] Misra & van der Maaten (2019), [5] Doersch & Zisserman (2017), [6] Kolesnikov et al. (2019), [7] van den Oord et al. (2018), [8] Donahue & Simonyan (2019), [9] Bachman et al. (2019), [10] Tian et al. (2019)
  • Table 2: Data-efficient image classification. We compare the accuracy of two ResNet classifiers, one trained on the raw image pixels, the other on the proposed CPC v2 features, for varying amounts of labeled data. Note that we also fine-tune the CPC features for the supervised task, given the limited amount of labeled data. Regardless, the ResNet trained on CPC features systematically surpasses the one trained on pixels, even when given 2–5× fewer labels to learn from. The red (respectively, blue) boxes highlight comparisons between the two classifiers, trained with different amounts of data, which illustrate a 5× (resp. 2×) gain in data-efficiency in the low-data (resp. high-data) regime
  • Table 3: Comparison to other methods for semi-supervised learning. Representation learning methods use a classifier to discriminate an unsupervised representation, and optimize it solely with respect to labeled data. Label-propagation methods on the other hand further constrain the classifier with smoothness and entropy criteria on unlabeled data, making the additional assumption that all training images fit into a single (unknown) testing category. When evaluating CPC v2, BigBiGAN, and AMDIM, we train a ResNet-33 on top of the representation, while keeping the representation fixed or allowing it to be fine-tuned. All other results are reported from their respective papers: [1] Zhai et al. (2019), [2] Xie et al. (2019), [3] Wu et al. (2018), [4] Misra & van der Maaten (2019)
  • Table 4: Comparison of PASCAL VOC 2007 object detection accuracy to other transfer methods. The supervised baseline learns from the entire labeled ImageNet dataset and fine-tunes for PASCAL detection. The second class of methods learns from the same unlabeled images before transferring. The architecture column specifies the object detector (Fast-RCNN or Faster-RCNN) and the feature extractor (ResNet-50, -101, -152, or -161). All of these methods pre-train on the ImageNet dataset, except for DeeperCluster, which learns from the larger, but uncurated, YFCC100M dataset (Thomee et al., 2015). All methods fine-tune on the PASCAL 2007 training set, and are evaluated in terms of mean average precision (mAP). Prior art reported from [1] Dosovitskiy et al. (2014), [2] Doersch & Zisserman (2017), [3] Pathak et al. (2016), [4] Zhang et al. (2016), [5] Doersch et al. (2015), [6] Wu et al. (2018), [7] Caron et al. (2018), [8] Caron et al. (2019), [9] Zhuang et al. (2019), [10] Misra & van der Maaten (2019), [11] He et al. (2019)
Related work
  • Data-efficient learning has typically been approached by two complementary methods, both of which seek to make use of more plentiful unlabeled data: representation learning and label propagation. The former formulates an objective to learn a feature extractor fθ in an unsupervised manner, whereas the latter directly constrains the classifier hψ using the unlabeled data.

    Representation learning saw early success using generative modeling (Kingma et al., 2014), but likelihood-based models have yet to generalize to more complex stimuli. Generative adversarial models have also been harnessed for representation learning (Donahue et al., 2016), and large-scale implementations have led to corresponding gains in linear classification accuracy (Donahue & Simonyan, 2019).

    In contrast to generative models, which require the reconstruction of observations, self-supervised techniques directly formulate tasks involving the learned representation. For example, simply asking a network to recognize the spatial layout of an image led to representations that transferred to popular vision tasks such as classification and detection (Doersch et al., 2015; Noroozi & Favaro, 2016). Other works showed that prediction of color (Zhang et al., 2016; Larsson et al., 2017) and image orientation (Gidaris et al., 2018), and invariance to data augmentation (Dosovitskiy et al., 2014), can provide useful self-supervised tasks. Beyond single images, works have leveraged video cues such as object tracking (Wang & Gupta, 2015), frame ordering (Misra et al., 2016), and object boundary cues (Li et al., 2016; Pathak et al., 2016). Non-visual information can be equally powerful: information about camera motion (Agrawal et al., 2015; Jayaraman & Grauman, 2015), scene geometry (Zamir et al., 2016), or sound (Arandjelovic & Zisserman, 2017; 2018) can all serve as natural sources of supervision.
Findings
  • We find that we can reclaim much of batch normalization’s training efficiency by using layer normalization (+2% accuracy, Ba et al. (2016))
  • Additional prediction tasks incrementally increased accuracy (adding bottom-up predictions: +2% accuracy; using all four spatial directions: +2.5% accuracy); the contrastive objective behind these prediction tasks is sketched after this list
  • To that effect, the original CPC model spatially jittered individual patches independently. We further this logic by adopting the ‘color dropping’ method of Doersch et al. (2015), which randomly drops two of the three color channels in each patch, and find it to deliver systematic gains (+3% accuracy)
  • After tuning the supervised model for low-data classification (varying network depth, regularization, and optimization parameters) and extensive use of data augmentation (including the transformations used for CPC pre-training), the accuracy of the best model reaches 44.1% Top-5 accuracy when trained on 1% of the dataset (compared to 95.2% when trained on the entire dataset, see Table 2 and Fig. 1, red)
  • In Table 2 and Fig. 1 (blue curve) we report the results of this fine-tuned model. This procedure leads to a substantial increase in accuracy, yielding 78.3% Top-5 accuracy with only 1% of the labels, a 34% absolute improvement (77% relative) over purely supervised methods
  • When given the entire dataset, this classifier reaches 83.4%/96.5% Top-1/Top-5 accuracy, surpassing our supervised baseline (ResNet-200: 80.2%/95.2% accuracy) and published results (original ResNet-200 v2: 79.9%/95.2%, He et al. (2016b); with AutoAugment: 80.0%/95.0%, Cubuk et al. (2018))
  • The CPC training objective is much richer and requires larger architectures to be taken advantage of, as evidenced by the difference in linear classification accuracy between a ResNet-50 and a ResNet-161 trained for CPC (Table 1, 63.8% vs. 71.5% Top-1 accuracy)
  • Since fine-tuned representations yield only marginal gains over fixed ones (e.g. 77.1% vs. 78.3% Top-5 accuracy given 1% of the labels, see Table 3), we train an identical ResNet classifier on top of these representations while keeping them fixed
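
For concreteness, a minimal NumPy sketch of the InfoNCE contrastive objective (van den Oord et al., 2018) that underlies these prediction tasks. The dot-product scoring and the single prediction direction are simplifications; CPC v2, as described above, predicts in all four spatial directions:

```python
import numpy as np

def info_nce(pred, pos, negs):
    """pred: [batch, dim] context-based predictions of patch latents;
    pos: [batch, dim] true ("positive") latents;
    negs: [batch, num_neg, dim] latents of other ("negative") patches."""
    pos_logit = np.sum(pred * pos, axis=-1, keepdims=True)   # [batch, 1]
    neg_logits = np.einsum('bd,bnd->bn', pred, negs)         # [batch, num_neg]
    logits = np.concatenate([pos_logit, neg_logits], axis=1)
    logits -= logits.max(axis=1, keepdims=True)              # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()  # the positive sits at index 0

# Toy usage: 8 predictions of 128-d latents, scored against 16 negatives each.
rng = np.random.default_rng(0)
loss = info_nce(rng.normal(size=(8, 128)),
                rng.normal(size=(8, 128)),
                rng.normal(size=(8, 16, 128)))
```
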
Study subjects and analysis
Labeled datasets: 3
In all cases, the dataset of unlabeled images Du we pre-train on is the full ImageNet ILSVRC 2012 training set (Russakovsky et al., 2015). We consider three labeled datasets Dl for evaluation, each with an associated classifier hψ and supervised loss LSup (see Fig. 2, right). This protocol is sufficiently generic to allow us to later compare the CPC representation to other methods which have their own means of learning a feature extractor fθ.
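
A toy, runnable sketch of this protocol with stand-in components (PCA in place of CPC pre-training for fθ, and a nearest-centroid classifier in place of hψ), to make explicit that fθ sees only unlabeled data while hψ is trained on the small labeled set:

```python
import numpy as np

rng = np.random.default_rng(0)
unlabeled = rng.normal(size=(2000, 64))       # D_u: unlabeled data (toy)
labeled_x = rng.normal(size=(30, 64))         # D_l: 30 labeled examples
labeled_y = np.arange(30) % 3                 # 3 classes, 10 examples each

# "Pre-training" f_theta on D_u alone (no labels involved): PCA here.
mean = unlabeled.mean(axis=0)
_, _, vt = np.linalg.svd(unlabeled - mean, full_matrices=False)
f_theta = lambda x: (x - mean) @ vt[:8].T     # fixed 8-dim features

# Training h_psi on D_l only, on top of the fixed representation.
feats = f_theta(labeled_x)
centroids = np.stack([feats[labeled_y == c].mean(axis=0) for c in range(3)])
h_psi = lambda x: ((f_theta(x)[:, None, :] - centroids) ** 2).sum(-1).argmin(1)

predictions = h_psi(labeled_x)                # class predictions
```
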

Reference
  • Agrawal, P., Carreira, J., and Malik, J. Learning to see by moving. In ICCV, 2015.
  • Arandjelovic, R. and Zisserman, A. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 609–617, 2017.
  • Arandjelovic, R. and Zisserman, A. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 435–451, 2018.
  • Ba, L. J., Kiros, R., and Hinton, G. E. Layer normalization. CoRR, abs/1607.06450, 2016.
  • Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019.
  • Barlow, H. Unsupervised learning. Neural Computation, 1(3):295–311, 1989.
  • Caron, M., Bojanowski, P., Joulin, A., and Douze, M. Deep clustering for unsupervised learning of visual features. In The European Conference on Computer Vision (ECCV), September 2018.
  • Caron, M., Bojanowski, P., Mairal, J., and Joulin, A. Leveraging large-scale uncurated data for unsupervised pre-training of visual features. 2019.
  • Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), pp. 539–546, 2005.
  • Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
  • De Fauw, J., Ledsam, J. R., Romera-Paredes, B., Nikolov, S., Tomasev, N., Blackwell, S., Askham, H., Glorot, X., O’Donoghue, B., Visentin, D., et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine, 24(9):1342, 2018.
  • Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
  • Doersch, C. and Zisserman, A. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2051–2060, 2017.
  • Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430, 2015.
  • Donahue, J. and Simonyan, K. Large scale adversarial representation learning. arXiv preprint arXiv:1907.02544, 2019.
  • Donahue, J., Krahenbuhl, P., and Darrell, T. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
  • Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 766–774, 2014.
  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The PASCAL visual object classes challenge 2007 (VOC2007) results. 2007.
  • Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  • Grandvalet, Y. and Bengio, Y. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pp. 529–536, 2005.
  • Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.
  • Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pp. 1735–1742. IEEE, 2006.
  • He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016a.
  • He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.
  • He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
  • Henaff, O. J., Goris, R. L., and Simoncelli, E. P. Perceptual straightening of natural videos. Nature Neuroscience, 22(6):984–991, 2019.
  • Hinton, G. and Sejnowski, T. J. Unsupervised Learning: Foundations of Neural Computation. A Bradford Book. MIT Press, 1999. ISBN 9780262581684.
  • Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Jayaraman, D. and Grauman, K. Learning image representations tied to ego-motion. In ICCV, 2015.
  • Jing, L. and Tian, Y. Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387, 2018.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
  • Kolesnikov, A., Zhai, X., and Beyer, L. Revisiting self-supervised visual representation learning. CoRR, abs/1901.09005, 2019. URL http://arxiv.org/abs/1901.09005.
  • Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Landau, B., Smith, L. B., and Jones, S. S. The importance of shape in early lexical learning. Cognitive Development, 3(3):299–321, 1988.
  • Larsson, G., Maire, M., and Shakhnarovich, G. Colorization as a proxy task for visual understanding. In CVPR, pp. 6874–6883, 2017.
  • LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
  • Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, pp. 2, 2013.
  • Li, Y., Paluri, M., Rehg, J. M., and Dollar, P. Unsupervised learning of edges. In CVPR, 2016.
  • Lim, S., Kim, I., Kim, T., Kim, C., and Kim, S. Fast AutoAugment. arXiv preprint arXiv:1905.00397, 2019.
  • Markman, E. M. Categorization and Naming in Children: Problems of Induction. MIT Press, 1989.
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc., 2013.
  • Misra, I. and van der Maaten, L. Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991, 2019.
  • Misra, I., Zitnick, C. L., and Hebert, M. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016.
  • Miyato, T., Maeda, S.-i., Ishii, S., and Koyama, M. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • Mnih, A. and Kavukcuoglu, K. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pp. 2265–2273, 2013.
  • Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
  • Palmer, S. E., Marre, O., Berry, M. J., and Bialek, W. Predictive information in a sensory population. Proceedings of the National Academy of Sciences, 112(22):6908–6913, 2015.
  • Pathak, D., Girshick, R., Dollar, P., Darrell, T., and Hariharan, B. Learning features by watching objects move. arXiv preprint arXiv:1612.06370, 2016.
  • Pinto, L. and Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In ICRA, 2016.
  • Pinto, L., Davidson, J., and Gupta, A. Supervision via competition: Robot adversaries for learning tasks. arXiv preprint arXiv:1610.01685, 2016.
  • Rao, R. P. and Ballard, D. H. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79, 1999.
  • Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99, 2015.
  • Richthofer, S. and Wiskott, L. Predictable feature analysis. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 2015. doi: 10.1109/ICMLA.2015.158.
  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., Levine, S., and Brain, G. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141. IEEE, 2018.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.
  • Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L.-J. YFCC100M: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
  • Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
  • Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377, 1999.
  • van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Wang, X. and Gupta, A. Unsupervised learning of visual representations using videos. In ICCV, 2015.
  • Wiskott, L. and Sejnowski, T. J. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
  • Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018.
  • Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
  • Zamir, A. R., Wekel, T., Agrawal, P., Wei, C., Malik, J., and Savarese, S. Generic 3D representation via pose estimation and matching. In ECCV, 2016.
  • Zhai, X., Oliver, A., Kolesnikov, A., and Beyer, L. S4L: Self-supervised semi-supervised learning. arXiv preprint arXiv:1905.03670, 2019.
  • Zhang, R., Isola, P., and Efros, A. A. Colorful image colorization. In ECCV, 2016.
  • Zhu, X. and Ghahramani, Z. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
  • Zhuang, C., Zhai, A. L., and Yamins, D. Local aggregation for unsupervised learning of visual embeddings. arXiv preprint arXiv:1903.12355, 2019.