Supervised Contrastive Learning

NeurIPS, 2020.

Links: arxiv.org | dblp.uni-trier.de

Abstract:

Cross entropy is the most widely used loss function for supervised training of image classification models. In this paper, we propose a novel training methodology that consistently outperforms cross entropy on supervised learning tasks across different architectures and data augmentations. We modify the batch contrastive loss, which has recently been shown to be very effective at learning powerful representations in the self-supervised setting. […]
Introduction
  • The cross-entropy loss is the most widely used loss function for supervised learning.
  • It is naturally defined as the KL divergence between two discrete distributions: the (typically one-hot) label distribution and the softmax distribution induced by the logits (a short restatement of this connection follows this list).
  • Many proposed improvements to regular cross-entropy loosen the definition of the loss, in particular the constraint that the reference (label) distribution be axis-aligned.
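For reference, a minimal restatement of that connection in standard notation (not quoted from the paper): with a one-hot target distribution q and softmax probabilities p computed from the logits, minimizing cross-entropy is equivalent to minimizing the KL divergence, because the entropy of a one-hot q is zero.

```latex
% Cross-entropy between target distribution q and softmax output p, and its KL form
H(q, p) = -\sum_{c=1}^{C} q_c \log p_c
        = \mathrm{KL}(q \,\|\, p) + H(q)
        = \mathrm{KL}(q \,\|\, p) \qquad \text{when } q \text{ is one-hot, since } H(q) = 0 .
```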
Highlights
  • The cross-entropy loss is the most widely used loss function for supervised learning
  • We propose a novel extension to the contrastive loss function that allows for multiple positives per anchor
  • We show analytically that the gradient of our loss function encourages learning from hard positives and hard negatives
  • We have presented a novel loss, inspired by contrastive learning, that outperforms cross entropy on classification accuracy and robustness benchmarks
  • The loss function provides a natural connection between fully unsupervised training on the one end, and fully supervised training on the other
  • This opens the possibility of applications in semi-supervised learning, which can leverage the benefits of a single loss that can smoothly shift behavior based on the availability of labeled data
Methods
  • The authors start by reviewing the contrastive learning loss for self-supervised representation learning, as used in recent papers that achieve state-of-the-art results [36, 21, 46, 6]; a minimal sketch of the supervised, multi-positive extension of this loss follows this list.
  • In light of the findings of [6] that the self-supervised contrastive loss requires significantly different data augmentation than the cross-entropy loss, the authors evaluate three different data augmentation options for the second training stage (the linear classifier trained on the frozen embedding).
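To make the multi-positive extension concrete, here is a minimal NumPy sketch in the spirit of the supervised contrastive loss (Eq. 4): every other sample in the batch that shares the anchor's label is treated as a positive, and all remaining samples contribute to the contrastive denominator. The function and variable names are ours rather than taken from the paper's released code; embeddings are assumed L2-normalized, and the two-view batch construction and all other training details are omitted.

```python
import numpy as np

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Sketch of a SupCon-style loss with multiple positives per anchor.

    z:      (n, d) array of L2-normalized embeddings.
    labels: (n,) array of integer class labels.
    """
    n = z.shape[0]
    sim = z @ z.T / temperature                  # pairwise scaled similarities
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)      # exclude each anchor from its own sums
    log_denom = np.log(np.exp(sim).sum(axis=1))  # log sum over all a != i

    total, anchors = 0.0, 0
    for i in range(n):
        positives = np.where((labels == labels[i]) & ~self_mask[i])[0]
        if positives.size == 0:                  # anchor with no positive in the batch
            continue
        # -1/|P(i)| * sum_{p in P(i)} [ z_i.z_p / tau - log sum_{a != i} exp(z_i.z_a / tau) ]
        total += -np.mean(sim[i, positives] - log_denom[i])
        anchors += 1
    return total / max(anchors, 1)

# Toy usage: 6 embeddings, 3 classes, 2 samples per class.
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 32))
z /= np.linalg.norm(z, axis=1, keepdims=True)
labels = np.array([0, 0, 1, 1, 2, 2])
print(supervised_contrastive_loss(z, labels))
```

When labels are unavailable and each anchor's only positive is another augmentation of the same sample, the same expression reduces to the standard self-supervised contrastive objective, which is the connection between unsupervised and supervised training highlighted in the conclusion.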
Results
  • Similar to the results for self-supervised contrastive learning [46, 6], the authors found that representations taken from the encoder give better performance on downstream tasks than those taken from the projection network.
  • The authors achieve a new state-of-the-art top-1 accuracy of 78.8% on ImageNet with ResNet-50 and AutoAugment
Conclusion
  • The authors have presented a novel loss, inspired by contrastive learning, that outperforms cross entropy on classification accuracy and robustness benchmarks.
  • The authors' experiments show that this loss is less sensitive to hyperparameter changes, which could be a useful practical consideration.
  • The loss function provides a natural connection between fully unsupervised training on the one end, and fully supervised training on the other.
  • This opens the possibility of applications in semi-supervised learning, which can leverage the benefits of a single loss that can smoothly shift behavior based on the availability of labeled data.
Summary
  • Introduction:

    The cross-entropy loss is the most widely used loss function for supervised learning.
  • It is naturally defined as the KL divergence between two discrete distributions: the (typically one-hot) label distribution and the softmax distribution induced by the logits.
  • Many proposed improvements to regular cross-entropy loosen the definition of the loss, in particular the constraint that the reference (label) distribution be axis-aligned.
  • Methods:

    The authors start by reviewing the contrastive learning loss for self-supervised representation learning, as used in recent papers that achieve state-of-the-art results [36, 21, 46, 6].
  • In light of the findings of [6] that the self-supervised contrastive loss requires significantly different data augmentation than the cross-entropy loss, the authors evaluate three different data augmentation options for the second training stage (the linear classifier trained on the frozen embedding).
  • Results:

    Similar to the results for self-supervised contrastive learning [46, 6], the authors found that representations taken from the encoder give better performance on downstream tasks than those taken from the projection network.
  • The authors achieve a new state-of-the-art top-1 accuracy of 78.8% on ImageNet with ResNet-50 and AutoAugment
  • Conclusion:

    The authors have presented a novel loss, inspired by contrastive learning, that outperforms cross entropy on classification accuracy and robustness benchmarks.
  • The authors' experiments show that this loss is less sensitive to hyperparameter changes, which could be a useful practical consideration.
  • The loss function provides a natural connection between fully unsupervised training on the one end, and fully supervised training on the other.
  • This opens the possibility of applications in semi-supervised learning, which can leverage the benefits of a single loss that can smoothly shift behavior based on the availability of labeled data.
Tables
  • Table1: Top-1/Top-5 accuracy results on ImageNet for ResNet-50 and ResNet-200, with AutoAugment [9] used as the augmentation for Supervised Contrastive learning. Achieving 78.8% on ResNet-50, we outperform all of the top methods whose performance is shown above. Baseline numbers are taken from the referenced papers, and we additionally reimplement cross-entropy ourselves for a fair comparison
  • Table2: Training with the Supervised Contrastive Loss makes models more robust to corruptions in images, as measured by Mean Corruption Error (mCE) and relative mCE over the ImageNet-C dataset [22] (lower is better)
  • Table3: Comparison of Top-1 accuracy as the number of positives N_yi in Eq. 4 varies from 1 to 5. Adding more positives benefits the final Top-1 accuracy. We compare against previous state-of-the-art self-supervised work [6], which uses one positive that is another data augmentation of the same sample; see text for details
  • Table4: Top: Average Expected Calibration Error (ECE) over all the corruptions in ImageNet-C [22] for a given level of severity (lower is better); Bottom: Average Top-1 Accuracy over all the corruptions for a given level of severity (higher is better)
  • Table5: Comparison between representations learnt using SupCon and representations learnt using Cross Entropy loss with either 1 stage of training or 2 stages (representation learning followed by linear classifier)
  • Table6: Results of training the ResNet-50 architecture with AutoAugment data augmentation policy for 350 epochs and then training the linear classifier for another 350 epochs. Learning rates were optimized for every optimizer while all other hyper-parameters were kept the same
  • Table7: Combinations of different data augmentations for ResNet-50 trained with optimal set of hyper-parameters and optimizers. We observe the best performance when the same data augmentation is used for both pre-training and training the linear classifier on top of the frozen embedding network
Related work
  • Our work draws on existing literature in self-supervised representation learning, metric learning and supervised learning. Due to the large amount of literature, we focus on the most relevant papers. The cross-entropy loss was introduced as a powerful loss function to train deep networks [38, 1, 28]. The key idea is simple and intuitive: each class is assigned a target (usually 1-hot) vector, and the logits at the last layer of the network, after a softmax transformation, are gradually transformed towards the target vector. However, it is unclear why these target labels should be the optimal ones; some work has been done on identifying better target label vectors, e.g. [52].

    In addition, a number of papers have studied other drawbacks of the cross-entropy loss, such as sensitivity to noisy labels [59, 44], adversarial examples [14, 34], and poor margins [4]. Alternative losses have been proposed; however, the more popular and effective ideas in practice have been approaches that change the reference label distribution, such as label smoothing [45, 33], data augmentations such as Mixup [56] and CutMix [55], and knowledge distillation [24].
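As a concrete illustration of one such change to the reference label distribution (our own minimal sketch, not code from any of the cited papers), label smoothing simply mixes the 1-hot target with a uniform distribution before the usual cross-entropy is applied; Mixup and CutMix instead combine pairs of inputs and their labels.

```python
import numpy as np

def smoothed_targets(labels, num_classes, eps=0.1):
    """Label smoothing: mix the 1-hot targets with a uniform distribution."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

def cross_entropy(logits, targets):
    """Cross-entropy between (possibly smoothed) targets and softmax(logits)."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()

logits = np.array([[2.0, 0.5, -1.0]])
hard = cross_entropy(logits, smoothed_targets(np.array([0]), num_classes=3, eps=0.0))
soft = cross_entropy(logits, smoothed_targets(np.array([0]), num_classes=3, eps=0.1))
print(hard, soft)  # smoothing softens the target and raises the loss floor
```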
Funding
  • Similar to the results for self-supervised contrastive learning [46, 6], we found that representations taken from the encoder give better performance on downstream tasks than those taken from the projection network
  • We achieve a new state-of-the-art top-1 accuracy of 78.8% on ImageNet with ResNet-50 and AutoAugment (for comparison, a number of the other top-performing methods are shown in Table 1)
Reference
  • Eric B Baum and Frank Wilczek. Supervised learning of probability distributions by neural networks. In Neural information processing systems, pages 52–61, 1988. 3
  • James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
  • Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906, 2017. 9
  • Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, pages 1565–1576, 2019. 3
  • Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11(Mar):1109–1135, 2010. 2
  • Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020. 2, 4, 5, 6, 7, 10, 14, 15, 16
  • Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 539–546. IEEE, 2005. 4
  • Marc Claesen and Bart De Moor. Hyperparameter search in machine learning. arXiv preprint arXiv:1502.02127, 2015. 9
  • Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 113–123, 2019. 1, 2, 4, 8, 9, 10
  • Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019. 1, 4, 9, 16
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 2009. 1, 7
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 3
  • Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015. 4
  • Gamaleldin Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. Large margin deep networks for classification. In Advances in neural information processing systems, pages 842–852, 2018. 1, 3
  • Nicholas Frosst, Nicolas Papernot, and Geoffrey E. Hinton. Analyzing and improving representations with the soft nearest neighbor loss. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2012–2020. PMLR, 2019. 4
  • Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. 8
  • Michael Gutmann and Aapo Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010. 4, 6
  • Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001. 5
  • Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019. 2, 6
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2, 4, 5, 8
  • Olivier J Henaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019. 2, 4, 5, 6
  • Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019. 2, 7, 8, 14, 15
  • Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent, 2012. 10
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. 2, 3
  • R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019. 2, 4, 5
  • Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Large scale learning of general visual representations for transfer. arXiv preprint arXiv:1912.11370, 2019. 1
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 8
  • Esther Levin and Michael Fleisher. Accelerated learning in layered neural networks. Complex systems, 2:625–640, 1988. 3
  • Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast autoaugment. arXiv preprint arXiv:1905.00397, 2019. 15
  • Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, volume 2, page 7, 2016. 1
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013. 3
  • Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in neural information processing systems, pages 2265–2273, 2013. 4
  • Rafael Muller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In Advances in Neural Information Processing Systems, pages 4696–4705, 2019. 2, 3
  • Kamil Nar, Orhan Ocal, S Shankar Sastry, and Kannan Ramchandran. Cross-entropy loss and low-rank features have responsibility for adversarial examples. arXiv preprint arXiv:1901.08360, 2019. 3
  • Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016. 4
  • Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 4
  • Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016. 10
  • David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986. 3
  • Ruslan Salakhutdinov and Geoff Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Artificial Intelligence and Statistics, pages 412–419, 2007. 4
  • Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. 4, 5, 6, 7, 14
  • Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018. 4
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015. 8
  • Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural information processing systems, pages 1857–1865, 2016. 2, 4, 6
  • Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014. 1, 3
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016. 2, 3
  • Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019. 2, 4, 5, 6, 14
  • Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012. 15
  • Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009. 2, 3, 4, 6, 7
  • Zhirong Wu, Alexei A Efros, and Stella Yu. Improving generalization via scalable neighborhood component analysis. In European Conference on Computer Vision (ECCV) 2018, 2018. 4
  • Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018. 2, 4
  • Qizhe Xie, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252, 2019. 1
  • Shuo Yang, Ping Luo, Chen Change Loy, Kenneth W Shum, and Xiaoou Tang. Deep representation learning with target coding. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. 3
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754–5764, 2019. 3
  • Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017. 10, 15
  • Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pages 6023–6032, 2019. 1, 3, 8, 16
  • Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017. 2, 3, 8, 16
  • Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European conference on computer vision, pages 649–666. Springer, 2016.
  • Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1058–1067, 2017. 4
  • Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in neural information processing systems, pages 8778–8788, 2018. 1, 3
Appendix
  • 6. Effect of Temperature in Loss Function
  • 2. Hard negatives: On the other hand, hard negatives have been shown to improve classification accuracy when models are trained with the triplet loss [40]. Low temperatures are equivalent to optimizing for hard negatives: for a given batch of samples and a specific anchor, lowering the temperature increases the value of P_ik (see Eq. 8) for samples which have a larger inner product with the anchor, and reduces it for samples which have a smaller inner product (for example, for inner products of 0.9 versus 0.1, the ratio of their exponentiated similarities grows from exp(0.8) ≈ 2.2 at τ = 1 to exp(8) ≈ 2981 at τ = 0.1). Further, the magnitude of the gradient coming from a given sample k belonging to a different class than the anchor is proportional to the probability P_ik. Therefore the model derives a large amount of training signal from samples which belong to a different class but which it finds hard to separate from the given anchor, which is by definition a hard negative.
  • 8. Comparison with Cross Entropy
  • 9. Training Details
  • 10. Derivation of Supervised Contrastive Learning Gradient: In Sec. 2 of the main paper, we presented motivation, based on the functional form of the gradient of the supervised contrastive loss L^sup_i(z_i) (Eq. 4 in the paper), for the claim that the supervised contrastive loss intrinsically causes learning to focus on hard positives and negatives, where the encoder can greatly benefit, instead of easy ones, where the encoder can only minimally benefit. In this section, we derive the mathematical expression for the gradient; a restatement of the resulting form appears after this list.
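For item 10, a hedged restatement of where that derivation lands, written only with respect to the normalized projection z_i (the paper's full derivation additionally propagates through the normalization of the encoder output) and consistent with the loss form sketched in the Methods section above:

```latex
\frac{\partial \mathcal{L}^{sup}_{i}}{\partial z_i}
  = \frac{1}{\tau}\Bigg[ \sum_{p \in P(i)} z_p \Big( P_{ip} - \frac{1}{|P(i)|} \Big)
      + \sum_{n \in N(i)} z_n \, P_{in} \Bigg],
\qquad
P_{ik} = \frac{\exp(z_i \cdot z_k / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},
```

where P(i) is the set of positives for anchor i, N(i) the set of negatives, A(i) = P(i) ∪ N(i), and τ is the temperature. As noted in item 2 above, the weight on each negative scales with P_in, so hard negatives receive a disproportionately large share of the gradient, especially at low temperature.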