Contrastive Representation Distillation

ICLR, 2020.


Abstract:

Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems…
Introduction
  • Knowledge distillation (KD) transfers knowledge from one deep learning model to another.
  • In the problem of “cross-modal distillation”, the authors may wish to transfer the representation of an image processing network to a sound (Aytar et al., 2016) or depth (Gupta et al., 2016) processing network, such that the deep features of an image and of the associated sound or depth input are highly correlated.
Highlights
  • Knowledge distillation (KD) transfers knowledge from one deep learning model to another
  • We evaluate our contrastive representation distillation (CRD) framework in three knowledge distillation tasks: (a) model compression of a large network to a smaller one; (b) cross-modal knowledge transfer; (c) ensemble distillation from a group of teachers to a single student network
  • PKT, SP, and contrastive representation distillation, which operate on the last several layers, perform well
  • Results on ImageNet validate the scalability of our contrastive representation distillation
  • Transferability of representations: we are interested in representations, and a primary goal of representation learning is to acquire general knowledge, that is, knowledge that transfers to tasks or datasets that were unseen during training
  • We have developed a novel technique for neural network distillation, using the concept of contrastive objectives, which are usually used for representation learning
Methods
  • The key idea of contrastive learning is very general: learn a representation that is close in some metric space for “positive” pairs and push apart the representation between “negative” pairs.
  • Consider two deep neural networks: a teacher f^T and a student f^S.
  • Let x be the network input; the authors denote the representations at the penultimate layer as f^T(x) and f^S(x), respectively.
  • The authors define random variables S and T for the student's and teacher's representations of the data, respectively: S = f^S(x) and T = f^T(x). The contrastive objective pulls S and T together when they come from the same input and pushes them apart when they come from different inputs (a minimal sketch follows this list).
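The sketch below illustrates the kind of contrastive objective described above, written in PyTorch. It is not the authors' implementation: the linear projection heads, embedding dimension, temperature, and the use of in-batch negatives (the paper instead draws many negatives from a memory buffer) are all illustrative assumptions.

```python
# Minimal sketch of a contrastive distillation loss (illustrative assumptions:
# in-batch negatives, linear projection heads, InfoNCE-style cross-entropy).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveDistillLoss(nn.Module):
    def __init__(self, s_dim, t_dim, feat_dim=128, temperature=0.1):
        super().__init__()
        # Project student and teacher penultimate features into a shared space.
        self.embed_s = nn.Linear(s_dim, feat_dim)
        self.embed_t = nn.Linear(t_dim, feat_dim)
        self.temperature = temperature

    def forward(self, f_s, f_t):
        # f_s: student features f^S(x), shape (batch, s_dim)
        # f_t: teacher features f^T(x), shape (batch, t_dim)
        z_s = F.normalize(self.embed_s(f_s), dim=1)
        z_t = F.normalize(self.embed_t(f_t.detach()), dim=1)  # teacher backbone is frozen
        # Pairwise similarities between student and teacher embeddings.
        logits = z_s @ z_t.t() / self.temperature  # (batch, batch)
        # Positive pairs sit on the diagonal (same input x); every other
        # teacher embedding in the batch serves as a negative.
        targets = torch.arange(z_s.size(0), device=z_s.device)
        return F.cross_entropy(logits, targets)
```

Minimizing this loss pulls each student embedding toward the teacher embedding of the same input and away from the teacher embeddings of other inputs, which is exactly the "positives close, negatives apart" behaviour the bullets above describe.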
Results
  • Results on CIFAR100: Table 1 and Table 2 compare top-1 accuracies of different distillation objectives.
  • The authors found that KD works quite well and that none of the other methods consistently outperforms KD on its own.
  • Another observation is that, when the teacher-student combination switches from the same to a different architectural style, methods that distill intermediate representations tend to perform worse than methods that distill from the last few layers.
  • Zhang et al. (2018b) proposed a Deep Mutual Learning setting in which the teacher and student networks are trained simultaneously rather than sequentially.
  • The authors notice that the combination of KD and CRD leads to better performance on the student side, as shown in the last row of Table 9; a sketch of such a combined objective follows this list.
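As a concrete illustration of combining the objectives discussed above, the snippet below sketches a student loss that adds a standard cross-entropy term, a Hinton-style KD term, and a contrastive term (using a criterion like the one sketched in the Methods section). The weights alpha and beta and the temperature T are illustrative placeholders, not the paper's tuned values.

```python
# Sketch of a combined student objective: cross-entropy + KD + contrastive term.
# alpha, beta, and T are illustrative placeholders, not the paper's settings.
import torch.nn.functional as F

def student_loss(logits_s, logits_t, labels, f_s, f_t, crd_criterion,
                 T=4.0, alpha=1.0, beta=1.0):
    ce = F.cross_entropy(logits_s, labels)
    # Hinton-style KD: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    kd = F.kl_div(F.log_softmax(logits_s / T, dim=1),
                  F.softmax(logits_t / T, dim=1),
                  reduction="batchmean") * T * T
    crd = crd_criterion(f_s, f_t)  # contrastive term, e.g. ContrastiveDistillLoss above
    return ce + alpha * kd + beta * crd
```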
Conclusion
  • The authors have developed a novel technique for neural network distillation, using the concept of contrastive objectives, which are usually used for representation learning.
  • The authors experimented with the objective on a number of applications such as model compression, cross-modal transfer and ensemble distillation, outperforming other distillation objectives by significant margins in all these tasks.
  • The authors' contrastive objective is the only distillation objective that consistently outperforms knowledge distillation across a wide variety of knowledge transfer tasks.
  • Contrastive learning is a simple and effective objective with practical benefits.
Tables
  • Table1: Test accuracy (%) of student networks on CIFAR100 of a number of distillation methods (ours is CRD); see Appendix for citations of other methods. ↑ denotes outperformance over KD and ↓ denotes underperformance. We note that CRD is the only method to always outperform KD (and also outperforms all other methods). We denote by * methods where we used our reimplementation based on the paper; for all other methods we used author-provided or author-verified code. Average over 5 runs
  • Table2: Top-1 test accuracy (%) of student networks on CIFAR100 of a number of distillation methods (ours is CRD) for transfer across very different teacher and student architectures. CRD outperforms KD and all other methods. Importantly, some methods that require very similar student and teacher architectures perform quite poorly. E.g., FSP (Yim et al., 2017) cannot even be applied, and AT (Zagoruyko & Komodakis, 2016a) and FitNet (Romero et al., 2014) perform very poorly. We denote by * methods where we used our reimplementation based on the paper; for all other methods we used author-provided or author-verified code. Average over 3 runs
  • Table3: Top-1 and Top-5 error rates (%) of student network ResNet-18 on the ImageNet validation set. We use the ResNet-34 released by the PyTorch team as our teacher network, and follow the standard ImageNet training practice of PyTorch except that we train for 10 more epochs. We compare our CRD with KD (Hinton et al., 2015), AT (Zagoruyko & Komodakis, 2016a) and Online-KD (Lan et al., 2018). “*” reported by the original paper (Lan et al., 2018) using an ensemble of online ResNets as teacher; no pretrained ResNet-34 was used
  • Table4: We transfer the representation learned from CIFAR100 to STL-10 and TinyImageNet datasets by freezing the network and training a linear classifier on top of the last feature layer to perform 10-way (STL-10) or 200-way (TinyImageNet) classification. For this experiment, we use the combination of teacher network WRN-40-2 and student network WRN-16-2. Classification accuracies (%) are reported
  • Table5: Performance on the task of using depth to predict semantic segmentation labels. We initialize the depth network either randomly or by distilling from an ImageNet-pretrained ResNet-18 teacher
  • Table6: Ablative study of different contrastive objectives and negative sampling policies on CIFAR100. For contrastive objectives, we compare our objective with InfoNCE (Oord et al., 2018). For the negative sampling policy, given an anchor image xi from the dataset, we randomly sample negatives xj such that either (a) i ≠ j, or (b) yi ≠ yj, where y represents the class label. Average over 5 runs
  • Table7: Test accuracy (%) of student networks on CIFAR100 of combining distillation methods with KD; we check the compatibility of our objective with KD as well as PKT. ↑ denotes outperformance over KD and ↓ denotes underperformance
  • Table8: We measure the transferability of the student network, by evaluating a linear classifier on top of its frozen representations on STL10 (abbreviated as “STL”) and TinyImageNet (abbreviated as “TI”). The best accuracy is bolded and the second best is underlined
  • Table9: Test accuracy (%) of student and teacher networks on CIFAR100 in the Deep Mutual Learning (Zhang et al., 2018b) setting, where the teacher and student networks are trained simultaneously rather than sequentially. We use “T” and “S” to denote the teacher and student models, respectively
  • Table10: Test accuracy (%) of student networks on CIFAR100 of a number of distillation methods (ours is CRD). Standard deviation is provided
  • Table11: Top-1 test accuracy (%) of student networks on CIFAR100 of a number of distillation methods (ours is CRD) for transfer across very different teacher and student architectures. Standard deviation is provided
Related work
  • The seminal work of Buciluǎ et al. (2006) and Hinton et al. (2015) introduced the idea of distilling knowledge from large, cumbersome models into smaller, faster models without losing too much generalization power. The general motivation is that at training time, the availability of computation allows “slop” in model size and potentially faster learning, but computation and memory constraints at inference time necessitate the use of smaller models. Buciluǎ et al. (2006) achieve this by matching output logits; Hinton et al. (2015) introduced the idea of temperature in the softmax outputs to better represent the smaller probabilities in the output for a single sample. These smaller probabilities provide useful information about the learned representation of the teacher model; an intermediate temperature, trading off between large temperatures (which increase entropy) and small ones, tends to provide the highest transfer of knowledge from teacher to student (the short sketch below illustrates the effect of temperature). The method of Li et al. (2014) is also closely related to Hinton et al. (2015).
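The snippet below is a minimal illustration of the temperature-scaled softmax described above; the logits are made-up values chosen only to show how the temperature changes the entropy of the output distribution.

```python
# Illustration of temperature scaling in the softmax (made-up example logits).
import torch
import torch.nn.functional as F

logits = torch.tensor([8.0, 2.0, 1.0, 0.5])
for T in (1.0, 4.0, 20.0):
    p = F.softmax(logits / T, dim=0)
    print(f"T={T}: {[round(v, 3) for v in p.tolist()]}")
# T=1 puts nearly all mass on the top class, hiding the "dark knowledge" carried
# by the small probabilities; a moderate T exposes it; a very large T flattens
# the distribution toward uniform, which destroys the information again.
```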
Funding
  • This research was supported in part by Google Cloud and iFlytek
References
  • Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9163–9171, 2019.
  • Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
  • Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
  • Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, 2016.
  • Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, 2014.
  • Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541. ACM, 2006.
  • Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223, 2011.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Ian J Goodfellow. On distinguishability criteria for estimating generative models. arXiv preprint arXiv:1412.6515, 2014.
  • Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross modal distillation for supervision transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2827–2836, 2016.
  • Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3779–3787, 2019.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
  • Judy Hoffman, Saurabh Gupta, and Trevor Darrell. Learning with side information through modality hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 826–834, 2016a.
  • Judy Hoffman, Saurabh Gupta, Jian Leong, Sergio Guadarrama, and Trevor Darrell. Cross-modal adaptation for rgb-d detection. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 5032–5039. IEEE, 2016b.
  • Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219, 2017.
  • Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. In Advances in Neural Information Processing Systems, pp. 2760–2769, 2018.
  • Animesh Koratana, Daniel Kang, Peter Bailis, and Matei Zaharia. Lit: Learned intermediate representation training for model compression. In International Conference on Machine Learning, 2019.
  • Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Xu Lan, Xiatian Zhu, and Shaogang Gong. Knowledge distillation by on-the-fly native ensemble. In Advances in Neural Information Processing Systems, 2018.
  • Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. Learning small-size dnn with output-distribution-based criteria. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3967–3976, 2019.
  • Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 268–284, 2018.
  • Baoyun Peng, Xiao Jin, Jiaheng Liu, Shunfeng Zhou, Yichao Wu, Yu Liu, Dongsheng Li, and Zhaoning Zhang. Correlation congruence for knowledge distillation. arXiv preprint arXiv:1904.01802, 2019.
  • Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, 2012.
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
  • Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. arXiv preprint arXiv:1907.09682, 2019.
  • Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141, 2017.
  • Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016a.
  • Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016b.
  • Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856, 2018a.
  • Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018b.