Cross-Layer Distillation with Semantic Calibration


Abstract:

Recently proposed knowledge distillation approaches based on feature-map transfer validate that intermediate layers of a teacher model can serve as effective targets for training a student model to obtain better generalization ability. Existing studies mainly focus on particular representation forms for knowledge transfer between manual...
Introduction
  • The generalization ability of a lightweight model can be improved by training it to match the predictions of a powerful model (Bucilua, Caruana, and Niculescu-Mizil 2006; Ba and Caruana 2014)
  • This idea is popularized by knowledge distillation (KD), in which temperature-scaled outputs from the teacher model are exploited to improve the performance of the student model (Hinton, Vinyals, and Dean 2015); a minimal sketch of this loss is given after this list.
  • The notation $F^{s}_{s_l}[i]$ denotes the output of student layer $s_l$ for the $i$-th instance and is shorthand for $F^{s}_{s_l}[i, :, :, :]$
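As background for the KD objective mentioned above, the sketch below is a minimal, generic PyTorch implementation of the temperature-scaled distillation loss from (Hinton, Vinyals, and Dean 2015); the function name `kd_loss` and the hyper-parameters `T` and `alpha` are illustrative assumptions, not the authors' exact training code.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Generic knowledge-distillation loss (Hinton, Vinyals, and Dean 2015).

    Combines cross-entropy on the hard labels with a KL-divergence term
    between temperature-scaled student and teacher distributions.
    T and alpha are illustrative hyper-parameters, not the paper's settings.
    """
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    # The T^2 factor keeps the soft-target gradients comparable to the CE term.
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce
```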
Highlights
  • The generalization ability of a lightweight model can be improved by training it to match the predictions of a powerful model (Bucilua, Caruana, and Niculescu-Mizil 2006; Ba and Caruana 2014)
  • This idea is popularized by knowledge distillation (KD), in which temperature-scaled outputs from the teacher model are exploited to improve the performance of the student model (Hinton, Vinyals, and Dean 2015)
  • Feature maps produced by multiple intermediate layers of a powerful teacher model are valuable for improving knowledge transfer performance
  • To alleviate the negative regularization effect caused by semantic mismatch between certain pairs of teacher-student intermediate layers, we propose semantic calibration via attention allocation for effective cross-layer distillation (a sketch of this soft layer association is given after this list)
  • Experimental results show that training with Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD) leads to a relatively low semantic mismatch score and generalization ability that outperforms the compared approaches
  • We propose a novel technique that significantly improves the effectiveness of feature-map transfer through semantic calibration via soft layer association
  • Visualization as well as detailed analysis provides some insights into the working principle of SemCKD
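To make the attention-allocation idea above concrete, the following is a minimal sketch of soft layer association for cross-layer distillation: each student feature map attends over all candidate teacher feature maps, and the resulting weights gate per-pair mean-squared errors. The pooled-feature query/key projections, the 1x1-convolution channel alignment, and the bilinear resizing are illustrative assumptions rather than the authors' exact SemCKD formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerAttention(nn.Module):
    """Sketch: each student layer distills from multiple teacher layers
    with automatically learned attention weights (soft layer association)."""

    def __init__(self, student_channels, teacher_channels, dim=128):
        super().__init__()
        # Queries from student layers and keys from teacher layers,
        # computed on globally average-pooled feature maps.
        self.query = nn.ModuleList([nn.Linear(c, dim) for c in student_channels])
        self.key = nn.ModuleList([nn.Linear(c, dim) for c in teacher_channels])
        # 1x1 convolutions align student channels to every teacher layer.
        self.proj = nn.ModuleList([
            nn.ModuleList([nn.Conv2d(sc, tc, kernel_size=1) for tc in teacher_channels])
            for sc in student_channels
        ])
        self.scale = dim ** 0.5

    def forward(self, student_feats, teacher_feats):
        loss = 0.0
        keys = [k(f.mean(dim=(2, 3))) for k, f in zip(self.key, teacher_feats)]
        for i, f_s in enumerate(student_feats):
            q = self.query[i](f_s.mean(dim=(2, 3)))                      # (B, dim)
            logits = torch.stack([(q * k).sum(dim=1) for k in keys], 1)  # (B, num_teacher_layers)
            attn = F.softmax(logits / self.scale, dim=1)                 # per-instance layer weights
            for j, f_t in enumerate(teacher_feats):
                aligned = F.interpolate(self.proj[i][j](f_s), size=f_t.shape[2:],
                                        mode="bilinear", align_corners=False)
                pair_mse = (aligned - f_t).pow(2).mean(dim=(1, 2, 3))    # (B,)
                loss = loss + (attn[:, j] * pair_mse).mean()
        return loss
```

In this sketch, `student_channels` and `teacher_channels` are the channel counts of the chosen candidate layers, and the returned term would be added to the usual cross-entropy/KD losses with some weighting.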
Methods
  • To demonstrate the effectiveness of the proposed semantic calibration strategy for cross-layer knowledge distillation, the authors conduct a series of classification tasks on the CIFAR-100 (Krizhevsky and Hinton 2009) and ImageNet (Russakovsky et al 2015) datasets.
  • A large variety of teacher-student combinations based on popular network architectures are evaluated, including VGG (Simonyan and Zisserman 2015), ResNet (He et al 2016), WRN (Zagoruyko and Komodakis 2016), MobileNet (Sandler et al 2018) and ShuffleNet (Ma et al 2018).
  • In addition to comparing SemCKD with representative feature-map distillation approaches, the authors provide results to support and explain the success of the semantic calibration strategy in helping student models obtain a proper regularization through three carefully designed experiments.
Results
  • The generalization ability of a lightweight model can be improved by training it to match the predictions of a powerful model (Bucilua, Caruana, and Niculescu-Mizil 2006; Ba and Caruana 2014)
  • This idea is popularized by knowledge distillation (KD), in which temperature-scaled outputs from the teacher model are exploited to improve the performance of the student model (Hinton, Vinyals, and Dean 2015).
  • When the Mean-Squared-Error between feature maps in Equation (6) is replaced by these value vectors to calculate the overall loss, performance drops by 2.76% (a generic form of this feature-map term is sketched below)
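The ablation above refers to Equation (6) as a mean-squared-error term between feature maps. A generic form of such a cross-layer term, assuming per-instance attention weights $\alpha_{(s_l, t_l)}[i]$ over teacher-student layer pairs and a projection $\phi(\cdot)$ that aligns feature-map shapes (notation assumed here for illustration, not necessarily the paper's exact Equation (6)), is

$$\mathcal{L}_{\mathrm{feat}}[i] = \sum_{s_l} \sum_{t_l} \alpha_{(s_l, t_l)}[i] \, \big\lVert \phi\big(F^{s}_{s_l}[i]\big) - F^{t}_{t_l}[i] \big\rVert_2^2 .$$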
Conclusion
  • Feature maps produced by multiple intermediate layers of a powerful teacher model are valuable for improving knowledge transfer performance.
  • To alleviate the negative regularization effect caused by semantic mismatch between certain pairs of teacher-student intermediate layers, the authors propose semantic calibration via attention allocation for effective cross-layer distillation.
  • Each student layer in the approach distills knowledge contained in multiple target layers with an automatically learned attention distribution to obtain proper supervision.
  • Experimental results show that training with SemCKD leads to a relatively low semantic mismatch score and generalization ability that outperforms the compared approaches.
  • Visualization as well as detailed analysis provides some insights into the working principle of SemCKD
Summary
  • Introduction:

    The generalization ability of a lightweight model can be improved by training it to match the predictions of a powerful model (Bucilua, Caruana, and Niculescu-Mizil 2006; Ba and Caruana 2014)
  • This idea is popularized by knowledge distillation (KD), in which temperature-scaled outputs from the teacher model are exploited to improve the performance of the student model (Hinton, Vinyals, and Dean 2015).
  • The notation $F^{s}_{s_l}[i]$ denotes the output of student layer $s_l$ for the $i$-th instance and is shorthand for $F^{s}_{s_l}[i, :, :, :]$
  • Objectives:

    Given a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ consisting of $N$ instances from $K$ categories, and a powerful teacher model pre-trained on $\mathcal{D}$, the goal is to reuse the same dataset to train a simpler student model with cheaper computational and storage demands
  • Methods:

    To demonstrate the effectiveness of the proposed semantic calibration strategy for cross-layer knowledge distillation, the authors conduct a series of classification tasks on the CIFAR-100 (Krizhevsky and Hinton 2009) and ImageNet (Russakovsky et al 2015) datasets.
  • A large variety of teacher-student combinations based on popular network architectures are evaluated, including VGG (Simonyan and Zisserman 2015), ResNet (He et al 2016), WRN (Zagoruyko and Komodakis 2016), MobileNet (Sandler et al 2018) and ShuffleNet (Ma et al 2018).
  • In addition to comparing SemCKD with representative feature-map distillation approaches, the authors provide results to support and explain the success of the semantic calibration strategy in helping student models obtain a proper regularization through three carefully designed experiments.
  • Results:

    The generalization ability of a lightweight model can be improved by training it to match the predictions of a powerful model (Bucilua, Caruana, and Niculescu-Mizil 2006; Ba and Caruana 2014)
  • This idea is popularized by knowledge distillation (KD), in which temperature-scaled outputs from the teacher model are exploited to improve the performance of the student model (Hinton, Vinyals, and Dean 2015).
  • When the Mean-Squared-Error between feature maps in Equation (6) is replaced by these value vectors to calculate the overall loss, performance drops by 2.76%
  • Conclusion:

    Feature maps produced by multiple intermediate layers of a powerful teacher model are valuable for improving knowledge transfer performance.
  • To alleviate the negative regularization effect caused by semantic mismatch between certain pairs of teacher-student intermediate layers, the authors propose semantic calibration via attention allocation for effective cross-layer distillation.
  • Each student layer in the approach distills knowledge contained in multiple target layers with an automatically learned attention distribution to obtain proper supervision.
  • Experimental results show that training with SemCKD leads to a relatively low semantic mismatch score and generalization ability that outperforms the compared approaches.
  • Visualization as well as detailed analysis provides some insights into the working principle of SemCKD
Tables
  • Table1: Top-1 test accuracy of feature-map distillation approaches on CIFAR-100
  • Table2: Top-1 test accuracy of feature-map distillation approaches on ImageNet
  • Table3: Semantic Mismatch Score (log-scale) for VGG-8 & ResNet-32x4 on CIFAR-100
  • Table4: Ablation study: Top-1 test accuracy for VGG-8 & ResNet-32x4 on CIFAR-100
  • Table5: Top-1 test accuracy of feature-embedding distillation approaches on CIFAR-100
Related work
  • Knowledge Distillation. KD serves as an effective recipe to improve the performance of a given student model by exploiting soft targets from a pre-trained teacher model (Hinton, Vinyals, and Dean 2015). Compared to discrete labels, fine-grained information among different categories provides extra supervision to optimize the student model better (Pereyra et al 2017; Muller, Kornblith, and Hinton 2019). A more recent interpretation of this improvement is that soft targets act as a learned label smoothing regularization that keeps the student model from producing over-confident predictions (Yuan et al 2020). To save the expense of pre-training, cost-effective online variants have subsequently been explored (Anil et al 2018; Chen et al 2020).

    Feature-Map Distillation. Rather than only formalizing knowledge in a highly abstract form such as predictions, recent methods attempt to leverage the information contained in intermediate layers by designing elaborate knowledge representations. A range of techniques has been developed for this purpose, such as aligning hidden layer responses called hints (Romero et al 2015), mimicking spatial attention maps (Zagoruyko and Komodakis 2017), or maximizing the mutual information through a variational principle (Ahn et al 2019). The transferred knowledge can also be captured by crude pairwise activation similarities (Tung and Mori 2019) or hybrid kernel formulations built on them (Passalis, Tzelepi, and Tefas 2020). With these pre-defined representations, all of the above methods perform knowledge transfer with certain hand-crafted layer associations, such as random selection or one-to-one matching. Unfortunately, as pointed out in (Passalis, Tzelepi, and Tefas 2020), these hard associations make the student model suffer from negative regularization, which limits the effectiveness of feature-map distillation. Based on a transfer learning framework, a recent solution is to learn the association weights with a meta-network given only the feature maps of the source network (Jang et al 2019), while our proposed approach incorporates more information from teacher-student layer pairs.
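As one concrete example of the hand-crafted representations surveyed above, attention transfer (Zagoruyko and Komodakis 2017) reduces each feature map to a spatial attention map (the channel-wise sum of squared activations) and matches normalized maps between a fixed student-teacher layer pair. The sketch below is a minimal rendering of that idea, assuming the paired layers share spatial dimensions; it is illustrative rather than the reference implementation.

```python
import torch.nn.functional as F

def attention_map(feat):
    """Spatial attention map: channel-wise sum of squared activations,
    flattened and L2-normalized per instance, shape (B, H*W)."""
    a = feat.pow(2).sum(dim=1).flatten(start_dim=1)
    return F.normalize(a, p=2, dim=1)

def attention_transfer_loss(student_feat, teacher_feat):
    """Mean-squared distance between normalized attention maps of a
    fixed student-teacher layer pair (assumes equal spatial sizes)."""
    return (attention_map(student_feat) - attention_map(teacher_feat)).pow(2).mean()
```

Such fixed layer pairings are exactly what the soft, learned associations described above aim to replace.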
Funding
  • The generalization ability of a lightweight model can be improved by training it to match the predictions of a powerful model (Bucilua, Caruana, and Niculescu-Mizil 2006; Ba and Caruana 2014). This idea is popularized by knowledge distillation (KD), in which temperature-scaled outputs from the teacher model are exploited to improve the performance of the student model (Hinton, Vinyals, and Dean 2015)
  • We propose a novel technique that significantly improves the effectiveness of feature-map transfer through semantic calibration via soft layer association
  • On average, SemCKD achieves a significant relative improvement (68.34%) over all of the compared methods
  • To validate the effectiveness of allocating the attention of each student layer across multiple target layers, equal weight assignment is applied instead. This lowers accuracy by 2.33% (from 75.27% to 72.94%) and considerably increases the variance (by 0.74%)
  • When the Mean-Squared-Error between feature maps in Equation (6) is replaced by these value vectors to calculate the overall loss, performance drops by 2.76%
Reference
  • Ahn, S.; Hu, S. X.; Damianou, A. C.; Lawrence, N. D.; and Dai, Z. 2019. Variational Information Distillation for Knowledge Transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9163– 9171.
  • Anil, R.; Pereyra, G.; Passos, A.; Ormandi, R.; Dahl, G. E.; and Hinton, G. E. 2018. Large scale distributed neural network training through online distillation. In International Conference on Learning Representations.
  • Ba, J.; and Caruana, R. 2014. Do Deep Nets Really Need to be Deep? In Advances in Neural Information Processing Systems, 2654–2662.
  • Bengio, Y.; Courville, A. C.; and Vincent, P. 2013. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8): 1798–1828.
  • Bucilua, C.; Caruana, R.; and Niculescu-Mizil, A. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 535–541.
  • Chen, D.; Mei, J.-P.; Wang, C.; Feng, Y.; and Chen, C. 2020. Online Knowledge Distillation with Diverse Peers. In Proceedings of the AAAI Conference on Artificial Intelligence, 3430–3437.
  • He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
  • Hinton, G. E.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
  • Jang, Y.; Lee, H.; Hwang, S. J.; and Shin, J. 2019. Learning What and Where to Transfer. In International Conference on Machine Learning.
  • Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical Report.
  • Liu, Y.; Cao, J.; Li, B.; Yuan, C.; Hu, W.; Li, Y.; and Duan, Y. 2019. Knowledge Distillation via Instance Relationship Graph. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7096–7104.
  • Ma, N.; Zhang, X.; Zheng, H.; and Sun, J. 2018. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision, 122–138.
  • Muller, R.; Kornblith, S.; and Hinton, G. E. 2019. When Does Label Smoothing Help? In Advances in Neural Information Processing Systems.
  • Park, W.; Kim, D.; Lu, Y.; and Cho, M. 2019. Relational Knowledge Distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3967– 3976.
  • Passalis, N.; and Tefas, A. 2018. Learning Deep Representations with Probabilistic Knowledge Transfer. In European Conference on Computer Vision, 283–299.
  • Passalis, N.; Tzelepi, M.; and Tefas, A. 2020. Heterogeneous Knowledge Distillation using Information Flow Modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Peng, B.; Jin, X.; Li, D.; Zhou, S.; Wu, Y.; Liu, J.; Zhang, Z.; and Liu, Y. 2019. Correlation Congruence for Knowledge Distillation. In International Conference on Computer Vision, 5006–5015.
  • Pereyra, G.; Tucker, G.; Chorowski, J.; Kaiser, Ł.; and Hinton, G. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.
  • Romero, A.; Ballas, N.; Kahou, S. E.; Chassang, A.; Gatta, C.; and Bengio, Y. 2015. FitNets: Hints for thin deep nets. In International Conference on Learning Representations.
  • Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. S.; Berg, A. C.; and Li, F. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3): 211–252.
  • Sandler, M.; Howard, A. G.; Zhu, M.; Zhmoginov, A.; and Chen, L. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.
  • Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In International Conference on Computer Vision, 618–626.
  • Simonyan, K.; and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
  • Tian, Y.; Krishnan, D.; and Isola, P. 2020. Contrastive Representation Distillation. In International Conference on Learning Representations.
  • Tung, F.; and Mori, G. 2019. Similarity-Preserving Knowledge Distillation. In International Conference on Computer Vision, 1365–1374.
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
  • Yuan, L.; Tay, F. E.; Li, G.; Wang, T.; and Feng, J. 2020. Revisiting Knowledge Distillation via Label Smoothing Regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Zagoruyko, S.; and Komodakis, N. 2016. Wide Residual Networks. In Proceedings of the British Machine Vision Conference.
  • Zagoruyko, S.; and Komodakis, N. 2017. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations.