Online Knowledge Distillation via Collaborative Learning

CVPR, pp. 11017-11026, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.01103

Abstract:

This work presents an efficient yet effective online Knowledge Distillation method via Collaborative Learning, termed KDCL, which is able to consistently improve the generalization ability of deep neural networks (DNNs) that have different learning capacities. Unlike existing two-stage knowledge distillation approaches that pre-train a DNN [...]

Introduction
  • Knowledge distillation [10] is typically formulated in a “teacher-student” learning setting.
  • In KDCL, student networks with different capacities learn collaboratively to generate high-quality soft target supervision, which distills additional knowledge to each student as illustrated in Fig. 1d.
  • The authors propose to generate high-quality soft target supervision by carefully ensembling the outputs of the students with ground-truth information in an online manner (a minimal sketch follows this list).
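As a rough illustration of the collaborative step above, the following PyTorch sketch trains every student with the hard label plus a KL term toward a shared soft target ensembled from all students' outputs. The `make_soft_target` callback, the equal loss weighting, and the temperature `T` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

T = 3.0  # distillation temperature (illustrative value, not taken from the paper)

def kdcl_step(students, optimizers, x, y, make_soft_target):
    """One collaborative training step: each student contributes to and
    learns from the same ensembled soft target (no pre-trained teacher)."""
    logits = [net(x) for net in students]                # forward all students
    with torch.no_grad():
        soft = make_soft_target(logits, y)               # shared soft target (probabilities)
    for logit, opt in zip(logits, optimizers):
        hard_loss = F.cross_entropy(logit, y)            # supervision from the one-hot label
        kd_loss = F.kl_div(F.log_softmax(logit / T, dim=1),
                           soft, reduction="batchmean") * T * T
        loss = hard_loss + kd_loss
        opt.zero_grad()
        loss.backward()                                  # each student's graph is independent
        opt.step()
```

One concrete (assumed) choice for `make_soft_target` is sketched after the Results list below.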
Highlights
  • Knowledge distillation [10] is typically formulated in a “teacher-student” learning setting
  • It can improve the performance of a compact ‘student’ deep neural network because the representation of a ‘teacher’ network can be used as structured knowledge to guide the training of the student
  • We propose a novel online knowledge distillation method via collaborative learning
  • Our method focuses on fusing the information of the students and guaranteeing the quality of the soft target to improve the generalization ability of the students
  • We propose a series of methods to generate a soft target, which ensures that students with different capacities benefit from collaborative learning and enhances the invariance of the network against input perturbations
  • In experiments on ImageNet, we analyze the effectiveness of our soft-target generation methods on the student pair ResNet-50 and ResNet-18, and we evaluate a series of network architectures
Results
  • A series of model ensembling methods are designed to dynamically generate high-quality soft targets in a one-stage online knowledge distillation framework.
  • The authors propose a series of methods to generate a soft target, which ensures that students with different capacities benefit from collaborative learning and enhances the invariance of the network against input perturbations (a hedged sketch of the ensembling idea follows this list).
  • In order to improve invariance against perturbations in the data domain, the authors generate an identical soft target for all students, even though each student is fed a differently distorted version of the same input.
  • Acting as additional training knowledge, the soft target encourages the sub-networks to reach a lower generalization error.
  • In experiments on ImageNet, the authors analyze the effectiveness of the soft-target generation methods on the student pair ResNet-50 and ResNet-18, and evaluate a series of network architectures.
  • The authors separate 20,000 images from the training set, 20 samples per class, as the validation set to measure the generalization ability of each sub-network for KDCL-General.
  • The weight for ensembling the predictions is updated every epoch rather than every iteration to save computational cost, so the generated soft target is not as good as that of KDCL-Linear and KDCL-MinLogit.
  • For KDCL-General on CIFAR-100, the authors separate 5,000 images from the training set, 50 samples per class, as the validation set to measure the generalization ability of the students.
  • For distillation (2nd row in Tab. 7) from Wide-ResNet-16 [30] with a widening factor of 2 (WRN-16-2) to ResNet-32, the authors observe that the student network ResNet-32 reaches 93.37% accuracy on the training set, behind the teacher network WRN-16-2 at 99.39%, while its test error is lower than that of WRN-16-2.
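The sketch below illustrates one simple way to realize "ensembling the outputs of the students with ground-truth information", in the spirit of the KDCL-Linear variant: a scalar mixing weight between two students' logits is grid-searched to minimize the cross-entropy against the hard labels on the current batch. The two-student restriction, the grid search, and the temperature are simplifications assumed here, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_soft_target(logits_a, logits_b, labels, T=3.0, steps=11):
    """Combine two students' logits into one shared soft target, picking the
    mixing weight that agrees best with the ground truth on this batch."""
    best_w, best_ce = 0.5, float("inf")
    for w in torch.linspace(0.0, 1.0, steps):
        mixed = w * logits_a + (1.0 - w) * logits_b
        ce = F.cross_entropy(mixed, labels).item()
        if ce < best_ce:
            best_ce, best_w = ce, w.item()
    mixed = best_w * logits_a + (1.0 - best_w) * logits_b
    return F.softmax(mixed / T, dim=1)   # probabilities consumed by every student
```

With a small wrapper that unpacks a two-element list of logits, this could serve as the `make_soft_target` callback in the earlier training-step sketch.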
Conclusion
  • In contrast to the results on ImageNet, the authors conjecture that a soft target with a lower cross-entropy loss on the CIFAR-100 training set leads to over-fitting, much like the one-hot label.
  • KDCL-General significantly improves performance by forming a more general teacher from the optimal weighted average of the students, with the weights determined on the validation set (a hedged sketch follows this list).
  • The improvement comes from fusing the information of differently distorted images; the shared soft target encourages the sub-networks to produce consistent outputs for similar inputs.
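One plausible way to derive the validation-based weights mentioned above is sketched below: each student's ensemble weight is set inversely proportional to its loss on the held-out validation split, and the weighted average of logits is then softened into the shared target. Both the inverse-loss weighting and the normalization are assumptions for illustration; the paper's exact weighting rule may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generalization_weights(students, val_loader, device="cpu"):
    """Estimate per-student ensemble weights on a held-out validation split:
    a lower validation loss yields a larger weight (illustrative heuristic)."""
    losses = torch.zeros(len(students))
    for x, y in val_loader:
        x, y = x.to(device), y.to(device)
        for i, net in enumerate(students):
            losses[i] += F.cross_entropy(net(x), y, reduction="sum").item()
    inv = 1.0 / (losses + 1e-8)       # invert: lower loss -> larger weight
    return inv / inv.sum()            # normalize so the weights sum to 1

def weighted_soft_target(logits_list, weights, T=3.0):
    """Average the students' logits with the validation-derived weights."""
    stacked = torch.stack(logits_list)                   # [num_students, B, C]
    w = weights.to(stacked.device).view(-1, 1, 1)
    return F.softmax((w * stacked).sum(dim=0) / T, dim=1)
```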
Tables
  • Table 1: Top-1 accuracy on the ImageNet-2012 validation set. The second column is the pre-trained teacher model’s performance, and the third one is the student model’s accuracy when trained with the KD loss. The student achieves 70.1% accuracy when supervised by the hard target
  • Table 2: Top-1 accuracy rate (%) on ImageNet. All the models are reimplemented with our training procedure for a fair comparison. Gain indicates the sum of the component student networks’ improvements. ONE and CLNN are incompatible with different network structures, so only the accuracy of ResNet-18 is compared
  • Table 3: Top-1 and Top-5 accuracy rates (%) on ImageNet. The backbone is ResNet-18. ONE is trained with 3 branches (Res4 block) and CLNN has a hierarchical design with 4 heads. For KDCL, ResNet-18 is trained with a peer network
  • Table 4: Comparative results of different sub-networks on the ImageNet validation set. MBV2 is the abbreviation of MobileNetV2. MBV2x0.5 denotes a width multiplier of 0.5. ResNet-50* and ResNet-18* are trained for 100 epochs. MBV2* and MBV2x0.5* are trained for 200 epochs
  • Table 5: KDCL benefits from ensembling more sub-networks. All the networks are ResNet-18 to prevent the impact of performance differences between networks
  • Table 6: Top-1 accuracy rate (%) on ImageNet. ResNet-50 is significantly improved with the knowledge from three compact models
  • Table 7: Comparative and ablative results of our soft-target generation methods on the CIFAR-100 dataset. ICL is invariant collaborative learning. We only report the accuracy of ResNet-32, as ONE and CLNN are incompatible with WRN-16-2
  • Table 8: Average precision (AP) on the COCO 2017 validation set with a pre-trained ResNet-18. All models are used as backbones for Faster R-CNN [19] and Mask R-CNN [7] based on FPN [16]
Related work
  • Knowledge transfer for neural networks is advocated by [2, 10] to distill knowledge from a teacher to a student. An obvious way is to let the student imitate the output of the teacher model. [2] proposes to improve shallow networks by penalizing the difference of logits between the student and the teacher. [10] realizes knowledge distillation by minimizing the Kullback-Leibler (KL) divergence between their output categorical probabilities (a minimal sketch of this loss follows this paragraph).
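A minimal PyTorch sketch of the classic two-stage KD objective referenced above [2, 10]: the student matches the teacher's softened distribution via a KL term, blended with the usual hard-label cross-entropy. The temperature `T` and the blending weight `alpha` are illustrative hyper-parameters, not values from the paper.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classic teacher-student distillation loss: soft KL term + hard CE term."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits.detach() / T, dim=1),
                    reduction="batchmean") * T * T   # T^2 keeps gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```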

    Structured knowledge. Based on the pioneering work, many methods have been proposed to excavate more information from the teacher. [20] introduces more supervision by further exploiting the features of intermediate hidden layers. [31] defines additional attention information combined with distillation. [18] mines mutual relations of data examples with distance-wise and angle-wise losses. [23] establishes an equivalence between Jacobian matching and distillation. [9] transfers more accurate information via the route to the decision boundary. A few recent papers on self-distillation [29, 3, 6, 28] have shown that a converged teacher model supervising a student model of identical architecture can improve the generalization ability over the teacher. In contrast to mimicking complex models, KDCL involves all networks in learning and provides hints by fusing the information of the students. Without any additional loss on intermediate layers, KDCL reduces the difficulty of optimizing the model.
References
  • Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
  • Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.
  • Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, and Ali Farhadi. Label refinery: Improving ImageNet classification through label progression. arXiv preprint arXiv:1805.02641, 2018.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. arXiv preprint arXiv:1805.04770, 2018.
  • Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge distillation with adversarial samples supporting decision boundary. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3771–3778, 2019.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
  • Cheng Ju, Aurélien Bibaut, and Mark van der Laan. The relative performance of ensemble methods with deep convolutional neural networks for image classification. Journal of Applied Statistics, 45(15):2800–2818, 2018.
  • Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207, 2003.
  • Xu Lan, Xiatian Zhu, and Shaogang Gong. Knowledge distillation by on-the-fly native ensemble. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 7528–7538. Curran Associates Inc., 2018.
  • Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  • Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.
  • Guocong Song and Wei Chai. Collaborative learning for deep neural networks. In Advances in Neural Information Processing Systems, pages 1832–1841, 2018.
  • Suraj Srinivas and François Fleuret. Knowledge transfer with Jacobian matching. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4723–4731. PMLR, 2018.
  • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
  • Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
  • David H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
  • Chenglin Yang, Lingxi Xie, Chi Su, and Alan L. Yuille. Snapshot distillation: Teacher-student optimization in one generation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4133–4141, 2017.
  • Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
  • Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. Deep mutual learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4320–4328, 2018.
  • Zhi-Hua Zhou, Jianxin Wu, and Wei Tang. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137(1-2):239–263, 2002.