Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher

NeurIPS 2020 (2020)

Abstract

Knowledge distillation is a strategy of training a student network under the guidance of the soft output of a teacher network. It has been a successful method for model compression and knowledge transfer. However, knowledge distillation currently lacks a convincing theoretical understanding. On the other hand, recent findings on the neural tangent ...

Introduction
  • Deep neural networks have been successful tools in many fields of artificial intelligence.
  • The authors try to explain this with a new transfer risk bound for converged, linearized student networks, based on the angle distribution in random feature space.
  • The authors show that a small portion of hard labels can correct the student's outputs pointwise and reduce the angle between the student's and the oracle's weights (this angle is written out just after this list).
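For reference, the weight angle used throughout this summary appears to be the ordinary cosine similarity between the trained student's weight change ∆w and the oracle model's ∆wg; this definition is inferred from the notation cos α(∆w, ∆wg) used in the Results, not quoted from the paper:

  cos α(∆w, ∆wg) = ⟨∆w, ∆wg⟩ / (‖∆w‖ · ‖∆wg‖)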
Highlights
  • Deep neural networks have been successful tools in many fields of artificial intelligence
  • We try to explain this with a new transfer risk bound for converged, linearized student networks, based on the angle distribution in random feature space
  • Theorem 2 (Effect of Hard Labels): We introduce correction logits δzh to approximate zs,eff, the solution of Eq. 3, as the linear combination zs,eff ≈ zt + (1 − ρ)δzh in the limit ρ → 1
  • We give a transfer risk bound based on the angle distribution in random feature space
  • Even though hard labels are data-inefficient, we demonstrate that they can correct an imperfect teacher's mistakes, and only a small portion of hard labels is needed in practical distillation
  • We would like to design a new form of distillation loss that does not suffer from discontinuity and, at the same time, can still correct the teacher's mistakes
Results
  • σ(z) = 1/(1 + exp(−z)) is the sigmoid function, zs,n = f(xn) are the output logits of the student network, yt,n = σ(zt,n/T) are the soft labels of the teacher network, and yg,n = 1{fg(xn) > 0} are the ground-truth hard labels (a sketch of the resulting mixed soft/hard loss follows this list).
  • The risk R shows a power-law relation with respect to the sample size n, and pure soft distillation shows a significantly faster convergence rate.
  • The authors first state its rigorous definition and discuss how the teacher's stopping epoch and the soft ratio affect data inefficiency.
  • [1] proves that, for a 2-layer over-parameterized network trained with ℓ2 loss, ∆zᵀΘn⁻¹∆z / n is a generalization bound for the global minimum.
  • The benefit of KD is that the teacher provides the student with a smoothed output function that is easier to train on than hard labels.
  • According to the empirical observations in the original KD paper [9], a small portion of hard labels is required to achieve a better generalization error than with pure soft labels.
  • The dependency of ∆ws,eff on ∆wg and ∆wt is implicit, so the authors only consider the effect of hard labels near pure soft distillation (ρ → 1) from an imperfect teacher.
  • The change of cos α(∆w, ∆wg) caused by adding hard labels can be approximated by a linear expansion, as summarized in the following theorem.
  • (Effect of Hard Labels) The authors introduce correction logits δzh to approximate zs,eff, the solution of Eq. 3, as the linear combination zs,eff ≈ zt + (1 − ρ)δzh in the limit ρ → 1.
  • It can be written as a projection ⟨δwh, ∆wc⟩, where δwh = φ(X)Θn⁻¹δzh is the change of the student's weight caused by adding hard labels, and ∆wc = …
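To make the loss components above concrete, here is a minimal sketch of a mixed soft/hard distillation loss for binary classification in the spirit of the original KD formulation [9]. The function name distillation_loss, the use of binary cross-entropy, and the scaling of the student's logits by T in the soft term are illustrative assumptions rather than the paper's exact definition.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y, eps=1e-12):
    """Binary cross-entropy between predicted probabilities p and targets y."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def distillation_loss(z_s, z_t, y_g, rho=0.9, T=4.0):
    """Mix of soft-label and hard-label terms for binary classification.

    z_s : student logits, shape (n,)
    z_t : teacher logits, shape (n,)
    y_g : ground-truth hard labels in {0, 1}, shape (n,)
    rho : soft ratio (rho = 1 is pure soft-label distillation)
    T   : distillation temperature
    """
    y_soft = sigmoid(z_t / T)                    # teacher's soft labels y_t = sigma(z_t / T)
    soft_term = bce(sigmoid(z_s / T), y_soft)    # match the teacher's softened output
    hard_term = bce(sigmoid(z_s), y_g)           # match the ground-truth hard labels
    return np.mean(rho * soft_term + (1.0 - rho) * hard_term)
```

With rho = 1 this reduces to pure soft-label distillation, and with rho = 0 to ordinary training on hard labels, matching the role of the soft ratio ρ discussed in the bullets above.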
Conclusion
  • The authors show that the fast decay of the transfer risk for pure soft-label distillation may be caused by the fast decay of the angle between the student's weight and that of the oracle model.
  • The authors show that early stopping of the teacher and distillation with a higher soft ratio are both beneficial for making efficient use of data.
  • The authors would like to tighten the transfer risk bound and extend it to practical, nonlinear neural networks.
Related work
  • Our work is built on the neural tangent kernel techniques introduced in [11, 13]. They find that, in the limit of an infinitely wide network, the Gram matrix of the network's random features tends to a fixed limit called the neural tangent kernel (NTK) and stays almost constant during training. This yields an equivalence of training dynamics between the original network and a linear model on the network's random features, so we replace the network with its linear model to avoid the difficulties of nonlinearity (a generic sketch of this linearization follows this paragraph). The most related work to ours is [18], which considers distillation of linear models and gives a loose transfer risk bound. Their bound is based on the probability distribution in feature space and is therefore different from the traditional generalization bounds given by Rademacher complexity. We improve their bound and generalize their formulation to the linearization of an actual neural network.
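The sketch below is a generic, self-contained illustration of this NTK linearization on toy data (in the style of [11, 13]), not the authors' code: the network f(x; w) is replaced by a linear model on the random features φ(x) = ∇w f(x; w0), whose n×n Gram matrix Θn is the empirical NTK. The width m, the choice of training only the first layer, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n samples in d dimensions, a width-m two-layer ReLU network.
n, d, m = 20, 5, 2048
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))          # stand-in target logits

# f(x; W) = (1/sqrt(m)) * sum_r a_r * relu(w_r^T x); only W is trained, a is fixed,
# the usual setting in which the NTK stays (nearly) constant during training.
W0 = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

def logits(W, X):
    """Network output f(x; W) for each row of X."""
    return (np.maximum(X @ W.T, 0.0) @ a) / np.sqrt(m)

def features(W, X):
    """Random features phi(x) = grad_W f(x; W), flattened; one column per sample."""
    gates = (X @ W.T > 0.0).astype(float)                         # ReLU gates, (n, m)
    grads = (gates * a / np.sqrt(m))[:, :, None] * X[:, None, :]  # (n, m, d)
    return grads.reshape(X.shape[0], -1).T                        # (m*d, n)

Phi = features(W0, X)                    # phi(X), one feature column per sample
Theta = Phi.T @ Phi                      # empirical NTK Gram matrix Theta_n, (n, n)

# Converged linearized "student": fit the target logits exactly on the training set.
delta_z = y - logits(W0, X)                          # required change of output logits
delta_w = Phi @ np.linalg.solve(Theta, delta_z)      # delta_w = phi(X) Theta_n^{-1} delta_z

# The generalization quantity from [1] quoted in the Results above.
bound_quantity = delta_z @ np.linalg.solve(Theta, delta_z) / n
print(bound_quantity)
```

Here delta_w plays the role of the weight change ∆w in the angle discussion above; replacing the targets y with a teacher's logits gives the corresponding distilled student in the same linearized picture.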
Funding
  • This project is supported by the National Defense Basic Scientific Research Project, China (No. JCKY2018204C004), the National Natural Science Foundation of China (Nos. 61806009 and 61932001), PKU-Baidu Funding 2019BD005, and the Beijing Academy of Artificial Intelligence (BAAI).
References
  • [1] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332, 2019.
  • [2] Shivam Barwey, Venkat Raman, and Adam Steinberg. Extracting information overlap in simultaneous OH-PLIF and PIV fields with neural networks. arXiv preprint arXiv:2003.03662, 2020.
  • [3] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, pages 10835–10845, 2019.
  • [4] Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning overparameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019.
  • [5] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4794–4802, 2019.
  • [6] Bin Dong, Jikai Hou, Yiping Lu, and Zhihua Zhang. Distillation ≈ early stopping? Harvesting dark knowledge utilizing anisotropic information retrieval for overparameterized neural network. arXiv preprint arXiv:1910.01255, 2019.
  • [7] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685, 2019.
  • [8] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607–1616, 2018.
  • [9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [10] Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219, 2017.
  • [11] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.
  • [12] Carlos Lassance, Myriam Bontonou, Ghouthi Boukli Hacene, Vincent Gripon, Jian Tang, and Antonio Ortega. Deep geometric knowledge distillation with graphs. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8484–8488. IEEE, 2020.
  • [13] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, pages 8570–8581, 2019.
  • [14] Seunghyun Lee and B. Song. Graph-based knowledge distillation by multi-head attention network. arXiv preprint arXiv:1907.02226, 2019.
  • [15] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
  • [16] Hossein Mobahi, Mehrdad Farajtabar, and Peter L. Bartlett. Self-distillation amplifies regularization in Hilbert space. arXiv preprint arXiv:2002.05715, 2020.
  • [17] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In 7th International Conference on Learning Representations, ICLR 2019, 2019.
  • [18] Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In International Conference on Machine Learning, pages 5142–5151, 2019.
  • [19] Ronen Basri, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. In Advances in Neural Information Processing Systems, pages 4763–4772, 2019.
  • [20] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
  • [21] Jiaxi Tang, Rakesh Shivanna, Zhe Zhao, Dong Lin, Anima Singh, Ed H. Chi, and Sagar Jain. Understanding and improving knowledge distillation. arXiv preprint arXiv:2002.03532, 2020.
  • [22] Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
  • [23] Zheng Xu, Yen-Chang Hsu, and Jiawei Huang. Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. arXiv preprint arXiv:1709.00513, 2017.
  • [24] Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523, 2019.
  • [25] Jaemin Yoo, Minyong Cho, Taebum Kim, and U Kang. Knowledge extraction with no observable data. In Advances in Neural Information Processing Systems, pages 2701–2710, 2019.
  • [26] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.