Label-Aware Neural Tangent Kernel: Toward Better Generalization and Local Elasticity

NeurIPS 2020


Abstract

As a popular approach to modeling the dynamics of training overparametrized neural networks (NNs), the neural tangent kernels (NTK) are known to fall behind real-world NNs in generalization ability. This performance gap is in part due to the label-agnostic nature of the NTK, which renders the resulting kernel not as locally elastic as NNs…
Introduction
Highlights
  • The last decade has witnessed the huge success of deep neural networks (NNs) in various machine learning tasks (LeCun et al., 2015)
  • Starting from a random initialization, researchers demonstrate that the evolution of NN predictions can be well captured by a kernel gradient descent dynamics (see the display after this list)
  • The second one is based on approximately solving the neural tangent hierarchy (NTH), an infinite hierarchy of ordinary differential equations that gives a precise description of the training dynamics (1) (Huang & Yau, 2019), and we show this kernel approximates K_t^(2) strictly better than E_init[K_0^(2)] does (Theorem 2.1)
  • In view of the above lemma, the two label-aware NTKs (LANTKs) we propose can be regarded as truncated versions of (8), where the truncation happens at the second level (sketched schematically after this list)
  • We find that LANTK is more locally elastic than the neural tangent kernel (NTK), indicating that LANTK better simulates the qualitative behaviors of NNs
  • We proposed the notion of label-awareness to explain the performance gap between a model trained by NTK and real-world NNs
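For context, the kernel gradient descent dynamics referenced above is the standard NTK training flow for the squared loss (Jacot et al., 2018), written here up to a learning-rate factor. The second display is only a schematic of a second-level truncation: the label-similarity function Z and the 1/n^2 weighting are notational assumptions for illustration, not the paper's exact equations (1) and (8).

\frac{\mathrm{d}}{\mathrm{d}t}\, u_t(x) = -\sum_{i=1}^{n} K_t(x, x_i)\,\bigl(u_t(x_i) - y_i\bigr)

K_{\mathrm{LA}}(x, x') \;\approx\; K^{(1)}(x, x') \;+\; \frac{1}{n^2} \sum_{i,j=1}^{n} Z(y_i, y_j)\, K^{(2)}(x, x', x_i, x_j)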
Methods
  • Though the authors have proved in Theorem 2.1 that K^(NTH) possesses favorable theoretical guarantees, its computational cost is prohibitive for large-scale experiments.
  • In the rest of this section, all experiments on LANTKs are based on K^(HR).
  • The authors compare the generalization ability of the LANTK to its label-agnostic counterpart in both binary and multi-class image classification tasks on CIFAR-10.
  • The implementation of CNTK is based on Novak et al. (2020) and the details of the architecture can be found in the appendix (a minimal usage sketch of the library follows this list).
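For reference, the CNTK baseline can be computed with the neural-tangents library of Novak et al. (2020). The sketch below is a minimal, illustrative configuration: the depth, channel counts, filter sizes, and toy inputs are assumptions, not the paper's exact architecture or data pipeline.

import jax.random as random
from neural_tangents import stax

# A small infinite-width convolutional architecture. The layer choices here
# are illustrative assumptions, not the CNTK architecture used in the paper.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Conv(64, (3, 3), padding='SAME'), stax.Relu(),
    stax.Conv(64, (3, 3), padding='SAME'), stax.Relu(),
    stax.Flatten(),
    stax.Dense(1),
)

# Toy inputs standing in for CIFAR-10 images (NHWC layout).
x_train = random.normal(random.PRNGKey(0), (20, 32, 32, 3))
x_test = random.normal(random.PRNGKey(1), (5, 32, 32, 3))

# Closed-form CNTK Gram matrices; the 'ntk' argument selects the tangent kernel.
k_test_train = kernel_fn(x_test, x_train, 'ntk')    # shape (5, 20)
k_train_train = kernel_fn(x_train, x_train, 'ntk')  # shape (20, 20)

These Gram matrices can then be plugged into standard kernel regression or kernel ridge regression to produce CNTK predictions.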
Conclusion
  • The authors proposed the notion of label-awareness to explain the performance gap between a model trained by NTK and real-world NNs.
  • The exact computation of K^(NTH) requires at least O(n^4) time, since the dimension of the matrix E_init[K_0^(4)(x, x', ·, ·)] is n^2 × n^2 (see the back-of-the-envelope note after this list).
  • It would greatly improve the practical usage of the proposed kernels if more efficient implementations were available.
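For a rough sense of scale, a back-of-the-envelope count (an illustrative calculation using the n = 10,000 training examples from the experiments below, not a figure reported in the paper):

n^2 \times n^2 = (10^4)^2 \times (10^4)^2 = 10^8 \times 10^8 = 10^{16} \text{ entries},

so even storing E_init[K_0^(4)] at this sample size, let alone computing it exactly, is infeasible.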
Tables
  • Table 1: Performance of LANTK, CNTK, and CNN on binary image classification tasks on CIFAR-10. Note that the best LANTK (indicated as LANTK-best) significantly outperforms CNTK. Here, "abs" stands for absolute improvement and "rel" stands for relative error reduction
  • Table 2: Performance of LANTK, CNTK, and CNN on multi-class image classification tasks on CIFAR-10. The improvement is more evident than in the binary classification setting
  • Table 3: Strength of local elasticity in binary classification tasks on CIFAR-10. Training makes NNs more locally elastic, and LANTK successfully simulates this behavior
Funding
  • This work was in part supported by NSF through CAREER DMS-1847415 and CCF-1934876, an Alfred Sloan Research Fellowship, the Wharton Dean’s Research Fund, and Contract FA8750-19-20201 with the US Defense Advanced Research Projects Agency (DARPA)
Study subjects and analysis
Binary classification (5 pairs of categories). We first choose five pairs of categories on which the performance of the 2-layer fully-connected NTK is neither too high (otherwise the improvement will be marginal) nor too low (otherwise it may be difficult for Z to extract useful information). We then randomly sample 10,000 examples as the training data and another 2,000 as the test data, under the constraint that the sizes of positive and negative examples are equal (a minimal sketch of this sampling step follows).
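A minimal sketch of the balanced subsampling described above, assuming the CIFAR-10 images and labels have already been loaded into NumPy arrays; the function name, array names, and chosen class pair are illustrative assumptions rather than the paper's code.

import numpy as np

def balanced_binary_subset(images, labels, class_a, class_b,
                           n_train=10000, n_test=2000, seed=0):
    """Sample a balanced binary train/test split for one pair of classes.

    Assumes the full 60,000-image CIFAR-10 (6,000 images per class) has been
    pooled, so each class can contribute (n_train + n_test) / 2 examples.
    """
    rng = np.random.default_rng(seed)
    per_class = (n_train + n_test) // 2
    chosen = []
    for cls in (class_a, class_b):
        idx = np.flatnonzero(labels == cls)
        chosen.append(rng.choice(idx, size=per_class, replace=False))
    idx_a, idx_b = chosen
    # First half of each class goes to training, the rest to the test set.
    train_idx = np.concatenate([idx_a[: n_train // 2], idx_b[: n_train // 2]])
    test_idx = np.concatenate([idx_a[n_train // 2:], idx_b[n_train // 2:]])
    # Recode labels to +/-1 for binary classification.
    y = np.where(labels == class_a, 1, -1)
    return (images[train_idx], y[train_idx]), (images[test_idx], y[test_idx])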

Reference
  • Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
  • Zeyuan Allen-Zhu and Yuanzhi Li. What can ResNet learn efficiently, going beyond kernels? In Advances in Neural Information Processing Systems, pp. 9015–9025, 2019.
  • Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962, 2018.
  • Kazuhiko Aomoto. Analytic structure of Schläfli function. Nagoya Mathematical Journal, 68:1–16, 1977.
  • Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pp. 8139–8148, 2019a.
  • Sanjeev Arora, Simon S Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, and Dingli Yu. Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663, 2019b.
  • Yu Bai and Jason D Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. arXiv preprint arXiv:1910.01619, 2019.
  • Yu Bai, Ben Krause, Huan Wang, Caiming Xiong, and Richard Socher. Taylorized training: Towards better approximation of neural network training at finite width. arXiv preprint arXiv:2002.04010, 2020.
  • Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, pp. 12873–12884, 2019.
  • Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep ReLU networks? arXiv preprint arXiv:1911.12360, 2019.
  • Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. 2018.
  • Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pp. 2933–2943, 2019.
  • Youngmin Cho and Lawrence K Saul. Large-margin classification in infinite neural networks. Neural Computation, 22(10):2678–2697, 2010.
  • Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. L2 regularization for learning kernels. arXiv preprint arXiv:1205.2653, 2012.
  • Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz S Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems, pp. 367–373, 2002.
  • Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
  • Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
  • Simon S Du, Kangcheng Hou, Russ R Salakhutdinov, Barnabas Poczos, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems, pp. 5724–5734, 2019.
  • Stanislav Fort, Paweł Krzysztof Nowak, Stanislaw Jastrzebski, and Srini Narayanan. Stiffness: A new perspective on generalization in neural networks. arXiv preprint arXiv:1901.09491, 2019.
  • Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. arXiv preprint arXiv:1808.05587, 2018.
  • Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Limitations of lazy training of two-layers neural network. In Advances in Neural Information Processing Systems, pp. 9108–9118, 2019a.
  • Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019b.
  • Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(64):2211–2268, 2011.
  • Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1321–1330. JMLR.org, 2017.
  • Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. arXiv preprint arXiv:1909.05989, 2019.
  • Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.
  • Hangfeng He and Weijie J. Su. The local elasticity of neural networks. In International Conference on Learning Representations, 2020.
  • Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
  • Jiaoyang Huang and Horng-Tzer Yau. Dynamics of deep neural networks and neural tangent hierarchy. arXiv preprint arXiv:1909.08156, 2019.
  • Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.
  • Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. arXiv preprint arXiv:1909.12292, 2019.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
  • A Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
  • Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5(Jan):27–72, 2004.
  • Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. In Artificial Intelligence and Statistics, pp. 404–411, 2007.
  • Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
  • Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, pp. 8570–8581, 2019.
  • Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157–8166, 2018.
  • Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S Du, Wei Hu, Ruslan Salakhutdinov, and Sanjeev Arora. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019.
  • Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.
  • Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. arXiv preprint arXiv:1810.05148, 2018.
  • Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. Neural Tangents: Fast and easy infinite neural networks in Python. In International Conference on Learning Representations, 2020. URL https://github.com/google/neural-tangents.
  • Jason M Ribando. Measuring solid angles beyond dimension three. Discrete & Computational Geometry, 36(3):479–487, 2006.
  • Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE, 2015.
  • Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  • Aad W Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
  • Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pp. 9709–9721, 2019.
  • Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pp. 295–301, 1997.
  • Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
  • Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. In Advances in Neural Information Processing Systems, pp. 6594–6604, 2019.