# Label-Aware Neural Tangent Kernel: Toward Better Generalization and Local Elasticity

NeurIPS 2020


Abstract

As a popular approach to modeling the dynamics of training overparametrized neural networks (NNs), the neural tangent kernels (NTK) are known to fall behind real-world NNs in generalization ability. This performance gap is in part due to the *label-agnostic* nature of the NTK, which renders the resulting kernel not as *locally elastic*…

Introduction

- The last decade has witnessed the huge success of deep neural networks (NNs) in various machine learning tasks (LeCun et al, 2015).
- A venerable line of work relates overparametrized NNs to kernel regression from the perspective of their training dynamics, providing positive evidence towards understanding the optimization and generalization of NNs (Jacot et al, 2018; Chizat & Bach, 2018; Lee et al, 2019; Arora et al, 2019a; Chizat et al, 2019; Du et al, 2019; Li et al, 2019).
- An infinitely wide NN is “equivalent” to kernel regression with a deterministic kernel during the training process
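The deterministic kernel referenced above is, at finite width, simply the Gram matrix of parameter gradients at initialization. A minimal NumPy sketch for a 2-layer ReLU network, where the function name and the width `m` are illustrative choices rather than the paper's code:

```python
import numpy as np

def empirical_ntk(X, m=4096, seed=0):
    """Empirical NTK of f(x) = (1/sqrt(m)) * a^T relu(W x) at a random
    initialization: the Gram matrix of gradients w.r.t. all parameters."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((m, d))     # first-layer weights
    a = rng.choice([-1.0, 1.0], m)      # second-layer weights
    Z = X @ W.T                         # pre-activations, shape (n, m)
    A = np.maximum(Z, 0.0)              # ReLU outputs
    D = (Z > 0).astype(float)           # ReLU derivatives
    # grad wrt a_r:  relu(w_r.x)/sqrt(m)
    # grad wrt W_r:  a_r * 1{w_r.x > 0} * x / sqrt(m)
    K_a = (A @ A.T) / m
    K_W = ((D * a) @ (D * a).T) * (X @ X.T) / m
    return K_a + K_W
```

As `m` grows, this matrix concentrates around the deterministic infinite-width NTK, which is what makes the kernel-regression equivalence possible.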

Highlights

- The last decade has witnessed the huge success of deep neural networks (NNs) in various machine learning tasks (LeCun et al, 2015)
- Starting from a random initialization, researchers demonstrate that the evolution of NNs in terms of predictions can be well captured by kernel gradient descent
- The second one is based on approximately solving the neural tangent hierarchy (NTH), an infinite hierarchy of ordinary differential equations that gives a precise description of the training dynamics (1) (Huang & Yau, 2019), and we show this kernel approximates K_t^(2) strictly better than E_init[K_0^(2)] does (Theorem 2.1)
- In view of the above lemma, the two label-aware NTKs (LANTKs) we proposed can be regarded as truncated versions of (8), where the truncation happens at the second level
- We find that LANTK is more locally elastic than neural tangent kernel (NTK), indicating that LANTK better simulates the qualitative behaviors of NNs
- We proposed the notion of label-awareness to explain the performance gap between a model trained by NTK and real-world NNs
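The kernel gradient descent mentioned in the highlights, discretized for the squared loss, updates the training predictions as f ← f − η·K·(f − y). A minimal sketch, with illustrative names and defaults not taken from the paper:

```python
import numpy as np

def kernel_gd_predictions(K, y, lr, steps):
    """Discrete kernel gradient descent on training predictions:
    f_{t+1} = f_t - lr * K @ (f_t - y), starting from f_0 = 0."""
    f = np.zeros_like(y, dtype=float)
    for _ in range(steps):
        f = f - lr * K @ (f - y)
    return f
```

For a positive-definite kernel matrix and a learning rate below 2 divided by its largest eigenvalue, the predictions converge to the labels, mirroring how wide-NN training interpolates the training data.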

Methods

- Though the authors have proved in Theorem 2.1 that K(NTH) possesses favorable theoretical guarantees, its computational cost is prohibitive for large-scale experiments.
- In the rest of this section, all experiments on LANTKs are based on K(HR).
- The authors compare the generalization ability of the LANTK to its label-agnostic counterpart in both binary and multi-class image classification tasks on CIFAR-10.
- The implementation of CNTK is based on Novak et al (2020) and the details of the architecture can be found in Appx.

Conclusion

- The authors proposed the notion of label-awareness to explain the performance gap between a model trained by NTK and real-world NNs.
- The exact computation of K(NTH) requires at least O(n^4) time, since the matrix E_init[K_0^(4)(x, x′, ·, ·)] has dimension n^2 × n^2.
- It would greatly improve the practical usage of the proposed kernels if more efficient implementations were available
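The O(n^4) bottleneck noted above already bites at the storage level: an n^2 × n^2 matrix has n^4 entries. A back-of-the-envelope helper (purely illustrative) makes the scaling concrete:

```python
def nth_kernel_cost(n, bytes_per_entry=8):
    """Rough cost of materializing the fourth-order kernel matrix
    E_init[K_0^(4)], which is n^2 x n^2: n^4 entries in total."""
    entries = n ** 4
    return entries, entries * bytes_per_entry  # (entry count, bytes)

# e.g. n = 10000 training points gives 1e16 float64 entries (~80 PB),
# which is why exact computation is infeasible at that scale.
```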

- Table 1: Performance of LANTK, CNTK, and CNN on binary image classification tasks on CIFAR-10. Note that the best LANTK (indicated as LANTK-best) significantly outperforms CNTK. Here, “abs” stands for absolute improvement and “rel” stands for relative error reduction
- Table 2: Performance of LANTK, CNTK, and CNN on multi-class image classification tasks on CIFAR-10. The improvement is more evident than in the binary classification tasks
- Table 3: Strength of local elasticity in binary classification tasks on CIFAR-10. Training makes NNs more locally elastic, and LANTK successfully simulates this behavior

Related work

- Kernels and NNs. Starting from Neal (1996), a line of work considers infinitely wide NNs whose parameters are chosen randomly and only the last layer is optimized (Williams, 1997; Le Roux & Bengio, 2007; Hazan & Jaakkola, 2015; Lee et al, 2017; Matthews et al, 2018; Novak et al, 2018; Garriga-Alonso et al, 2018; Yang, 2019). When the loss is the least squares loss, this gives rise to a class of interesting kernels different from the NTK. On the other hand, if all layers are trained by gradient descent, infinitely wide NNs give rise to the NTK (Jacot et al, 2018; Chizat & Bach, 2018; Lee et al, 2019; Arora et al, 2019a; Chizat et al, 2019; Du et al, 2019; Li et al, 2019), and the NTK also appears implicitly in many works when studying the optimization trajectories of NN training (Li & Liang, 2018; Allen-Zhu et al, 2018; Du et al, 2018; Ji & Telgarsky, 2019; Chen et al, 2019).

Limitations of the NTK and corrections. Arora et al (2019b) demonstrate that on many small datasets, models trained with the NTK can outperform the corresponding NNs. But for moderately large-scale tasks and practical architectures, a performance gap between the two has been empirically observed in many places and further confirmed by a series of theoretical works (Chizat et al, 2019; Ghorbani et al, 2019b; Yehudai & Shamir, 2019; Bietti & Mairal, 2019; Ghorbani et al, 2019a; Wei et al, 2019; Allen-Zhu & Li, 2019). This observation motivates various attempts to mitigate the gap, such as incorporating pooling layers and data augmentation into the NTK (Li et al, 2019), deriving higher-order expansions around the initialization (Bai & Lee, 2019; Bai et al, 2020), doing finite-width corrections (Hanin & Nica, 2019), and, most related to our work, the NTH (Huang & Yau, 2019).

Funding

- This work was in part supported by NSF through CAREER DMS-1847415 and CCF-1934876, an Alfred Sloan Research Fellowship, the Wharton Dean’s Research Fund, and Contract FA8750-19-20201 with the US Defense Advanced Research Projects Agency (DARPA)

Study subjects and analysis

pairs: 5

Binary classification. We first choose five pairs of categories on which the performance of the 2-layer fully-connected NTK is neither too high (otherwise the improvement would be marginal) nor too low (otherwise it may be difficult for Z to extract useful information). We then randomly sample 10,000 examples as the training data and another 2,000 as the test data, under the constraint that the numbers of positive and negative examples are equal.
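The balanced-sampling protocol above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code; the function name and seed are assumptions:

```python
import numpy as np

def balanced_split(labels, n_train=10000, n_test=2000, seed=0):
    """Sample a class-balanced train/test split for a binary task.
    `labels` is a 0/1 array; returns disjoint index arrays with equal
    numbers of positive and negative examples in each split."""
    rng = np.random.default_rng(seed)
    pos = rng.permutation(np.where(labels == 1)[0])
    neg = rng.permutation(np.where(labels == 0)[0])
    half_tr, half_te = n_train // 2, n_test // 2
    train = np.concatenate([pos[:half_tr], neg[:half_tr]])
    test = np.concatenate([pos[half_tr:half_tr + half_te],
                           neg[half_tr:half_tr + half_te]])
    return rng.permutation(train), rng.permutation(test)
```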

Reference

- Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
- Zeyuan Allen-Zhu and Yuanzhi Li. What can resnet learn efficiently, going beyond kernels? In Advances in Neural Information Processing Systems, pp. 9015–9025, 2019.
- Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962, 2018.
- Kazuhiko Aomoto. Analytic structure of Schläfli function. Nagoya Mathematical Journal, 68:1–16, 1977.
- Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pp. 8139–8148, 2019a.
- Sanjeev Arora, Simon S Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, and Dingli Yu. Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663, 2019b.
- Yu Bai and Jason D Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. arXiv preprint arXiv:1910.01619, 2019.
- Yu Bai, Ben Krause, Huan Wang, Caiming Xiong, and Richard Socher. Taylorized training: Towards better approximation of neural network training at finite width. arXiv preprint arXiv:2002.04010, 2020.
- Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, pp. 12873–12884, 2019.
- Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? arXiv preprint arXiv:1911.12360, 2019.
- Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. 2018.
- Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pp. 2933–2943, 2019.
- Youngmin Cho and Lawrence K Saul. Large-margin classification in infinite neural networks. Neural computation, 22(10):2678–2697, 2010.
- Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. L2 regularization for learning kernels. arXiv preprint arXiv:1205.2653, 2012.
- Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz S Kandola. On kernel-target alignment. In Advances in neural information processing systems, pp. 367–373, 2002.
- Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
- Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
- Simon S Du, Kangcheng Hou, Russ R Salakhutdinov, Barnabas Poczos, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems, pp. 5724–5734, 2019.
- Stanislav Fort, Paweł Krzysztof Nowak, Stanislaw Jastrzebski, and Srini Narayanan. Stiffness: A new perspective on generalization in neural networks. arXiv preprint arXiv:1901.09491, 2019.
- Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow gaussian processes. arXiv preprint arXiv:1808.05587, 2018.
- Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Limitations of lazy training of two-layers neural network. In Advances in Neural Information Processing Systems, pp. 9108–9118, 2019a.
- Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019b.
- Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. Journal of machine learning research, 12(64):2211–2268, 2011.
- Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. JMLR. org, 2017.
- Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. arXiv preprint arXiv:1909.05989, 2019.
- Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.
- Hangfeng He and Weijie J. Su. The local elasticity of neural networks. In International Conference on Learning Representations, 2020.
- Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
- Jiaoyang Huang and Horng-Tzer Yau. Dynamics of deep neural networks and neural tangent hierarchy. arXiv preprint arXiv:1909.08156, 2019.
- Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.
- Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. arXiv preprint arXiv:1909.12292, 2019.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
- A Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
- Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine learning research, 5(Jan):27–72, 2004.
- Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. In Artificial Intelligence and Statistics, pp. 404–411, 2007.
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
- Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
- Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha SohlDickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neural information processing systems, pp. 8570–8581, 2019.
- Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157–8166, 2018.
- Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S Du, Wei Hu, Ruslan Salakhutdinov, and Sanjeev Arora. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019.
- Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.
- Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are gaussian processes. arXiv preprint arXiv:1810.05148, 2018.
- Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. Neural tangents: Fast and easy infinite neural networks in python. In International Conference on Learning Representations, 2020. URL https://github.com/google/neural-tangents.
- Jason M Ribando. Measuring solid angles beyond dimension three. Discrete & Computational Geometry, 36(3):479–487, 2006.
- Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE, 2015.
- Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
- Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.
- Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pp. 9709–9721, 2019.
- Christopher KI Williams. Computing with infinite networks. In Advances in neural information processing systems, pp. 295–301, 1997.
- Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
- Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. In Advances in Neural Information Processing Systems, pp. 6594–6604, 2019.
