# Triple descent and the two kinds of overfitting: Where & why do they appear?

NeurIPS 2020


Abstract

A recent line of research has highlighted the existence of a double descent phenomenon in deep learning, whereby increasing the number of training examples $N$ causes the generalization error of neural networks to peak when $N$ is of the same order as the number of parameters $P$. In earlier works, a similar phenomenon was shown to exist…


Introduction

- A few years ago, deep neural networks achieved breakthroughs in a variety of contexts [1, 2, 3, 4].
- By studying the full (P, N ) phase space (Fig. 1, right), the authors disentangle the role of the linear and the nonlinear peaks in modern neural networks, and elucidate the role of the input dimension D.

Highlights

- A few years ago, deep neural networks achieved breakthroughs in a variety of contexts [1, 2, 3, 4]
- Recent developments show that deep neural networks, as well as other machine learning models, exhibit a starkly different behaviour
- Through a bias-variance decomposition of the test loss, we reveal that the linear peak is solely caused by overfitting the noise corrupting the labels, whereas the nonlinear peak is caused by the variance due to the initialization of the random feature vectors
- The triple descent curve presented here is of a different nature: it stems from the general properties of nonlinear projections, rather than from the particular structure chosen for the data [19] or the regression kernel [36]
- As explained in [41], the Gaussian Equivalence Theorem [11, 42, 41], which applies in this high-dimensional setting, establishes an equivalence to a Gaussian covariate model where the nonlinear activation function is replaced by a linear term and a nonlinear term acting as noise
- By elucidating the structure of the (P, N ) phase space, its dependency on D, and distinguishing the two different types of overfitting which it can exhibit, we believe our results can be of interest to practitioners
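
As a sketch of the equivalence mentioned above (in standard random-features notation; the constants below are a paraphrase of the usual statement, not quoted from this paper), the projected features behave like a linear part plus an effective Gaussian noise term:

$$
\sigma\!\left(\frac{W x}{\sqrt{D}}\right) \;\longleftrightarrow\; \mu_0 \;+\; \mu_1 \frac{W x}{\sqrt{D}} \;+\; \mu_\star\, \zeta,
\qquad \zeta \sim \mathcal{N}(0, 1)\ \text{i.i.d.},
$$

where $\mu_0 = \mathbb{E}[\sigma(z)]$, $\mu_1 = \mathbb{E}[z\,\sigma(z)]$ and $\mu_\star^2 = \mathbb{E}[\sigma(z)^2] - \mu_0^2 - \mu_1^2$ for $z \sim \mathcal{N}(0, 1)$.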

Results

- In Sec. 1, the authors demonstrate that the linear and nonlinear peaks are two different phenomena by showing that they can co-exist in the (P, N ) phase space in noisy regression tasks.
- Through a bias-variance decomposition of the test loss, the authors reveal that the linear peak is solely caused by overfitting the noise corrupting the labels, whereas the nonlinear peak is caused by the variance due to the initialization of the random feature vectors.
- The triple descent curve presented here is of a different nature: it stems from the general properties of nonlinear projections, rather than from the particular structure chosen for the data [19] or the regression kernel [36].
- The authors start by introducing the two models studied throughout the paper: on the analytical side, the random feature model; on the numerical side, a teacher-student task involving neural networks trained with gradient descent.
- As explained in [41], the Gaussian Equivalence Theorem [11, 42, 41], which applies in this high-dimensional setting, establishes an equivalence to a Gaussian covariate model where the nonlinear activation function is replaced by a linear term and a nonlinear term acting as noise.
- This peak appears starkly at N = P in the high-noise setup, where noise variance dominates the test loss, and in the noiseless setup (Fig. 6b), where the residual initialization variance dominates: nonlinear networks can overfit even in the absence of noise.
- In Fig. 7, the authors consider RF models with four different activation functions: absolute value (r = 0), ReLU (r = 0.5), Tanh (r ≈ 0.92) and linear (r = 1).
- In Sec. A of the SM, the authors present additional results where the degree of linearity r is varied systematically in the RF model, and show that replacing Tanh by ReLU in the NN setup produces a similar effect.
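
The degree of linearity r quoted above can be checked numerically. Assuming the common random-features definition r = ζ²/η with ζ = E[z σ(z)] and η = E[σ(z)²] for z ~ N(0, 1) (an assumption on my part, since the summary does not define r), Gauss-Hermite quadrature reproduces r = 0 for absolute value, r = 0.5 for ReLU and r = 1 for linear, and gives roughly 0.93 for Tanh, close to the quoted ≈ 0.92:

```python
import numpy as np

# Gauss-Hermite nodes/weights rescaled so that E[f(z)] ≈ sum(w * f(z)) for z ~ N(0, 1)
x, w = np.polynomial.hermite.hermgauss(80)
z = np.sqrt(2.0) * x
w = w / np.sqrt(np.pi)

def degree_of_linearity(sigma):
    """r = zeta^2 / eta, with zeta = E[z sigma(z)] and eta = E[sigma(z)^2]."""
    zeta = np.sum(w * z * sigma(z))
    eta = np.sum(w * sigma(z) ** 2)
    return zeta ** 2 / eta

activations = {
    "abs": np.abs,
    "relu": lambda u: np.maximum(u, 0.0),
    "tanh": np.tanh,
    "linear": lambda u: u,
}
r = {name: degree_of_linearity(f) for name, f in activations.items()}
for name, val in r.items():
    print(f"{name:>6}: r = {val:.3f}")
```

Intuitively, r measures how much of the activation's output power lies in its linear component: the odd-symmetric absolute value has none, ReLU splits evenly, and Tanh is close to linear near the origin.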

Conclusion

- This can be understood from the fact that the linear peak is already implicitly regularized by the nonlinearity for r < 1, as explained in Sec. 2.
- The authors believe it will provide deeper insight into data-model matching.


Related work

- Various sources of sample-wise non-monotonicity have been observed since the 1990s, from linear regression [14] to simple classification tasks [31, 32]. In the context of adversarial training, [33] shows that increasing N can help or hurt generalization depending on the strength of the adversary. In the non-parametric setting of [34], an upper bound on the test loss is shown to exhibit multiple descent, with peaks at each N = D^i, i ∈ ℕ. Two concurrent papers also discuss the existence of a triple descent curve, albeit of a different nature from ours. On one hand, [19] observes a sample-wise triple descent in a non-isotropic linear regression task. In their setup, the two peaks stem from the block structure of the covariance of the input data, which presents two eigenspaces of different variance; both peaks boil down to what we call "linear peaks". [35] pushed this idea to the extreme by designing the covariance matrix in such a way as to make an arbitrary number of linear peaks appear.

Funding

- GB acknowledges funding from the French government, under management of the Agence Nationale de la Recherche, as part of the "Investissements d’avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), and from the Simons Foundation collaboration "Cracking the Glass Problem" (No. 454935 to GB).

Reference

- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012.
- Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
- Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
- Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
- Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
- Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, 2020.
- S Spigler, M Geiger, S d’Ascoli, L Sagun, G Biroli, and M Wyart. A jamming transition from under-to over-parametrization affects generalization in deep learning. Journal of Physics A: Mathematical and Theoretical, 52(47):474001, 2019.
- Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
- Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.
- Preetum Nakkiran. More data can hurt for linear regression: Sample-wise double descent. arXiv preprint arXiv:1912.07242, 2019.
- Manfred Opper and Wolfgang Kinzel. Statistical mechanics of generalization. In Models of neural networks III, pages 151–209.
- Andreas Engel and Christian Van den Broeck. Statistical mechanics of learning. Cambridge University Press, 2001.
- Florent Krzakala and Jorge Kurchan. Landscape analysis of constraint satisfaction problems. Physical Review E, 76(2):021122, 2007.
- Silvio Franz and Giorgio Parisi. The simplest model of jamming. Journal of Physics A: Mathematical and Theoretical, 49(14):145001, 2016.
- Mario Geiger, Stefano Spigler, Stéphane d’Ascoli, Levent Sagun, Marco Baity-Jesi, Giulio Biroli, and Matthieu Wyart. Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E, 100(1):012115, 2019.
- Preetum Nakkiran, Prayaag Venkat, Sham Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent. arXiv preprint arXiv:2003.01897, 2020.
- Stéphane d’Ascoli, Maria Refinetti, Giulio Biroli, and Florent Krzakala. Double trouble in double descent: Bias and variance (s) in the lazy regime. arXiv preprint arXiv:2003.01054, 2020.
- Yann Le Cun, Ido Kanter, and Sara A Solla. Eigenvalues of covariance matrices: Application to neural-network learning. Physical Review Letters, 66(18):2396, 1991.
- Anders Krogh and John A Hertz. Generalization in a linear perceptron in the presence of noise. Journal of Physics A: Mathematical and General, 25(5):1135, 1992.
- Robert PW Duin. Small sample size generalization. In Proceedings of the Scandinavian Conference on Image Analysis, volume 2, pages 957–964. PROCEEDINGS PUBLISHED BY VARIOUS PUBLISHERS, 1995.
- Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
- Melikasadat Emami, Mojtaba Sahraee-Ardakan, Parthe Pandit, Sundeep Rangan, and Alyson K Fletcher. Generalization error of generalized linear models in high dimensions. arXiv preprint arXiv:2005.00180, 2020.
- Benjamin Aubin, Florent Krzakala, Yue M Lu, and Lenka Zdeborová. Generalization error in high-dimensional perceptrons: Approaching bayes error with convex optimization. arXiv preprint arXiv:2006.06560, 2020.
- Zeng Li, Chuanlong Xie, and Qinwen Wang. Provable more data hurt in high dimensional least squares estimator. arXiv preprint arXiv:2008.06296, 2020.
- Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
- Andrew K Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv preprint arXiv:1809.10374, 2018.
- Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
- Marco Loog and Robert PW Duin. The dipping phenomenon. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 310–317.
- Marco Loog, Tom Viering, and Alexander Mey. Minimizers of the empirical risk and risk monotonicity. In Advances in Neural Information Processing Systems, pages 7478–7487, 2019.
- Yifei Min, Lin Chen, and Amin Karbasi. The curious case of adversarially robust models: More data can help, double descend, or hurt generalization. arXiv preprint arXiv:2002.11080, 2020.
- Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the risk of minimum-norm interpolants and restricted lower isometry of kernels. arXiv preprint arXiv:1908.10292, 2019.
- Lin Chen, Yifei Min, Mikhail Belkin, and Amin Karbasi. Multiple descent: Design your own generalization curve. arXiv preprint arXiv:2008.01036, 2020.
- Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. arXiv preprint arXiv:2008.06786, 2020.
- Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model. arXiv preprint arXiv:2002.09339, 2020.
- Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, pages 2637–2646, 2017.
- Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
- Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 2933–2943. Curran Associates, Inc., 2019.
- S. Péché et al. A note on the Pennington-Worah distribution. Electronic Communications in Probability, 24, 2019.
- Sebastian Goldt, Galen Reeves, Marc Mézard, Florent Krzakala, and Lenka Zdeborová. The gaussian equivalence of generative models for learning with two-layer neural networks. arXiv preprint arXiv:2006.14709, 2020.
- Lucas Benigni and Sandrine Péché. Eigenvalue distribution of nonlinear models of random matrices. arXiv preprint arXiv:1904.03090, 2019.
- Ben Adlam, Jake Levinson, and Jeffrey Pennington. A random matrix perspective on mixtures of nonlinearities for deep learning. arXiv preprint arXiv:1912.00827, 2019.
- Zhenyu Liao and Romain Couillet. On the spectrum of random features maps of high dimensional data. arXiv preprint arXiv:1805.11916, 2018.
- Madhu Advani, Subhaneil Lahiri, and Surya Ganguli. Statistical mechanics of complex neural systems and high dimensional data. Journal of Statistical Mechanics: Theory and Experiment, 2013(03):P03014, 2013.
- Thomas Dupic and Isaac Pérez Castillo. Spectral density of products of wishart dilute random matrices. part i: the dense case. arXiv preprint arXiv:1401.7802, 2014.
- Gernot Akemann, Jesper R Ipsen, and Mario Kieburg. Products of rectangular random matrices: singular values and progressive scattering. Physical Review E, 88(5):052118, 2013.
- Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.
