# On the training dynamics of deep networks with $L_2$ regularization

NeurIPS 2020

Abstract

We study the role of L2 regularization in deep learning, and uncover simple relations between the performance of the model, the L2 coefficient, the learning rate, and the number of training steps. These empirical relations hold when the network is overparameterized. They can be used to predict the optimal regularization parameter of a giv...

Introduction

- Machine learning models are commonly trained with L2 regularization. This involves adding the term $\frac{1}{2}\lambda\|\theta\|_2^2$ to the loss function, where $\theta$ is the vector of model parameters and $\lambda$ is a hyperparameter.
- In the context of linear regression, L2 regularization increases the bias of the learned parameters while reducing their variance across instantiations of the training data; in other words, it is a manifestation of the bias-variance tradeoff.
- While the use of L2 regularization is prevalent and often leads to improved performance in practical settings [Hinton, 1986], the theoretical motivation for its use is less clear.
- The goal of this paper is to improve the understanding of the role of L2 regularization in deep learning.
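The bias-variance statement for linear regression can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes a one-parameter model $y = \theta x$ with Gaussian noise, a fixed set of inputs, and the penalty $\frac{1}{2}\lambda\theta^2$. The function names, the true slope, and the noise level are illustrative assumptions.

```python
import random
import statistics


def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of 0.5 * sum((theta*x - y)**2) + 0.5 * lam * theta**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def bias_and_variance(lam, true_slope=2.0, noise_std=1.0, n_datasets=2000, seed=0):
    """Estimate bias and variance of the learned slope over resampled training sets."""
    rng = random.Random(seed)
    xs = [0.5 * i for i in range(1, 6)]  # fixed inputs: only the noise is resampled
    estimates = []
    for _ in range(n_datasets):
        ys = [true_slope * x + rng.gauss(0.0, noise_std) for x in xs]
        estimates.append(ridge_slope(xs, ys, lam))
    bias = statistics.fmean(estimates) - true_slope
    variance = statistics.pvariance(estimates)
    return bias, variance


# Hypothetical comparison: no regularization vs. a fairly strong penalty.
bias_plain, var_plain = bias_and_variance(lam=0.0)
bias_reg, var_reg = bias_and_variance(lam=10.0)
print(f"lam=0:  bias={bias_plain:+.3f}, variance={var_plain:.4f}")
print(f"lam=10: bias={bias_reg:+.3f}, variance={var_reg:.4f}")
```

With the penalty switched on, the estimated slope is shrunk toward zero, so its spread across training sets falls while its systematic offset from the true slope grows.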

Highlights

- Machine learning models are commonly trained with L2 regularization.
- We study the role of L2 regularization when training over-parameterized deep networks, taken here to mean networks that can reach a training accuracy of 1 when trained with stochastic gradient descent (SGD).
- We make two empirical observations: (1) the time it takes the network to reach peak performance is proportional to $1/\lambda$, where $\lambda$ is the L2 regularization parameter, and (2) the performance reached in this way is independent of $\lambda$ when $\lambda$ is not too large.
- Based on these observations we propose a dynamical schedule for the regularization parameter that improves performance and speeds up training.
- The second proposal is AUTOL2, an automatic L2 parameter schedule. This method leads to better performance and faster training when compared against training with a tuned L2 parameter.
- In the presence of learning rate schedules, however, AUTOL2 does not perform as well as a tuned but constant L2 parameter.
- We find that these proposals work well when training with a constant learning rate; we leave an extension of these methods to networks trained with learning rate schedules to future work.
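The abstract notes that the empirical relations can be used to predict the optimal regularization parameter. A minimal sketch of how such a prediction might look, assuming the time to peak performance scales as $c/\lambda$ for some constant $c$: measure $c$ from one short probe run with a large L2 coefficient, then pick $\lambda = c/T$ for a training budget of $T$ steps. The function names and all numbers below are hypothetical, and this is not claimed to be the authors' exact procedure.

```python
def fit_time_constant(probe_l2, probe_peak_steps):
    """Estimate c in t_peak ~ c / l2 from a single probe run (assumed relation)."""
    if probe_l2 <= 0 or probe_peak_steps <= 0:
        raise ValueError("probe_l2 and probe_peak_steps must be positive")
    return probe_l2 * probe_peak_steps


def predict_l2(budget_steps, time_constant):
    """Choose the L2 coefficient whose predicted peak time equals the step budget."""
    if budget_steps <= 0:
        raise ValueError("budget_steps must be positive")
    return time_constant / budget_steps


# Hypothetical probe: with l2 = 0.01 the test accuracy peaked after 500 steps.
c = fit_time_constant(probe_l2=0.01, probe_peak_steps=500)
for budget in (1_000, 10_000, 100_000):
    print(budget, predict_l2(budget, c))
```

Under this assumption a tenfold larger budget calls for a tenfold smaller coefficient, which is why the probe run can be much cheaper than the final run.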

Methods

- Performance and time scales: the authors turn to an empirical study of networks trained with L2 regularization.
- The authors present results for a fully-connected network trained on MNIST, a Wide ResNet [Zagoruyko and Komodakis, 2016] trained on CIFAR-10, and CNNs trained on CIFAR-10.
- The empirical findings discussed in section 1.1 hold across this variety of overparameterized setups.
- [Figure: (a) FC, test accuracy; (b) FC, epochs to maximum test accuracy. The curve labels did not survive extraction.]

Results

- Based on these observations the authors propose a dynamical schedule for the regularization parameter that improves performance and speeds up training.

Conclusion

- In this work the authors consider the effect of L2 regularization on overparameterized networks.
- They make two empirical observations: (1) the time it takes the network to reach peak performance is proportional to $1/\lambda$, where $\lambda$ is the L2 regularization parameter, and (2) the performance reached in this way is independent of $\lambda$ when $\lambda$ is not too large.
- The authors find that these observations hold for a variety of overparameterized training setups; see the supplementary material (SM) for some examples where they do not hold.
- The second proposal is AUTOL2, an automatic L2 parameter schedule. In the experiments, this method leads to better performance and faster training when compared against training with a tuned L2 parameter.
- The authors find that these proposals work well when training with a constant learning rate; they leave an extension of these methods to networks trained with learning rate schedules to future work.
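The summary does not spell out the rule AUTOL2 uses, so the following is only a sketch of one way a dynamical L2 schedule could be organized: start with a large coefficient and divide it by a fixed factor whenever a monitored loss stops improving. The class name, the plateau criterion, and every default value are assumptions made for illustration, not the authors' algorithm.

```python
class L2DecayOnPlateau:
    """Hypothetical schedule: shrink the L2 coefficient when the loss plateaus."""

    def __init__(self, initial_l2=0.1, factor=10.0, patience=2, min_l2=1e-6):
        self.l2 = initial_l2
        self.factor = factor
        self.patience = patience
        self.min_l2 = min_l2
        self.best_loss = float("inf")
        self.stale_checks = 0

    def update(self, loss):
        """Record one loss measurement and return the coefficient to use next."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.stale_checks = 0
        else:
            self.stale_checks += 1
            if self.stale_checks >= self.patience:
                # Assumed rule: divide by `factor`, but never go below `min_l2`.
                self.l2 = max(self.l2 / self.factor, self.min_l2)
                self.stale_checks = 0
        return self.l2


# Made-up loss measurements, one per evaluation interval.
schedule = L2DecayOnPlateau()
losses = [1.0, 0.8, 0.7, 0.7, 0.71, 0.6, 0.6, 0.6]
history = [schedule.update(loss) for loss in losses]
print(history)
```

In a training loop, the value returned by `update` would multiply the $\frac{1}{2}\|\theta\|_2^2$ term for the steps that follow, so the coefficient is large early in training and small later.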


Related work

- L2 regularization in the presence of batch normalization [Ioffe and Szegedy, 2015] has been studied in [van Laarhoven, 2017; Hoffer et al., 2018; Zhang et al., 2018]. These papers discuss how, for scale-invariant models, the effect of L2 regularization is merely to induce an effective learning rate (with no L2 term). This was made precise by Li and Arora [2019], who showed that this effective learning rate is $\eta_{\text{eff}} = \eta e^{2\eta\lambda t}$ (at small learning rates). Our theoretical analysis of large-width networks has the same behaviour when the network is scale invariant. Finally, in parallel to this work, Li et al. [2020] carried out a complementary analysis of the role of L2 regularization in deep learning using a stochastic differential equation analysis. Their conclusions regarding the effective learning rate in the presence of L2 regularization are consistent with our observations.
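A short heuristic derivation shows where an exponentially growing effective learning rate can come from. It is a sketch under stated assumptions, not the argument of Li and Arora: the loss $L$ is scale invariant, training uses plain gradient descent with learning rate $\eta$ and L2 coefficient $\lambda$, and terms of order $\eta^2$ are dropped.

```latex
% Assumption: L(a\theta) = L(\theta) for all a > 0 (scale invariance).
\begin{align}
\nabla L(a\theta) &= \tfrac{1}{a}\,\nabla L(\theta),
  \qquad \theta \cdot \nabla L(\theta) = 0, \\
\theta_{t+1} &= (1 - \eta\lambda)\,\theta_t - \eta\,\nabla L(\theta_t), \\
\|\theta_{t+1}\|^2 &= (1 - \eta\lambda)^2\,\|\theta_t\|^2
  + \eta^2\,\|\nabla L(\theta_t)\|^2
  \;\approx\; e^{-2\eta\lambda}\,\|\theta_t\|^2, \\
\eta_{\mathrm{eff}}(t) &\equiv \frac{\eta}{\|\theta_t\|^2}
  \;\approx\; \frac{\eta}{\|\theta_0\|^2}\; e^{2\eta\lambda t}.
\end{align}
```

The last line uses the fact that, for a scale-invariant loss, a step of size $\eta$ on $\theta$ moves the direction $\theta/\|\theta\|$ as if the learning rate were $\eta/\|\theta\|^2$. With $\|\theta_0\| = 1$ this matches the quoted form of the effective learning rate.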

Subjects and analysis

MNIST samples: 512

- Wide ResNet 28-10 trained on CIFAR-10 with momentum and data augmentation. (a) Final test accuracy vs. the L2 parameter $\lambda$. When the network is trained for a fixed number of epochs, optimal performance is achieved at a certain value of $\lambda$. But when trained for a time proportional to $1/\lambda$, performance plateaus and remains constant down to the lowest values of $\lambda$ tested. This experiment includes a learning rate schedule. (b) Test accuracy vs. training epochs for the predicted optimal L2 parameter compared with the tuned parameter. (c) Training curves with our dynamical L2 schedule, compared with a tuned, constant L2 parameter.
- Sweep over $\eta$ and $\lambda$, illustrating how smaller $\lambda$'s require longer times to achieve the same performance. In the left and middle plots, the learning rates are logarithmically spaced between the values displayed in the legend; the specific values are in SM A. Left: epochs to maximum test accuracy (within 0.5%). Middle: maximum test accuracy (the $\lambda = 0$ line denotes the maximum test accuracy achieved among all learning rates). Right: maximum test accuracy for a fixed time budget. (a,b,c) Fully connected 3-hidden-layer neural network evaluated on 512 MNIST samples, evolved for $t \cdot \eta \cdot \lambda = 2$; $\eta = 0.15$ in (c). (d,e,f) A Wide Residual Network 28-10 trained on CIFAR-10 without data augmentation, evolved for $t \cdot \eta \cdot \lambda = 0.1$; in (f), $\eta = 0.2$. The $\lambda = 0$ line was evolved for longer than the smallest L2 but there is still a gap.
- CNNs trained with and without batch-norm with learning rate $\eta = 0.01$. Presented results follow the same format as Figure 2.
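The captions describe runs "evolved for" a fixed value of the product of training time, learning rate, and L2 coefficient (2 for the fully connected network, 0.1 for the Wide ResNet). A minimal sketch of how a training length could be derived from such a rule follows. Whether the time is counted in steps or epochs is not stated here, so the unit, the function name, and the list of L2 values are assumptions.

```python
import math


def steps_for_fixed_product(learning_rate, l2, product):
    """Smallest integer t with t * learning_rate * l2 >= product."""
    if learning_rate <= 0 or l2 <= 0 or product <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(product / (learning_rate * l2))


# Learning rate 0.15 and product 2 as in the fully connected caption;
# the L2 values themselves are hypothetical.
for l2 in (1e-2, 1e-3, 1e-4):
    print(l2, steps_for_fixed_product(learning_rate=0.15, l2=l2, product=2.0))
```

Each tenfold reduction of the L2 coefficient lengthens training by about a factor of ten, which is the sense in which smaller coefficients "require longer times".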

References

- Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8139–8148, 2019.
- Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. arXiv e-prints, art. arXiv:1812.11118, December 2018.
- James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs. 2018. URL http://github.com/google/jax.
- Ethan Dyer and Guy Gur-Ari. Asymptotics of wide networks from Feynman diagrams. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1gFvANKDS.
- Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2(2):023401, February 2020. doi: 10.1088/1742-5468/ab633c.
- Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. arXiv e-prints, art. arXiv:1901.10159, January 2019.
- Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient Descent Happens in a Tiny Subspace. arXiv e-prints, art. arXiv:1812.04754, December 2018.
- G. E. Hinton. Learning distributed representations of concepts. Proc. of Eighth Annual Conference of the Cognitive Science Society, 1986.
- Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks, 2018.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
- Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8571–8580. Curran Associates, Inc., 2018.
- Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism, 2020.
- Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning, 2019.
- Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. arXiv preprint arXiv:2010.02916, 2020.
- Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring Generalization in Deep Learning. arXiv e-prints, art. arXiv:1706.08947, June 2017.
- Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey
