# Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks

NeurIPS 2020

Abstract

Despite its success in a wide range of applications, characterizing the generalization properties of stochastic gradient descent (SGD) in non-convex deep learning problems is still an important challenge. While modeling the trajectories of SGD via stochastic differential equations (SDE) under heavy-tailed gradient noise has recently shed …

Introduction

- Many important tasks in deep learning can be represented by the following optimization problem: min_{w ∈ ℝ^d} f(w) := (1/n) ∑_{i=1}^n f^{(i)}(w), (1) where w ∈ ℝ^d denotes the network weights, n the number of training data points, f a non-convex cost function, and f^{(i)} the cost incurred by a single data point.
- Given an initial point w_0, the SGD algorithm is based on the following recursion: w_{k+1} = w_k − η ∇f̃_k(w_k), where ∇f̃_k(w) := (1/B) ∑_{i ∈ Ω_k} ∇f^{(i)}(w) is the stochastic gradient computed on a random minibatch Ω_k of size B.
- In contrast to the convex optimization setting, where the behavior of SGD is fairly well understood, the generalization properties of SGD in non-convex deep learning problems are an active area of research [PBL19, AZL19, AZLL19].
- There has been considerable progress around this topic, where several generalization bounds have been proven.
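The SGD recursion above can be sketched in a few lines. The quadratic toy objective, step size, and batch size below are illustrative choices for the sketch, not the paper's experimental setup.

```python
import numpy as np

def sgd(grad_fns, w0, eta=0.1, batch_size=2, n_steps=500, seed=0):
    """Minibatch SGD: w_{k+1} = w_k - eta * (1/B) * sum_{i in Omega_k} grad f^(i)(w_k)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    n = len(grad_fns)
    for _ in range(n_steps):
        batch = rng.choice(n, size=batch_size, replace=False)  # minibatch Omega_k
        g = sum(grad_fns[i](w) for i in batch) / batch_size    # stochastic gradient
        w = w - eta * g
    return w

# Toy problem: f^(i)(w) = 0.5 * ||w - x_i||^2, so the minimizer of f is the mean of the x_i.
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
      np.array([1.0, 1.0]), np.array([0.0, 0.0])]
grads = [lambda w, x=x: w - x for x in xs]  # gradient of each f^(i)
w_final = sgd(grads, w0=np.zeros(2))
```

With a constant step size the iterates do not converge exactly but fluctuate around the minimizer [0.5, 0.5]; this persistent gradient noise is precisely the object whose tail behavior the paper studies.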

Highlights

- Many important tasks in deep learning can be represented by the following optimization problem: min_{w ∈ ℝ^d} f(w) := (1/n) ∑_{i=1}^n f^{(i)}(w), (1) where w ∈ ℝ^d denotes the network weights, n denotes the number of training data points, f denotes a non-convex cost function, and f^{(i)} denotes the cost incurred by a single data point
- By using tools from geometric measure theory, we prove that the generalization error can be controlled by the Hausdorff dimension of the process, which can be significantly smaller than the standard Euclidean dimension
- Generalization bounds via Hausdorff dimension. This part provides the main contribution of this paper, where we show that the generalization error of a training algorithm can be controlled by the Hausdorff dimension of its trajectories
- We rigorously tied the generalization in a learning task to the tail properties of the underlying training algorithm, shedding light on an empirically observed phenomenon. We established this relationship through the Hausdorff dimension of the stochastic differential equations (SDE) approximating the algorithm, and proved a generalization error bound based on this notion of complexity
- Unlike the standard ambient dimension, our bounds do not necessarily grow with the number of parameters in the network; they depend solely on the tail behavior of the training process, providing an explanation for the implicit regularization effect of heavy-tailed SGD
- Our results suggest that the fractal structure and the fractal dimensions of deep learning models can be an accurate metric for the generalization error; in a broader context, we believe that our theory would be useful for practitioners using deep learning tools
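To make the heavy-tailed regime discussed above concrete, the sketch below draws symmetric α-stable samples via the standard Chambers–Mallows–Stuck transform and compares their tail mass against a Gaussian. The choice α = 1.5 and the threshold 5 are arbitrary illustration values, not quantities from the paper.

```python
import numpy as np

def sas_sample(alpha, size, rng):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method.
    alpha = 2 recovers a (scaled) Gaussian; alpha < 2 gives heavy power-law tails."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # standard exponential
    if alpha == 1.0:
        return np.tan(u)  # Cauchy special case
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))

rng = np.random.default_rng(0)
gauss = rng.normal(size=100_000)
stable = sas_sample(1.5, 100_000, rng)

# Heavy tails: large deviations are vastly more likely under alpha = 1.5
# (P(|X| > x) decays like x^{-alpha} instead of exp(-x^2/2)).
p_gauss = np.mean(np.abs(gauss) > 5)
p_stable = np.mean(np.abs(stable) > 5)
```

For a standard Gaussian, |X| > 5 essentially never occurs in 10^5 samples, while the α = 1.5 stable law places a few percent of its mass there; it is this qualitative difference in the SGD noise that drives the Hausdorff-dimension bounds.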

Methods

- The authors empirically study the generalization behavior of deep neural networks from the Hausdorff dimension perspective.
- The authors use VGG networks [SZ15] as they perform well in practice, and their depth can be controlled directly.
- The authors vary the number of layers from D = 4 to D = 19, resulting in the number of parameters d between 1.3M and 20M.
- The authors provide full range of parameters and additional implementation details in the supplementary document.
- The code can be found at https://github.com/umutsimsekli/Hausdorff-Dimension-and-Generalization
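Tail behavior of the kind studied here is typically quantified with a tail-index estimator. The sketch below applies the classical Hill estimator to synthetic Pareto data as a stand-in; it is a generic illustration, not necessarily the multivariate estimator of [MMO15] that such experiments rely on.

```python
import numpy as np

def hill_tail_index(x, k):
    """Hill estimator of the tail index alpha from the k largest values of |x|.
    Heavier tails yield a smaller estimated alpha."""
    order = np.sort(np.abs(x))[::-1]        # descending order statistics
    logs = np.log(order[:k + 1])
    return 1.0 / (np.mean(logs[:k]) - logs[k])

rng = np.random.default_rng(1)
# Pareto(alpha = 1.5) sample: a heavy-tailed proxy for gradient-noise norms.
alpha_true = 1.5
sample = (1.0 / rng.uniform(size=50_000)) ** (1.0 / alpha_true)
alpha_hat = hill_tail_index(sample, k=2_000)
```

On exact Pareto data the estimate recovers α ≈ 1.5; on SGD iterates one would apply the same idea to gradient-noise norms, with the caveat that the choice of k trades bias against variance.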

Conclusion

- The authors rigorously tied the generalization in a learning task to the tail properties of the underlying training algorithm, shedding light on an empirically observed phenomenon.
- The authors established this relationship through the Hausdorff dimension of the SDE approximating the algorithm, and proved a generalization error bound based on this notion of complexity.
- The authors' work does not have a direct ethical or societal consequence due to its theoretical nature

Summary

## Objectives:

The authors aim to take a first step in this direction and prove novel generalization bounds in the case where the trajectories of the optimization algorithm can be well-approximated by a Feller process [Sch16]; Feller processes form a broad class of Markov processes that includes many important stochastic processes as special cases.

Funding

- The contribution of U.S. to this work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX (ANR-16-CE23-0014) project

References

- [AAV18] Amir Asadi, Emmanuel Abbe, and Sergio Verdú. Chaining mutual information and tightening generalization bounds. In Advances in Neural Information Processing Systems, pages 7234–7243, 2018.
- [AB99] Siu-Kui Au and James L Beck. A new adaptive importance sampling scheme for reliability calculations. Structural Safety, 21(2):135–158, 1999.
- [AB09] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
- [Ass83] Patrick Assouad. Densité et dimension. In Annales de l’Institut Fourier, volume 33, pages 233–282, 1983.
- [AZL19] Zeyuan Allen-Zhu and Yuanzhi Li. Can SGD learn recurrent neural networks with provable generalization? In Advances in Neural Information Processing Systems, pages 10331–10341, 2019.
- [AZLL19] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems, pages 6155–6166, 2019.
- [BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2(Mar), 2002.
- [BG60] Robert M Blumenthal and Ronald K Getoor. Some theorems on stable processes. Transactions of the American Mathematical Society, 95(2):263–273, 1960.
- [BJSG+18] Marco Baity-Jesi, Levent Sagun, Mario Geiger, Stefano Spigler, Gerard Ben Arous, Chiara Cammarota, Yann LeCun, Matthieu Wyart, and Giulio Biroli. Comparing dynamics: Deep neural networks versus glassy systems. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 314–323, 10–15 Jul 2018.
- [BP17] Christopher J Bishop and Yuval Peres. Fractals in probability and analysis. Cambridge University Press, 2017.
- [Bra83] Richard C Bradley. On the ψ-mixing condition for stationary random sequences. Transactions of the American Mathematical Society, 276(1):55–66, 1983.
- [BSW13] B. Böttcher, R. Schilling, and J. Wang. Lévy Matters III. Lévy-type processes: construction, approximation and sample path properties. Lecture Notes in Mathematics, 2099, 2013.
- [Cou65] Philippe Courrège. Sur la forme intégro-différentielle des opérateurs de C_k^∞ dans C satisfaisant au principe du maximum. Séminaire Brelot-Choquet-Deny. Théorie du Potentiel, 10(1):1–38, 1965.
- [CS18] P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In ICLR, 2018.
- [Dal17] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society, 79(3):651–676, 2017.
- [DDB19] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and markov chains. The Annals of Statistics (to appear), 2019.
- [DR17] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
- [DSD19] Nadav Dym, Barak Sober, and Ingrid Daubechies. Expression of fractals through neural network functions. arXiv preprint:1905.11345, 2019.
- [EH20] Murat A Erdogdu and Rasa Hosseinzadeh. On the convergence of Langevin Monte Carlo: The interplay between tail growth and smoothness. arXiv preprint arXiv:2005.13097, 2020.
- [EMG90] G. A. Edgar. Measure, topology, and fractal geometry. Undergraduate Texts in Mathematics, 1990.
- [EMS18] Murat A Erdogdu, Lester Mackey, and Ohad Shamir. Global non-convex optimization with discretized diffusions. In Advances in Neural Information Processing Systems, pages 9671–9680, 2018.
- [Fal04] Kenneth Falconer. Fractal geometry: mathematical foundations and applications. John Wiley & Sons, 2004.
- [FFP20] Stefano Favaro, Sandra Fortini, and Stefano Peluchetti. Stable behaviour of infinitely wide deep neural networks. In AISTATS, 2020.
- [GSZ20] Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in SGD. arXiv preprint arXiv:2006.04740, 2020.
- [HDS18] Qiao Huang, Jinqiao Duan, and Renming Song. Homogenization of stable-like Feller processes. arXiv preprint:1812.11624, 2018.
- [Hen73] WJ Hendricks. A dimension theorem for sample functions of processes with stable components. The Annals of Probability, pages 849–853, 1973.
- [HLLL17] W. Hu, C. J. Li, L. Li, and J.-G. Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint:1705.07562, 2017.
- [HM20] Liam Hodgkinson and Michael W Mahoney. Multiplicative noise and heavy tails in stochastic optimization. arXiv preprint arXiv:2006.06293, 2020.
- [HS97] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
- [IP06] Peter Imkeller and Ilya Pavlyukevich. First exit times of sdes driven by stable lévy processes. Stochastic Processes and their Applications, 116(4):611–642, 2006.
- [Jac02] Niels Jacob. Pseudo Differential Operators And Markov Processes: Volume II: Generators and Their Potential Theory. World Scientific, 2002.
- [JKA+17] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv preprint:1711.04623, 2017.
- [KH09] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
- [KL17] Ilja Kuzborskij and Christoph H Lampert. Data-dependent stability of stochastic gradient descent. arXiv preprint arXiv:1703.01678, 2017.
- [Lév37] P. Lévy. Théorie de l’addition des variables aléatoires. Gauthiers-Villars, Paris, 1937.
- [LG19] Ronan Le Guével. The hausdorff dimension of the range of the lévy multistable processes. Journal of Theoretical Probability, 32(2):765–780, 2019.
- [Lon17] Ben London. A pac-bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2931–2940, 2017.
- [LY19] József Lorinczi and Xiaochuan Yang. Multifractal properties of sample paths of ground state-transformed jump processes. Chaos, Solitons & Fractals, 120:83–94, 2019.
- [Mat99] Pertti Mattila. Geometry of sets and measures in Euclidean spaces: fractals and rectifiability. Cambridge University Press, 1999.
- [MBM16] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses. arXiv preprint arXiv:1607.06534, 2016.
- [MHB16] S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. In ICML, 2016.
- [MM19] Charles H Martin and Michael W Mahoney. Traditional and heavy-tailed self regularization in neural network models. In ICML, 2019.
- [MMO15] Mohammad Mohammadi, Adel Mohammadpour, and Hiroaki Ogata. On estimating the tail index and the spectral measure of multivariate α-stable distributions. Metrika, 78(5):549–561, 2015.
- [MSS19] Eran Malach and Shai Shalev-Shwartz. Is deeper better only when shallow is good? In NeurIPS, 2019.
- [MWZZ17] Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of sgld for non-convex learning: Two theoretical viewpoints. arXiv preprint arXiv:1707.05947, 2017.
- [MX05] Mark M Meerschaert and Yimin Xiao. Dimension results for sample paths of operator stable Lévy processes. Stochastic processes and their applications, 115(1):55–75, 2005.
- [NBMS17] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In NeurIPS, 2017.
- [NHD+19] Jeffrey Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, and Daniel M Roy. Information-theoretic generalization bounds for sgld via data-dependent estimates. In Advances in Neural Information Processing Systems, pages 11015–11025, 2019.
- [NSGR19] Thanh Huy Nguyen, Umut Simsekli, Mert Gürbüzbalaban, and Gaël Richard. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. In NeurIPS, 2019.
- [NTS15] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
- [Pav07] Ilya Pavlyukevich. Cooling down lévy flights. Journal of Physics A: Mathematical and Theoretical, 40(41):12299, 2007.
- [PBL19] Tomaso Poggio, Andrzej Banburski, and Qianli Liao. Theoretical issues in deep networks: Approximation, optimization and generalization. arXiv preprint arXiv:1908.09375, 2019.
- [PSGN19] Abhishek Panigrahi, Raghav Somani, Navin Goyal, and Praneeth Netrapalli. NonGaussianity of stochastic gradient noise. arXiv preprint:1910.09626, 2019.
- [RRT17] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint:1702.03849, 2017.
- [RZ19] Daniel Russo and James Zou. How much does your data exploration overfit? controlling bias via information usage. Transactions on Information Theory, 66(1):302–323, 2019.
- [Sat99] Ken-iti Sato. Lévy processes and infinitely divisible distributions. Cambridge University Press, 1999.
- [Sch98] René L Schilling. Feller processes generated by pseudo-differential operators: On the Hausdorff dimension of their sample paths. Journal of Theoretical Probability, 11(2):303–330, 1998.
- [Sch16] René L. Schilling. An introduction to lévy and feller processes. In D. Khoshnevisan and R. Schilling, editors, Lévy-type processes to parabolic SPDEs. Birkhäuser, Cham, 2016.
- [SGN+19] Umut Simsekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël Richard, and Levent Sagun. On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv preprint arXiv:1912.00018, 2019.
- [SHTY13] Mahito Sugiyama, Eiju Hirowatari, Hideki Tsuiki, and Akihiro Yamamoto. Learning figures with the hausdorff metric by fractals—towards computable binary classification. Machine learning, 90(1):91–126, 2013.
- [SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
- [SSG19] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In ICML, 2019.
- [ST94] G. Samorodnitsky and M. S. Taqqu. Stable non-Gaussian random processes: stochastic models with infinite variance, volume 1. CRC press, 1994.
- [SZ15] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, ICLR, 2015.
- [Xia03] Yimin Xiao. Random fractals and markov processes. Mathematics Preprint Archive, 2003(6):830–907, 2003.
- [XR17] Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In NeurIPS, 2017.
- [XZ20] Longjie Xie and Xicheng Zhang. Ergodicity of stochastic differential equations with jumps and singular coefficients. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 56, pages 175–229. Institut Henri Poincaré, 2020.
- [Yan18] Xiaochuan Yang. Multifractality of jump diffusion processes. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 54, pages 2042–2074. Institut Henri Poincaré, 2018.
- [ZKV+19] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, and Suvrit Sra. Why ADAM beats SGD for attention models. arXiv preprint:1912.03194, 2019.
- [ZLZ19] Yi Zhou, Yingbin Liang, and Huishuai Zhang. Understanding generalization error of sgd in nonconvex optimization. stat, 1050:7, 2019.
- [ZWY+19] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In ICML, 2019.
