
Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks

NeurIPS 2020


Abstract

Despite its success in a wide range of applications, characterizing the generalization properties of stochastic gradient descent (SGD) in non-convex deep learning problems is still an important challenge. While modeling the trajectories of SGD via stochastic differential equations (SDE) under heavy-tailed gradient noise has recently shed […]

Code: https://github.com/umutsimsekli/Hausdorff-Dimension-and-Generalization
Introduction
  • Many important tasks in deep learning can be represented by the following optimization problem: min_{w ∈ R^d} f(w) := (1/n) Σ_{i=1}^{n} f^{(i)}(w).  (1)
  • Given an initial point w_0, the SGD algorithm is based on the following recursion: w_{k+1} = w_k − η ∇f_k(w_k), where ∇f_k(w) := (1/B) Σ_{i ∈ Ω_k} ∇f^{(i)}(w) is the stochastic gradient computed over a random mini-batch Ω_k of size B (see the sketch after this list).
  • In contrast to the convex optimization setting, where the behavior of SGD is fairly well understood, the generalization properties of SGD in non-convex deep learning problems are an active area of research [PBL19, AZL19, AZLL19].
  • There has been considerable progress on this topic, and several generalization bounds have been proven in the recent literature.
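The recursion above is easy to make concrete. Below is a minimal, self-contained sketch of the mini-batch SGD update from Eq. (1); the quadratic per-sample loss, the function names (sgd, grad_fn), and the batch size are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sgd(grad_fn, w0, n, eta=0.1, batch_size=32, n_iters=1000, seed=0):
    """SGD recursion w_{k+1} = w_k - eta * grad_f_k(w_k), where grad_f_k
    averages per-sample gradients over a random mini-batch of size B."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(n_iters):
        batch = rng.choice(n, size=batch_size, replace=False)   # mini-batch Omega_k
        g = np.mean([grad_fn(i, w) for i in batch], axis=0)     # (1/B) * sum of per-sample grads
        w = w - eta * g
    return w

# Toy example: least-squares loss f^{(i)}(w) = 0.5 * (x_i^T w - y_i)^2.
rng = np.random.default_rng(1)
n, d = 256, 10
X, w_true = rng.normal(size=(n, d)), rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)
grad_fn = lambda i, w: (X[i] @ w - y[i]) * X[i]   # gradient of a single data point's cost
w_hat = sgd(grad_fn, w0=np.zeros(d), n=n)
```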
Highlights
  • Many important tasks in deep learning can be represented by the following optimization problem: min_{w ∈ R^d} f(w) := (1/n) Σ_{i=1}^{n} f^{(i)}(w),  (1) where w ∈ R^d denotes the network weights, n denotes the number of training data points, f denotes a non-convex cost function, and f^{(i)} denotes the cost incurred by a single data point
  • (ii) By using tools from geometric measure theory, we prove that the generalization error can be controlled by the Hausdorff dimension of the process, which can be significantly smaller than the standard Euclidean dimension
  • Generalization bounds via Hausdorff dimension. This part provides the main contribution of this paper, where we show that the generalization error of a training algorithm can be controlled by the Hausdorff dimension of its trajectories
  • We rigorously tied the generalization in a learning task to the tail properties of the underlying training algorithm, shedding light on an empirically observed phenomenon. We established this relationship through the Hausdorff dimension of the stochastic differential equations (SDE) approximating the algorithm, and proved a generalization error bound based on this notion of complexity
  • Unlike the standard ambient dimension, our bounds do not necessarily grow with the number of parameters in the network, and they solely depend on the tail behavior of the training process, providing an explanation for the implicit regularization effect of heavy-tailed stochastic gradient descent (SGD); see the tail-index sketch after this list
  • Our results suggest that the fractal structure and the fractal dimensions of deep learning models can be an accurate metric for the generalization error; in a broader context, we believe that our theory would be useful for practitioners using deep learning tools
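One way to probe the heavy-tail claim empirically is to estimate a tail index from gradient-noise samples. The sketch below implements a block-based moment estimator in the spirit of [MMO15], as popularized in the heavy-tailed SGD literature (e.g. [SSG19]), for the tail index α of approximately symmetric α-stable data; the block sizes and the sanity-check data are illustrative assumptions.

```python
import numpy as np

def alpha_hat(x, k1, k2):
    """Block estimator for the tail index alpha of (approximately) symmetric
    alpha-stable samples: 1/alpha ~ (E log|block sum| - E log|x|) / log(k2)."""
    x = np.asarray(x, dtype=float)[: k1 * k2]
    block_sums = x.reshape(k1, k2).sum(axis=1)        # Y_i = sum over a block of k2 samples
    inv_alpha = (np.mean(np.log(np.abs(block_sums)))
                 - np.mean(np.log(np.abs(x)))) / np.log(k2)
    return 1.0 / inv_alpha

# Sanity check: Gaussian data corresponds to alpha = 2, Cauchy data to alpha = 1.
rng = np.random.default_rng(0)
print(alpha_hat(rng.normal(size=100_000), k1=1000, k2=100))           # ~2.0
print(alpha_hat(rng.standard_cauchy(size=100_000), k1=1000, k2=100))  # ~1.0
```

Estimates noticeably below 2 indicate heavier-than-Gaussian tails; in practice one would feed centered stochastic-gradient-noise values, flattened across coordinates.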
Methods
  • The authors empirically study the generalization behavior of deep neural networks from the Hausdorff dimension perspective.
  • The authors use VGG networks [SZ15] as they perform well in practice, and their depth can be controlled directly.
  • The authors vary the number of layers from D = 4 to D = 19, resulting in parameter counts d between 1.3M and 20M (see the parameter-count sketch after this list).
  • The authors provide the full range of parameters and additional implementation details in the supplementary document.
  • The code can be found at https://github.com/umutsimsekli/Hausdorff-Dimension-and-Generalization
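The depth sweep can be outlined schematically. The sketch below builds CIFAR-style VGG feature stacks at several standard depths and counts their parameters; the configurations, the small classifier head, and the 32×32 input assumption are illustrative choices, not the authors' released code (see the repository linked above).

```python
import torch.nn as nn

# Standard VGG configurations: numbers are conv output channels, "M" is a 2x2 max-pool.
CFGS = {
    11: [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    13: [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    16: [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512, "M", 512, 512, 512, "M"],
    19: [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M", 512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}

def make_vgg(depth, num_classes=10):
    """Build a CIFAR-style VGG: conv/pool feature stack plus a small linear head."""
    layers, in_ch = [], 3
    for v in CFGS[depth]:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            in_ch = v
    # After five max-pools a 32x32 input is reduced to 1x1x512.
    return nn.Sequential(*layers, nn.Flatten(), nn.Linear(512, num_classes))

for depth in sorted(CFGS):
    n_params = sum(p.numel() for p in make_vgg(depth).parameters())
    print(f"VGG-{depth}: {n_params / 1e6:.1f}M parameters")
```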
Conclusion
  • The authors rigorously tied the generalization in a learning task to the tail properties of the underlying training algorithm, shedding light on an empirically observed phenomenon.
  • The authors established this relationship through the Hausdorff dimension of the SDE approximating the algorithm, and proved a generalization error bound based on this notion of complexity.
  • The authors' work does not have a direct ethical or societal consequence due to its theoretical nature
Objectives
  • The authors aim to take a first step in this direction and prove novel generalization bounds in the case where the trajectories of the optimization algorithm can be well approximated by a Feller process [Sch16], a broad class of Markov processes that includes many important stochastic processes as special cases (a minimal simulation sketch follows below).
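For intuition on the kind of process involved, the following is a minimal sketch (not the authors' construction) of an Euler discretization of a Lévy-driven SDE dw_t = −∇f(w_t) dt + σ dL_t^α, where L^α is a symmetric α-stable Lévy process; the double-well drift, the values of α, η, σ, and the Chambers–Mallows–Stuck sampler are illustrative assumptions.

```python
import numpy as np

def sas_rvs(alpha, size, rng):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method (alpha != 1)."""
    v = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * v) / np.cos(v) ** (1.0 / alpha)
            * (np.cos(v - alpha * v) / w) ** ((1.0 - alpha) / alpha))

def euler_levy_sde(grad_f, w0, alpha=1.8, eta=1e-3, sigma=0.1, n_steps=10_000, seed=0):
    """Euler scheme for dw_t = -grad f(w_t) dt + sigma dL_t^alpha,
    where L^alpha is symmetric alpha-stable (increments scale as eta^(1/alpha))."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    traj = np.empty((n_steps + 1, w.size))
    traj[0] = w
    for k in range(n_steps):
        w = w - eta * grad_f(w) + sigma * eta ** (1.0 / alpha) * sas_rvs(alpha, w.size, rng)
        traj[k + 1] = w
    return traj

# Toy separable double-well potential f(w) = sum((w_j^2 - 1)^2).
grad_f = lambda w: 4.0 * w * (w ** 2 - 1.0)
traj = euler_levy_sde(grad_f, w0=[0.5, -0.5], alpha=1.8)
```

For α < 2 the increments are heavy-tailed, so the trajectory exhibits occasional large jumps, in contrast to the Brownian (α = 2) case.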
Funding
  • The contribution of U.S. to this work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX (ANR-16-CE23-0014) project
References
  • [AAV18] Amir Asadi, Emmanuel Abbe, and Sergio Verdú. Chaining mutual information and tightening generalization bounds. In Advances in Neural Information Processing Systems, pages 7234–7243, 2018.
  • [AB99] Siu-Kui Au and James L Beck. A new adaptive importance sampling scheme for reliability calculations. Structural Safety, 21(2):135–158, 1999.
  • [AB09] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. cambridge university press, 2009.
  • [Ass83] Patrick Assouad. Densité et dimension. In Annales de l’Institut Fourier, volume 33, pages 233–282, 1983.
  • [AZL19] Zeyuan Allen-Zhu and Yuanzhi Li. Can sgd learn recurrent neural networks with provable generalization? In Advances in Neural Information Processing Systems, pages 10331–10341, 2019.
  • [AZLL19] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems, pages 6155–6166, 2019.
  • [BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2(Mar), 2002.
  • [BG60] Robert M Blumenthal and Ronald K Getoor. Some theorems on stable processes. Transactions of the American Mathematical Society, 95(2):263–273, 1960.
  • [BJSG+18] Marco Baity-Jesi, Levent Sagun, Mario Geiger, Stefano Spigler, Gerard Ben Arous, Chiara Cammarota, Yann LeCun, Matthieu Wyart, and Giulio Biroli. Comparing dynamics: Deep neural networks versus glassy systems. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 314–323, 10–15 Jul 2018.
  • [BP17] Christopher J Bishop and Yuval Peres. Fractals in probability and analysis. Cambridge University Press, 2017.
  • [Bra83] Richard C Bradley. On the ψ-mixing condition for stationary random sequences. Transactions of the American Mathematical Society, 276(1):55–66, 1983.
  • [BSW13] B. Böttcher, R. Schilling, and J. Wang. Lévy Matters III. Lévy-type processes: construction, approximation and sample path properties. Lecture Notes in Mathematics, 2099, 2013.
  • [Cou65] Philippe Courrege. Sur la forme intégro-différentielle des opérateurs de c∞ k dans c satisfaisant au principe du maximum. Séminaire Brelot-Choquet-Deny. Théorie du Potentiel, 10(1):1–38, 1965.
  • [CS18] P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In ICLR, 2018.
  • [Dal17] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society, 79(3):651–676, 2017.
  • [DDB19] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and markov chains. The Annals of Statistics (to appear), 2019.
  • [DR17] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
  • [DSD19] Nadav Dym, Barak Sober, and Ingrid Daubechies. Expression of fractals through neural network functions. arXiv preprint:1905.11345, 2019.
  • [EH20] Murat A Erdogdu and Rasa Hosseinzadeh. On the convergence of langevin monte carlo: The interplay between tail growth and smoothness. arXiv preprint arXiv:2005.13097, 2020.
  • [EMG90] G. A. Edgar. Measure, Topology, and Fractal Geometry. Undergraduate Texts in Mathematics. Springer, 1990.
  • [EMS18] Murat A Erdogdu, Lester Mackey, and Ohad Shamir. Global non-convex optimization with discretized diffusions. In Advances in Neural Information Processing Systems, pages 9671–9680, 2018.
  • [Fal04] Kenneth Falconer. Fractal geometry: mathematical foundations and applications. John Wiley & Sons, 2004.
  • [FFP20] Stefano Favaro, Sandra Fortini, and Stefano Peluchetti. Stable behaviour of infinitely wide deep neural networks. In AISTATS, 2020.
  • [GSZ20] Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in sgd. arXiv preprint arXiv:2006.04740, 2020.
  • [HDS18] Qiao Huang, Jinqiao Duan, and Renming Song. Homogenization of stable-like feller processes. arXiv preprint:1812.11624, 2018.
  • [Hen73] WJ Hendricks. A dimension theorem for sample functions of processes with stable components. The Annals of Probability, pages 849–853, 1973.
  • [HLLL17] W. Hu, C. J. Li, L. Li, and J.-G. Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint:1705.07562, 2017.
  • [HM20] Liam Hodgkinson and Michael W Mahoney. Multiplicative noise and heavy tails in stochastic optimization. arXiv preprint arXiv:2006.06293, 2020.
  • [HS97] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
  • [IP06] Peter Imkeller and Ilya Pavlyukevich. First exit times of sdes driven by stable lévy processes. Stochastic Processes and their Applications, 116(4):611–642, 2006.
  • [Jac02] Niels Jacob. Pseudo Differential Operators And Markov Processes: Volume II: Generators and Their Potential Theory. World Scientific, 2002.
  • [JKA+17] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv preprint:1711.04623, 2017.
  • [KH09] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
  • [KL17] Ilja Kuzborskij and Christoph H Lampert. Data-dependent stability of stochastic gradient descent. arXiv preprint arXiv:1703.01678, 2017.
  • [Lév37] P. Lévy. Théorie de l’addition des variables aléatoires. Gauthiers-Villars, Paris, 1937.
  • [LG19] Ronan Le Guével. The hausdorff dimension of the range of the lévy multistable processes. Journal of Theoretical Probability, 32(2):765–780, 2019.
  • [Lon17] Ben London. A pac-bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2931–2940, 2017.
  • [LY19] József Lorinczi and Xiaochuan Yang. Multifractal properties of sample paths of ground state-transformed jump processes. Chaos, Solitons & Fractals, 120:83–94, 2019.
  • [Mat99] Pertti Mattila. Geometry of sets and measures in Euclidean spaces: fractals and rectifiability. Cambridge university press, 1999.
  • [MBM16] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses. arXiv preprint arXiv:1607.06534, 2016.
  • [MHB16] S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. In ICML, 2016.
  • [MM19] Charles H Martin and Michael W Mahoney. Traditional and heavy-tailed self regularization in neural network models. In ICML, 2019.
  • [MMO15] Mohammad Mohammadi, Adel Mohammadpour, and Hiroaki Ogata. On estimating the tail index and the spectral measure of multivariate α-stable distributions. Metrika, 78(5):549–561, 2015.
  • [MSS19] Eran Malach and Shai Shalev-Shwartz. Is deeper better only when shallow is good? In NeurIPS, 2019.
  • [MWZZ17] Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of sgld for non-convex learning: Two theoretical viewpoints. arXiv preprint arXiv:1707.05947, 2017.
  • [MX05] Mark M Meerschaert and Yimin Xiao. Dimension results for sample paths of operator stable Lévy processes. Stochastic processes and their applications, 115(1):55–75, 2005.
  • [NBMS17] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In NeurIPS, 2017.
  • [NHD+19] Jeffrey Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, and Daniel M Roy. Information-theoretic generalization bounds for sgld via data-dependent estimates. In Advances in Neural Information Processing Systems, pages 11015–11025, 2019.
  • [NSGR19] Thanh Huy Nguyen, Umut Simsekli, Mert Gürbüzbalaban, and Gaël Richard. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. In NeurIPS, 2019.
  • [NTS15] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
  • [Pav07] Ilya Pavlyukevich. Cooling down lévy flights. Journal of Physics A: Mathematical and Theoretical, 40(41):12299, 2007.
  • [PBL19] Tomaso Poggio, Andrzej Banburski, and Qianli Liao. Theoretical issues in deep networks: Approximation, optimization and generalization. arXiv preprint arXiv:1908.09375, 2019.
  • [PSGN19] Abhishek Panigrahi, Raghav Somani, Navin Goyal, and Praneeth Netrapalli. Non-Gaussianity of stochastic gradient noise. arXiv preprint:1910.09626, 2019.
  • [RRT17] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. arXiv preprint:1702.03849, 2017.
  • [RZ19] Daniel Russo and James Zou. How much does your data exploration overfit? controlling bias via information usage. Transactions on Information Theory, 66(1):302–323, 2019.
  • [Sat99] Ken-iti Sato. Lévy processes and infinitely divisible distributions. Cambridge university press, 1999.
  • [Sch98] René L Schilling. Feller processes generated by pseudo-differential operators: On the hausdorff dimension of their sample paths. Journal of Theoretical Probability, 11(2):303–330, 1998.
  • [Sch16] René L. Schilling. An introduction to lévy and feller processes. In D. Khoshnevisan and R. Schilling, editors, Lévy-type processes to parabolic SPDEs. Birkhäuser, Cham, 2016.
  • [SGN+19] Umut Simsekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël Richard, and Levent Sagun. On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv preprint arXiv:1912.00018, 2019.
  • [SHTY13] Mahito Sugiyama, Eiju Hirowatari, Hideki Tsuiki, and Akihiro Yamamoto. Learning figures with the hausdorff metric by fractals—towards computable binary classification. Machine learning, 90(1):91–126, 2013.
  • [SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • [SSG19] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In ICML, 2019.
  • [ST94] G. Samorodnitsky and M. S. Taqqu. Stable non-Gaussian random processes: stochastic models with infinite variance, volume 1. CRC press, 1994.
  • [SZ15] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, ICLR, 2015.
  • [Xia03] Yimin Xiao. Random fractals and markov processes. Mathematics Preprint Archive, 2003(6):830–907, 2003.
  • [XR17] Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In NeurIPS, 2017.
  • [XZ20] Longjie Xie and Xicheng Zhang. Ergodicity of stochastic differential equations with jumps and singular coefficients. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 56, pages 175–229. Institut Henri Poincaré, 2020.
  • [Yan18] Xiaochuan Yang. Multifractality of jump diffusion processes. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 54, pages 2042–2074. Institut Henri Poincaré, 2018.
  • [ZKV+19] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, and Suvrit Sra. Why ADAM beats SGD for attention models. arXiv preprint:1912.03194, 2019.
  • [ZLZ19] Yi Zhou, Yingbin Liang, and Huishuai Zhang. Understanding generalization error of sgd in nonconvex optimization. stat, 1050:7, 2019.
  • [ZWY+19] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In ICML, 2019.
Authors
Umut Simsekli
Ozan Sener
George Deligiannidis
Murat Erdogdu