Deep Neural Network Approximation Theory

Dmytro Perekrestenko
Dennis Elbrächter

arXiv:1901.02220 [cs.LG], 2019.


Abstract:

Deep neural networks have become state-of-the-art technology for a wide range of practical machine learning tasks such as image classification, handwritten digit recognition, speech recognition, or game intelligence. This paper develops the fundamental limits of learning in deep neural networks by characterizing what is possible if no constraints are imposed on the learning algorithm and on the amount of training data.

Introduction
  • Triggered by the availability of vast amounts of training data and drastic improvements in computing power, deep neural networks have become state-of-the-art technology for a wide range of practical machine learning tasks such as image classification [1], handwritten digit recognition [2], speech recognition [3], or game intelligence [4].
  • There exist a constant C > 0 and a polynomial π such that for all D ∈ ℝ+, f ∈ S_D, and ε ∈ (0, 1/2), there is a network Ψ_{f,ε} ∈ NN_{∞,∞,1,1} satisfying L(Ψ_{f,ε}) ≤ C·D·(log(ε^{-1}))^2, W(Ψ_{f,ε}) ≤ 23, B(Ψ_{f,ε}) ≤ max{1/D, D}·π(ε^{-1}), and ‖f − Ψ_{f,ε}‖_{L^∞} ≤ ε (a numerical sketch of this depth-accuracy tradeoff follows below).
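The depth bound above grows polylogarithmically in 1/ε at constant width; the prototypical mechanism behind bounds of this type is the composition of simple "sawtooth" ReLU blocks. The Python sketch below is illustrative only: it assumes the classical sawtooth-based approximation of x^2 on [0, 1] in the spirit of [18], not the paper's exact network, and shows the sup-norm error decaying like 4^(-(m+1)) as the number m of composed layers grows while the width stays fixed.

```python
# A minimal, illustrative sketch (an assumption, not the paper's exact construction):
# the classical sawtooth-based ReLU approximation of x^2 on [0, 1], in the spirit of [18].
# Composing the "tooth" g adds one hidden layer per step and quarters the error,
# which is the mechanism behind depth ~ log(1/eps) at constant width.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tooth(x):
    # One width-3 ReLU layer realizing the hat function g on [0, 1].
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def square_approx(x, m):
    # f_m(x) = x - sum_{s=1}^m g_s(x) / 4^s, with g_s the s-fold composition of g.
    # f_m is the piecewise-linear interpolant of x^2 on a grid of spacing 2^-m,
    # so its sup error is exactly 4^-(m+1).
    out = np.array(x, dtype=float)
    g = np.array(x, dtype=float)
    for s in range(1, m + 1):
        g = tooth(g)
        out = out - g / 4.0**s
    return out

x = np.linspace(0.0, 1.0, 2**12 + 1)  # dyadic grid so the worst-case points are sampled
for m in range(1, 8):
    err = np.max(np.abs(square_approx(x, m) - x**2))
    print(f"m = {m} composed layers: sup error = {err:.3e}  (4^-(m+1) = {4.0**-(m+1):.3e})")
```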
Highlights
  • Triggered by the availability of vast amounts of training data and drastic improvements in computing power, deep neural networks have become state-of-the-art technology for a wide range of practical machine learning tasks such as image classification [1], handwritten digit recognition [2], speech recognition [3], or game intelligence [4]
  • It is natural to ask how the complexity of a neural network approximating every function in a given class C to within a prescribed accuracy depends on the complexity of C
  • The purpose of this paper is to provide a comprehensive, principled, and self-contained introduction to Kolmogorov rate-distortion optimal approximation through deep neural networks
  • Deep neural networks provide optimal approximation of a very wide range of functions and function classes used in mathematical signal processing
  • Most closely related to the framework we develop here is the recent paper by Shaham, Cloninger, and Coifman [45], which shows that for functions that are sparse in specific wavelet frames, the best M-weight approximation rate of three-layer neural networks is at least as high as the best M-term approximation rate in piecewise linear wavelet frames
  • Impossibility results for finite-depth networks: this section makes a formal case for deep networks by establishing that, for non-constant periodic functions, finite-width deep networks require asymptotically smaller connectivity than finite-depth wide networks as the function’s “highest frequency” grows (an illustrative sketch follows below)
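To make the depth-versus-width gap concrete, here is a small hedged sketch in the spirit of the depth-separation argument of [19]; it illustrates the general phenomenon rather than reproducing this section's proof. Composing a width-3 ReLU "tooth" k times yields a sawtooth with 2^k linear pieces using only about 3k ReLU units, whereas a single-hidden-layer ReLU network needs on the order of 2^k units to produce that many pieces, since each unit contributes at most one breakpoint.

```python
# Hedged illustration (in the spirit of the depth-separation argument of [19],
# not this section's proof): depth buys oscillations exponentially fast.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tooth(x):
    # One width-3 ReLU layer realizing the hat function on [0, 1].
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def sawtooth(x, k):
    # k-fold composition: depth k, roughly 3*k ReLU units in total.
    for _ in range(k):
        x = tooth(x)
    return x

x = np.linspace(0.0, 1.0, 2**16 + 1)
for k in range(1, 9):
    y = sawtooth(x, k)
    # Count slope-sign changes as a proxy for the number of linear pieces.
    pieces = 1 + np.count_nonzero(np.diff(np.sign(np.diff(y))))
    print(f"depth {k}: ~{3 * k} ReLUs produce {pieces} linear pieces; "
          f"one hidden layer would need about {pieces - 1} ReLUs for the same count")
```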
Results
  • Recall that x → |x| = ρ(x) + ρ(−x) can be implemented by a 2-layer network (a minimal sketch of this realization is given after this list) and consider the realization of x → g_{log(a)−log(2π)}(C_a x), a ∈ ℝ+, as developed in the proof of Proposition III.1.
  • [48], [49] Let d ∈ ℕ and Ω ⊂ ℝ^d. The effective best M-term approximation rate of the function class C ⊂ L^2(Ω) in the representation system D ⊂ L^2(Ω) satisfies γ^{*,eff}(C, D) ≤ γ^*(C).
  • The authors start by noting that, thanks to the polynomial depth search constraint, the indices of the elements of D participating in the best M-term representation of f can be represented by a total of M·log(π(M)) bits which, as π is a polynomial, is at most C·M·log(M) bits for some constant C.
  • The two restrictions underlying the concept of effective best M-term approximation through representation systems, namely polynomial depth search and bounded coefficients, are addressed in the context of approximation through deep neural networks.
  • The authors can conclude that the tree-like structure of neural networks automatically guarantees what the authors had to enforce through the polynomial depth search constraint in the case of best M-term approximation.
  • The second restriction made in the definition of effective best M-term approximation, namely bounded coefficients, will be replaced by a more generous growth condition on the network weights; the authors will allow the magnitude of the weights to grow polynomially in M.
  • A key ingredient of the proof of Theorem V.9 is the following result, which establishes a fundamental lower bound on the connectivity of networks with quantized weights achieving uniform error ε over a given function class C.
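As a companion to the first Results item, here is a minimal sketch of the two-layer ReLU realization of the absolute value. The identity |x| = ρ(x) + ρ(−x) is exact, so no approximation error is involved; the weights below ([1, −1] into the hidden layer, [1, 1] out of it, zero biases) are the obvious choice and not a construction quoted from the paper.

```python
# Minimal sketch of the exact two-layer ReLU realization |x| = rho(x) + rho(-x)
# mentioned above; the weight choice here is the obvious one, not quoted from the paper.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def abs_net(x):
    # Hidden layer: weights [1, -1], zero biases; output layer: weights [1, 1], zero bias.
    return relu(x) + relu(-x)

x = np.linspace(-3.0, 3.0, 13)
print(np.allclose(abs_net(x), np.abs(x)))  # True: the realization is exact
```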
Conclusion
  • The result just established applies to networks that have each weight represented by a finite number of bits scaling according to (log(ε^{-1}))^q, for some q ∈ ℕ, while guaranteeing that the underlying encoder-decoder pair achieves uniform error ε over C.
  • Proposition V.12 says that the connectivity of networks with quantized weights achieving uniform approximation error ε over a function class C must grow at least as fast as ε^{-1/γ^*(C)} as ε → 0; its proof, by virtue of constructing an encoder-decoder pair that achieves this growth rate, also provides an achievability result.
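The bit-counting reasoning behind these two items can be summarized as follows; this is a hedged reconstruction with schematic constants, not a verbatim statement from the paper. If a network achieving uniform error ε over C has M(ε) nonzero weights, each quantized to O((log(ε^{-1}))^q) bits, and its topology is encodable with O(M(ε) log M(ε)) bits, then the total description length can be no smaller than the minimax description length of C, which by definition of the optimal exponent γ^*(C) scales like ε^{-1/γ^*(C)}:

```latex
% Hedged reconstruction of the counting argument; constants and lower-order terms are schematic.
\underbrace{M(\varepsilon)\,\bigl(\log(\varepsilon^{-1})\bigr)^{q}}_{\text{weight bits}}
\;+\;
\underbrace{O\!\bigl(M(\varepsilon)\log M(\varepsilon)\bigr)}_{\text{topology bits}}
\;\gtrsim\;
\varepsilon^{-1/\gamma^{*}(\mathcal{C})},
\qquad \varepsilon \to 0.
```

Consequently the connectivity M(ε) must grow at least like ε^{-1/γ^*(C)} up to polylogarithmic factors, which is precisely the rate attained by the constructive encoder-decoder pair in the proof of Proposition V.12.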
Funding
  • Bölcskei was supported in part by a gift from Huawei’s Future Network Theory Lab
  • Elbrächter was supported by the Austrian Science Fund (FWF) via project P 30148
References
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
  • Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik, “Comparison of learning algorithms for handwritten digit recognition,” International Conference on Artificial Neural Networks, pp. 53–60, 1995.
  • G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016. [Online]. Available: http://www.nature.com/nature/journal/v529/n7587/abs/nature16961.html#supplementary-information
  • Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14539
  • I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986. [Online]. Available: http://dx.doi.org/10.1038/323533a0
  • M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
  • G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989. [Online]. Available: http://dx.doi.org/10.1007/BF02551274
  • K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251 – 257, 1991. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T
  • A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, 1993.
  • H. Bolcskei, P. Grohs, G. Kutyniok, and P. Petersen, “Optimal approximation with sparsely connected deep neural networks,” SIAM Journal on Mathematics of Data Science, 2019, to appear.
  • K. Grochenig and S. Samarah, “Nonlinear approximation with local Fourier bases,” Constructive Approximation, vol. 16, no. 3, pp. 317–331, Jul 2000.
  • K. Grochenig, Foundations of time-frequency analysis. Springer Science & Business Media, 2013.
  • L. Demanet and L. Ying, “Wave atoms and sparsity of oscillatory patterns,” Appl. Comput. Harmon. Anal., vol. 23, no. 3, pp. 368–387, 2007.
  • C. Fefferman, “Reconstructing a neural net from its output,” Revista Matematica Iberoamericana, vol. 10, no. 3, pp. 507–555, 1994.
  • P. Petersen and F. Voigtlaender, “Optimal approximation of piecewise smooth functions using deep ReLU neural networks,” Neural Networks, vol. 108, pp. 296–330, Sep. 2018.
  • D. Yarotsky, “Error bounds for approximations with deep ReLU networks,” Neural Networks, vol. 94, pp. 103–114, 2017.
  • M. Telgarsky, “Representation benefits of deep feedforward networks,” arXiv:1509.08101, 2015.
  • K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
  • C. Schwab and J. Zech, “Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ,” Analysis and Applications (Singapore), 2018.
  • M. H. Stone, “The generalized Weierstrass approximation theorem,” Mathematics Magazine, vol. 21, pp. 167–184, 1948.
  • S. Liang and R. Srikant, “Why deep neural networks for function approximation?” International Conference on Learning Representations, 2017.
  • P. L. Bartlett, “The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network,” IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 525–536, 1998.
  • A. R. Barron, “Approximation and estimation bounds for artificial neural networks,” Mach. Learn., vol. 14, no. 1, pp. 115–133, 1994. [Online]. Available: http://dx.doi.org/10.1007/BF00993164
  • C. K. Chui, X. Li, and H. N. Mhaskar, “Neural networks for localized approximation,” Math. Comp., vol. 63, no. 208, pp. 607–623, 1994. [Online]. Available: http://dx.doi.org/10.2307/2153285
  • R. DeVore, K. Oskolkov, and P. Petrushev, “Approximation by feed-forward neural networks,” Ann. Numer. Math., vol. 4, pp. 261–287, 1996.
  • E. J. Candes, “Ridgelets: Theory and Applications,” 1998, Ph.D. thesis, Stanford University.
  • H. N. Mhaskar, “Neural networks for optimal approximation of smooth and analytic functions,” Neural Comput., vol. 8, no. 1, pp. 164–177, 1996.
  • H. Mhaskar and C. Micchelli, “Degree of approximation by neural and translation networks with a single hidden layer,” Adv. Appl. Math., vol. 16, no. 2, pp. 151–183, 1995.
  • K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
  • H. N. Mhaskar, “Approximation properties of a multilayered feedforward artificial neural network,” Advances in Computational Mathematics, vol. 1, no. 1, pp. 61–80, Feb 1993. [Online]. Available: https://doi.org/10.1007/BF02070821
  • K.-I. Funahashi, “On the approximate realization of continuous mappings by neural networks,” Neural Networks, vol. 2, no. 3, pp. 183–192, 1989. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0893608089900038
  • T. Nguyen-Thien and T. Tran-Cong, “Approximation of functions and their derivatives: A neural network implementation with applications,” Appl. Math. Model., vol. 23, no. 9, pp. 687–704, 1999. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0307904X99000062
  • R. Eldan and O. Shamir, “The power of depth for feedforward neural networks,” in Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, 2016, pp. 907–940.
  • H. N. Mhaskar and T. Poggio, “Deep vs. shallow networks: An approximation theory perspective,” Analysis and Applications, vol. 14, no. 6, pp. 829–848, 2016. [Online]. Available: http://www.worldscientific.com/doi/abs/10.1142/S0219530516400042
  • N. Cohen, O. Sharir, and A. Shashua, “On the expressive power of deep learning: A tensor analysis,” in Proceedings of the 29th Conference on Learning Theory, vol. 49, 2016, pp. 698–728.
  • N. Cohen and A. Shashua, “Convolutional rectifier networks as generalized tensor decompositions,” in Proceedings of the 33rd International Conference on Machine Learning, vol. 48, 2016, pp. 955–963.
  • P. Grohs, F. Hornung, A. Jentzen, and P. von Wurstemberger, “A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations,” arXiv e-prints, p. arXiv:1809.02362, Sep. 2018.
  • J. Berner, P. Grohs, and A. Jentzen, “Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations,” arXiv e-prints, p. arXiv:1809.03062, Sep. 2018.
  • C. Beck, S. Becker, P. Grohs, N. Jaafari, and A. Jentzen, “Solving stochastic differential equations and Kolmogorov equations by means of deep learning,” arXiv e-prints, p. arXiv:1806.00421, Jun. 2018.
  • D. Elbrachter, P. Grohs, A. Jentzen, and C. Schwab, “DNN expression rate analysis of high-dimensional PDEs: Application to option pricing,” arXiv preprint arXiv:1809.07669, 2018.
  • S. Ellacott, “Aspects of the numerical analysis of neural networks,” Acta Numer., vol. 3, pp. 145–202, 1994.
  • A. Pinkus, “Approximation theory of the MLP model in neural networks,” Acta Numer., vol. 8, pp. 143–195, 1999.
  • U. Shaham, A. Cloninger, and R. R. Coifman, “Provable approximation properties for deep neural networks,” Appl. Comput. Harmon. Anal., vol. 44, no. 3, pp. 537–557, May 2018. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1509.html#ShahamCC15
  • R. A. DeVore and G. G. Lorentz, Constructive Approximation. Springer, 1993.
  • R. A. DeVore, “Nonlinear approximation,” Acta Numerica, vol. 7, pp. 51–150, 1998.
  • D. L. Donoho, “Unconditional bases are optimal bases for data compression and for statistical estimation,” Appl. Comput. Harmon. Anal., vol. 1, no. 1, pp. 100 – 115, 1993. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1063520383710080
  • P. Grohs, “Optimally sparse data representations,” in Harmonic and Applied Analysis. Springer, 2015, pp. 199–248.
  • E. Ott, Chaos in Dynamical Systems. Cambridge Univ. Press, 2002.
  • A. Cohen, W. Dahmen, I. Daubechies, and R. A. DeVore, “Tree approximation and optimal encoding,” Appl. Comput. Harmon. Anal., vol. 11, no. 2, pp. 192–226, 2001.
  • D. L. Donoho, “Sparse components of images and optimal atomic decompositions,” Constr. Approx., vol. 17, no. 3, pp. 353–382, 2001. [Online]. Available: http://dx.doi.org/10.1007/s003650010032
  • P. Grohs, S. Keiper, G. Kutyniok, and M. Schafer, “Cartoon approximation with α-curvelets,” J. Fourier Anal. Appl., vol. 22, no. 6, pp. 1235–1293, 2016. [Online]. Available: http://dx.doi.org/10.1007/s00041-015-9446-6
  • P. Grohs, S. Keiper, G. Kutyniok, and M. Schafer, “α-molecules,” Appl. Comput. Harmon. Anal., vol. 41, no. 1, pp. 297–336, 2016. [Online]. Available: http://dx.doi.org/10.1016/j.acha.2015.10.009
  • I. Daubechies, Ten Lectures on Wavelets. SIAM, 1992.
  • E. J. Candes and D. L. Donoho, “New tight frames of curvelets and optimal representations of objects with piecewise C2 singularities,” Comm. Pure Appl. Math., vol. 57, pp. 219–266, 2002.
  • K. Guo, G. Kutyniok, and D. Labate, “Sparse multidimensional representations using anisotropic dilation and shear operators,” in Wavelets and Splines (Athens, GA, 2005). Nashboro Press, Nashville, TN, 2006, pp. 189–201.
  • P. Grohs and G. Kutyniok, “Parabolic molecules,” Found. Comput. Math., vol. 14, pp. 299–337, 2014.
  • M. Unser, “Ten good reasons for using spline wavelets,” Wavelet Applications in Signal and Image Processing V, vol. 3169, pp. 422–431, 1997.
  • C. K. Chui and J.-Z. Wang, “On compactly supported spline wavelets and a duality principle,” Transactions of the American Mathematical Society, 1992.
  • G. B. Folland, Harmonic Analysis in Phase Space. (AM-122). Princeton University Press, 1989. [Online]. Available: http://www.jstor.org/stable/j.ctt1b9rzs2
  • C. L. Fefferman, “The uncertainty principle,” Bull. Amer. Math. Soc. (N.S.), vol. 9, no. 2, pp. 129–206, 1983. [Online]. Available: https://doi.org/10.1090/S0273-0979-1983-15154-6
  • H. Feichtinger, “On a new Segal algebra,” Monatshefte für Mathematik, vol. 92, pp. 269–289, 1981.
  • G. B. Folland, Real Analysis: Modern Techniques and Their Applications. John Wiley & Sons, 2013.
  • A. Zygmund, Trigonometric series. Cambridge University Press, 2002.
  • C. Frenzen, T. Sasao, and J. T. Butler, “On the number of segments needed in a piecewise linear approximation,” Journal of Computational and Applied Mathematics, vol. 234, no. 2, pp. 437 – 446, 2010.