# Deep Neural Network Approximation Theory

arXiv:1901.02220, 2019.

Abstract:

Deep neural networks have become state-of-the-art technology for a wide range of practical machine learning tasks such as image classification, handwritten digit recognition, speech recognition, or game intelligence. This paper develops the fundamental limits of learning in deep neural networks by characterizing what is possible if no con…

Introduction

- Triggered by the availability of vast amounts of training data and drastic improvements in computing power, deep neural networks have become state-of-the-art technology for a wide range of practical machine learning tasks such as image classification [1], handwritten digit recognition [2], speech recognition [3], or game intelligence [4].
- There exist a constant C > 0 and a polynomial π such that for all D ∈ R+, f ∈ S_D, and ε ∈ (0, 1/2), there is a network Ψ_{f,ε} ∈ NN_{∞,∞,1,1} satisfying L(Ψ_{f,ε}) ≤ C D (log(ε⁻¹))², W(Ψ_{f,ε}) ≤ 23, B(Ψ_{f,ε}) ≤ max{1/D, D} π(ε⁻¹), and ‖Ψ_{f,ε} − f‖_∞ ≤ ε.
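The depth bound in this statement grows only polylogarithmically in the target accuracy; a quick numeric sketch (C = 1 is an illustrative placeholder, since the proposition only asserts existence of some constant C > 0):

```python
import math

# Depth bound from the proposition: L(Psi) <= C * D * (log(1/eps))**2,
# at fixed width W <= 23. C = 1.0 is a placeholder constant; the
# proposition only guarantees that some C > 0 exists.
def depth_bound(D, eps, C=1.0):
    return C * D * math.log(1.0 / eps) ** 2

for eps in (1e-1, 1e-2, 1e-4):
    print(eps, depth_bound(D=1.0, eps=eps))
# squaring 1/eps (i.e. doubling log(1/eps)) quadruples the depth bound
```

Note the contrast with the width and weight bounds: the width stays fixed at 23, and only the weight magnitudes pick up the polynomial factor π(ε⁻¹).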

Highlights

- Triggered by the availability of vast amounts of training data and drastic improvements in computing power, deep neural networks have become state-of-the-art technology for a wide range of practical machine learning tasks such as image classification [1], handwritten digit recognition [2], speech recognition [3], or game intelligence [4]
- It is natural to ask how the complexity of a neural network approximating every function in C to within a prescribed accuracy depends on the complexity of C
- The purpose of this paper is to provide a comprehensive, principled, and self-contained introduction to Kolmogorov rate-distortion optimal approximation through deep neural networks
- Deep neural networks provide optimal approximation of a very wide range of functions and function classes used in mathematical signal processing
- Most closely related to the framework we develop here is the recent paper by Shaham, Cloninger, and Coifman [45], which shows that for functions that are sparse in specific wavelet frames, the best M-weight approximation rate of three-layer neural networks is at least as high as the best M-term approximation rate in piecewise linear wavelet frames
- Impossibility results for finite-depth networks: this section makes a formal case for deep networks by establishing that, for non-constant periodic functions, finite-width deep networks require connectivity that grows asymptotically slower in the function’s “highest frequency” than that of finite-depth wide networks
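The depth-vs-width gap behind this claim can be seen concretely in the classic sawtooth construction (in the spirit of Telgarsky [19]): composing a ReLU “tent” map with itself L times yields 2^L linear pieces using only O(L) connections, whereas a one-hidden-layer ReLU network with W neurons realizes at most W + 1 linear pieces. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # one ReLU layer realizing the tent map on [0, 1]
    return 2 * relu(x) - 4 * relu(x - 0.5)

def sawtooth(x, L):
    # composing the tent map L times yields 2**(L-1) teeth,
    # i.e. 2**L linear pieces, with only O(L) connections
    for _ in range(L):
        x = hat(x)
    return x

xs = np.linspace(0, 1, 1025)
ys = sawtooth(xs, 5)
# count linear pieces via sign changes of the discrete slope
slopes = np.sign(np.diff(ys))
pieces = 1 + np.count_nonzero(np.diff(slopes))
print(pieces)  # 32 linear pieces from a depth-5 network
```

Matching this oscillation count with a fixed-depth network forces the width, and hence the connectivity, to grow linearly in the number of pieces.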

Results

- Recall that x → |x| = ρ(x) + ρ(−x) can be implemented by a two-layer network, and consider the realization of x → g_{log(a)−log(2π)}(C_a x), a ∈ R+, as developed in the proof of Proposition III.1.
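The identity |x| = ρ(x) + ρ(−x), with ρ(x) = max{x, 0} the ReLU, can be checked directly; a minimal NumPy sketch of the corresponding two-layer network (the weight-matrix layout is one illustrative choice):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def abs_net(x):
    # Two-layer ReLU network realizing x -> |x|:
    # hidden layer computes (rho(x), rho(-x)), output layer sums them.
    W1 = np.array([[1.0], [-1.0]])   # hidden-layer weights
    w2 = np.array([1.0, 1.0])        # output-layer weights
    return w2 @ relu(W1 @ np.atleast_1d(x))

xs = np.linspace(-3, 3, 7)
assert np.allclose([abs_net(x) for x in xs], np.abs(xs))
```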
- [48], [49]: Let d ∈ N and Ω ⊂ R^d. The effective best M-term approximation rate of the function class C ⊂ L²(Ω) in the representation system D ⊂ L²(Ω) satisfies γ^{*,eff}(C, D) ≤ γ^*(C).
- The authors start by noting that, thanks to the polynomial depth-search constraint, the indices of the elements of D participating in the best M-term representation of f can be represented by a total of M log(π(M)) ≤ C M log(M) bits, for some constant C.
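The bit count is elementary to reproduce: with a polynomial depth-search constraint of the form π(M) = M^k (an illustrative choice of π), each of the M active indices lies in {1, …, π(M)} and hence costs ⌈log₂ π(M)⌉ = ⌈k log₂ M⌉ bits:

```python
import math

# Bits needed to encode the M active indices under a depth-search
# constraint pi(M) = M**k (illustrative polynomial): each index lies
# in {1, ..., pi(M)}, costing ceil(k * log2(M)) bits.
def index_bits(M, k):
    return M * math.ceil(k * math.log2(M))

M, k = 1024, 3
print(index_bits(M, k))           # total bits for all M indices
print(int(k * M * math.log2(M)))  # matches C * M * log(M) with C = k
```

This is exactly the O(M log M) budget the argument charges for identifying the participating dictionary elements.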
- The two restrictions underlying the concept of effective best M-term approximation through representation systems, namely polynomial depth search and bounded coefficients, are addressed in the context of approximation through deep neural networks.
- The authors can conclude that the tree-like structure of neural networks automatically guarantees what had to be enforced through the polynomial depth-search constraint in the case of best M-term approximation.
- The second restriction made in the definition of effective best M-term approximation, namely bounded coefficients, will be replaced by a more generous growth condition on the network weights; the magnitude of the weights is allowed to grow polynomially in M.
- A key ingredient of the proof of Theorem V.9 is the following result, which establishes a fundamental lower bound on the connectivity of networks with quantized weights achieving uniform error ε over a given function class C.

Conclusion

- The result just established applies to networks in which each weight is represented by a finite number of bits scaling as (log(ε⁻¹))^q, for some q ∈ N, while guaranteeing that the underlying encoder-decoder pair achieves uniform error ε over C.
- Proposition V.12 says that the connectivity of networks with quantized weights achieving uniform approximation error ε over a function class C must grow at least like ε^(−1/γ*(C)) as ε → 0; its proof, by constructing an encoder-decoder pair that achieves this growth rate, also provides a matching achievability result.
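As a quick numeric illustration of this scaling (with a hypothetical optimal exponent γ*(C) = 1/2; the actual exponent is determined by the function class):

```python
# Scaling of the connectivity lower bound M(eps) ~ eps**(-1/gamma) for a
# hypothetical optimal exponent gamma = 1/2; the true exponent gamma*(C)
# depends on the function class C under consideration.
def min_connectivity(eps, gamma):
    return eps ** (-1.0 / gamma)

for eps in (1e-1, 1e-2, 1e-3):
    print(eps, min_connectivity(eps, gamma=0.5))
# with gamma = 1/2, every 10x reduction in eps costs ~100x more connectivity
```

The larger γ*(C), i.e. the "simpler" the class, the milder the connectivity growth required to meet a given accuracy target.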

Funding

- Bölcskei was supported in part by a gift from Huawei’s Future Network Theory Lab
- Elbrächter was supported by the Austrian Science Fund via project P 30148

References

- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
- Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik, “Comparison of learning algorithms for handwritten digit recognition,” International Conference on Artificial Neural Networks, pp. 53–60, 1995.
- G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
- D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016. [Online]. Available: http://www.nature.com/nature/journal/v529/n7587/abs/nature16961.html#supplementary-information
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14539
- I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
- D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986. [Online]. Available: http://dx.doi.org/10.1038/323533a0
- M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
- G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989. [Online]. Available: http://dx.doi.org/10.1007/BF02551274
- K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251 – 257, 1991. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T
- A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, 1993.
- H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen, “Optimal approximation with sparsely connected deep neural networks,” SIAM Journal on Mathematics of Data Science, 2019, to appear.
- K. Gröchenig and S. Samarah, “Nonlinear approximation with local Fourier bases,” Constructive Approximation, vol. 16, no. 3, pp. 317–331, Jul 2000.
- K. Gröchenig, Foundations of Time-Frequency Analysis. Springer Science & Business Media, 2013.
- L. Demanet and L. Ying, “Wave atoms and sparsity of oscillatory patterns,” Appl. Comput. Harmon. Anal., vol. 23, no. 3, pp. 368–387, 2007.
- C. Fefferman, “Reconstructing a neural net from its output,” Revista Matematica Iberoamericana, vol. 10, no. 3, pp. 507–555, 1994.
- P. Petersen and F. Voigtlaender, “Optimal approximation of piecewise smooth functions using deep ReLU neural networks,” Neural Networks, vol. 108, pp. 296–330, Sep. 2018.
- D. Yarotsky, “Error bounds for approximations with deep ReLU networks,” Neural Networks, vol. 94, pp. 103–114, 2017.
- M. Telgarsky, “Representation benefits of deep feedforward networks,” arXiv:1509.08101, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
- C. Schwab and J. Zech, “Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ,” Analysis and Applications (Singapore), 2018.
- M. H. Stone, “The generalized Weierstrass approximation theorem,” Mathematics Magazine, vol. 21, pp. 167–184, 1948.
- S. Liang and R. Srikant, “Why deep neural networks for function approximation?” International Conference on Learning Representations, 2017.
- P. L. Bartlett, “The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network,” IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 525–536, 1998.
- A. R. Barron, “Approximation and estimation bounds for artificial neural networks,” Mach. Learn., vol. 14, no. 1, pp. 115–133, 1994. [Online]. Available: http://dx.doi.org/10.1007/BF00993164
- C. K. Chui, X. Li, and H. N. Mhaskar, “Neural networks for localized approximation,” Math. Comp., vol. 63, no. 208, pp. 607–623, 1994. [Online]. Available: http://dx.doi.org/10.2307/2153285
- R. DeVore, K. Oskolkov, and P. Petrushev, “Approximation by feed-forward neural networks,” Ann. Numer. Math., vol. 4, pp. 261–287, 1996.
- E. J. Candès, “Ridgelets: Theory and Applications,” Ph.D. thesis, Stanford University, 1998.
- H. N. Mhaskar, “Neural networks for optimal approximation of smooth and analytic functions,” Neural Comput., vol. 8, no. 1, pp. 164–177, 1996.
- H. Mhaskar and C. Micchelli, “Degree of approximation by neural and translation networks with a single hidden layer,” Adv. Appl. Math., vol. 16, no. 2, pp. 151–183, 1995.
- K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
- H. N. Mhaskar, “Approximation properties of a multilayered feedforward artificial neural network,” Advances in Computational Mathematics, vol. 1, no. 1, pp. 61–80, Feb 1993. [Online]. Available: https://doi.org/10.1007/BF02070821
- K.-I. Funahashi, “On the approximate realization of continuous mappings by neural networks,” Neural Networks, vol. 2, no. 3, pp. 183–192, 1989. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0893608089900038
- T. Nguyen-Thien and T. Tran-Cong, “Approximation of functions and their derivatives: A neural network implementation with applications,” Appl. Math. Model., vol. 23, no. 9, pp. 687–704, 1999. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0307904X99000062
- R. Eldan and O. Shamir, “The power of depth for feedforward neural networks,” in Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, 2016, pp. 907–940.
- H. N. Mhaskar and T. Poggio, “Deep vs. shallow networks: An approximation theory perspective,” Analysis and Applications, vol. 14, no. 6, pp. 829–848, 2016. [Online]. Available: http://www.worldscientific.com/doi/abs/10.1142/S0219530516400042
- N. Cohen, O. Sharir, and A. Shashua, “On the expressive power of deep learning: A tensor analysis,” in Proceedings of the 29th Conference on Learning Theory, vol. 49, 2016, pp. 698–728.
- N. Cohen and A. Shashua, “Convolutional rectifier networks as generalized tensor decompositions,” in Proceedings of the 33rd International Conference on Machine Learning, vol. 48, 2016, pp. 955–963.
- P. Grohs, F. Hornung, A. Jentzen, and P. von Wurstemberger, “A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations,” arXiv e-prints, p. arXiv:1809.02362, Sep. 2018.
- J. Berner, P. Grohs, and A. Jentzen, “Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations,” arXiv e-prints, p. arXiv:1809.03062, Sep. 2018.
- C. Beck, S. Becker, P. Grohs, N. Jaafari, and A. Jentzen, “Solving stochastic differential equations and Kolmogorov equations by means of deep learning,” arXiv e-prints, p. arXiv:1806.00421, Jun. 2018.
- D. Elbrächter, P. Grohs, A. Jentzen, and C. Schwab, “DNN expression rate analysis of high-dimensional PDEs: Application to option pricing,” arXiv preprint arXiv:1809.07669, 2018.
- S. Ellacott, “Aspects of the numerical analysis of neural networks,” Acta Numer., vol. 3, pp. 145–202, 1994.
- A. Pinkus, “Approximation theory of the MLP model in neural networks,” Acta Numer., vol. 8, pp. 143–195, 1999.
- U. Shaham, A. Cloninger, and R. R. Coifman, “Provable approximation properties for deep neural networks,” Appl. Comput. Harmon. Anal., vol. 44, no. 3, pp. 537–557, May 2018. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1509.html#ShahamCC15
- R. A. DeVore and G. G. Lorentz, Constructive Approximation. Springer, 1993.
- R. A. DeVore, “Nonlinear approximation,” Acta Numerica, vol. 7, pp. 51–150, 1998.
- D. L. Donoho, “Unconditional bases are optimal bases for data compression and for statistical estimation,” Appl. Comput. Harmon. Anal., vol. 1, no. 1, pp. 100 – 115, 1993. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1063520383710080
- P. Grohs, “Optimally sparse data representations,” in Harmonic and Applied Analysis. Springer, 2015, pp. 199–248.
- E. Ott, Chaos in Dynamical Systems. Cambridge Univ. Press, 2002.
- A. Cohen, W. Dahmen, I. Daubechies, and R. A. DeVore, “Tree approximation and optimal encoding,” Appl. Comput. Harmon. Anal., vol. 11, no. 2, pp. 192–226, 2001.
- D. L. Donoho, “Sparse components of images and optimal atomic decompositions,” Constr. Approx., vol. 17, no. 3, pp. 353–382, 2001. [Online]. Available: http://dx.doi.org/10.1007/s003650010032
- P. Grohs, S. Keiper, G. Kutyniok, and M. Schäfer, “Cartoon approximation with α-curvelets,” J. Fourier Anal. Appl., vol. 22, no. 6, pp. 1235–1293, 2016. [Online]. Available: http://dx.doi.org/10.1007/s00041-015-9446-6
- P. Grohs, S. Keiper, G. Kutyniok, and M. Schäfer, “α-molecules,” Appl. Comput. Harmon. Anal., vol. 41, no. 1, pp. 297–336, 2016. [Online]. Available: http://dx.doi.org/10.1016/j.acha.2015.10.009
- I. Daubechies, Ten Lectures on Wavelets. SIAM, 1992.
- E. J. Candès and D. L. Donoho, “New tight frames of curvelets and optimal representations of objects with piecewise C² singularities,” Comm. Pure Appl. Math., vol. 57, pp. 219–266, 2002.
- K. Guo, G. Kutyniok, and D. Labate, “Sparse multidimensional representations using anisotropic dilation and shear operators,” in Wavelets and Splines (Athens, GA, 2005). Nashboro Press, Nashville, TN, 2006, pp. 189–201.
- P. Grohs and G. Kutyniok, “Parabolic molecules,” Found. Comput. Math., vol. 14, pp. 299–337, 2014.
- M. Unser, “Ten good reasons for using spline wavelets,” Wavelet Applications in Signal and Image Processing V, vol. 3169, pp. 422–431, 1997.
- C. K. Chui and J.-Z. Wang, “On compactly supported spline wavelets and a duality principle,” Transactions of the American Mathematical Society, 1992.
- G. B. Folland, Harmonic Analysis in Phase Space. (AM-122). Princeton University Press, 1989. [Online]. Available: http://www.jstor.org/stable/j.ctt1b9rzs2
- C. L. Fefferman, “The uncertainty principle,” Bull. Amer. Math. Soc. (N.S.), vol. 9, no. 2, pp. 129–206, 1983. [Online]. Available: https://doi.org/10.1090/S0273-0979-1983-15154-6
- H. Feichtinger, “On a new Segal algebra,” Monatshefte für Mathematik, vol. 92, pp. 269–289, 1981.
- G. B. Folland, Real Analysis: Modern Techniques and Their Applications. John Wiley & Sons, 2013.
- A. Zygmund, Trigonometric series. Cambridge University Press, 2002.
- C. Frenzen, T. Sasao, and J. T. Butler, “On the number of segments needed in a piecewise linear approximation,” Journal of Computational and Applied Mathematics, vol. 234, no. 2, pp. 437 – 446, 2010.
