Measuring Model Complexity of Neural Networks with Curve Activation Functions

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, July 2020, pp. 1521–1531.

DOI: https://doi.org/10.1145/3394486.3403203
Links: arxiv.org | dl.acm.org | academic.microsoft.com | dblp.uni-trier.de

Abstract:

It is fundamental to measure the model complexity of deep neural networks. A good model complexity measure can help to tackle many challenging problems, such as overfitting detection, model selection, and performance improvement. The existing literature on model complexity mainly focuses on neural networks with piecewise linear activation functions. […]

Introduction
  • Deep neural networks have gained great popularity in tackling various real-world applications, such as machine translation [35], speech recognition [5] and computer vision [13].
  • The influences of model structure on complexity have been investigated, including layer width, network depth, and layer type.
  • With the exploration of deep network structures, some recent studies pay attention to the effectiveness of deep architectures in increasing model complexity, known as depth efficiency [2, 6, 11, 25].
  • Bounds on the model complexity of some specific model structures have been proposed, from sum-product networks [8] to piecewise linear neural networks [27, 31].
Highlights
  • Deep neural networks have gained great popularity in tackling various real-world applications, such as machine translation [35], speech recognition [5] and computer vision [13]
  • The recent progress in model complexity measure directly facilitates the advances of many directions of deep neural networks, such as model architecture design, model selection, performance improvement [17], and overfitting detection [16]
  • We develop a complexity measure for deep fully-connected neural networks with curve activation functions
  • We provide an upper bound on the number of linear regions formed by the linear approximation neural network (LANN), and define the complexity measure based on this upper bound (a toy region-counting sketch follows this list)
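The intuition behind the LANN-based measure can be seen on a toy example. The sketch below is our own illustration under assumptions, not the authors' LANN algorithm: it approximates tanh in a one-hidden-layer network on a 1-D input by a piecewise linear interpolant and counts the linear regions the surrogate induces on an input interval. Names such as `count_linear_regions` and the interval choices are hypothetical; the paper's construction and upper bound apply to general deep fully-connected networks.

```python
# Minimal sketch: piecewise-linear surrogate of a curve activation and a
# count of the linear regions it induces on a 1-D input interval.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: f(x) = sum_i v_i * tanh(w[i] * x + b[i]).
# The output weights v_i do not move breakpoints, so they are omitted here.
n_hidden = 8
w = rng.normal(size=n_hidden)
b = rng.normal(size=n_hidden)

# Piecewise linear interpolant of tanh with k segments on [-4, 4];
# its knots are the points where the surrogate changes slope.
k = 10
knots = np.linspace(-4.0, 4.0, k + 1)

def count_linear_regions(x_lo=-3.0, x_hi=3.0):
    """Count linear regions of the surrogate network on [x_lo, x_hi].

    Each hidden unit sees the affine input w[i]*x + b[i], so the surrogate
    can change slope only at pre-images of the knots, x = (t - b[i]) / w[i].
    Counting these gives an upper bound on the number of linear regions
    (slopes could coincidentally match across a knot).
    """
    breaks = []
    for wi, bi in zip(w, b):
        if abs(wi) > 1e-12:
            breaks.extend((knots - bi) / wi)
    inside = [x for x in breaks if x_lo < x < x_hi]
    # Distinct breakpoints split the interval into (#breakpoints + 1) pieces.
    return len(np.unique(np.round(inside, 9))) + 1

print("linear regions (upper bound):", count_linear_regions())
```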
Conclusion
  • The authors develop a complexity measure for deep neural networks with curve activation functions.
  • After providing an upper bound on the number of linear regions formed by LANNs, the authors define the complexity measure based on this upper bound.
  • Viewed through this complexity measure, further analysis reveals that L1 and L2 regularization suppress the increase of model complexity.
  • Based on this finding, the authors propose two approaches to prevent overfitting by directly constraining model complexity: neuron pruning and customized L1 regularization (sketched after this list).
  • Several future directions remain, including generalizing the proposed linear approximation neural network to other network architectures (e.g., CNNs and RNNs).
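As a rough illustration of the two complexity-control approaches named above, the sketch below applies magnitude-based neuron pruning and a per-neuron ("customized") L1 penalty to a single hidden layer stored as numpy arrays. This is an assumed simplification for exposition, not the paper's exact procedure; `prune_neurons`, `custom_l1_penalty`, the keep ratio, and the per-neuron coefficients are all hypothetical.

```python
# Minimal sketch of two complexity-control ideas on one hidden layer.
import numpy as np

rng = np.random.default_rng(1)
W_in = rng.normal(size=(16, 4))    # hidden layer: 16 neurons, 4 inputs
W_out = rng.normal(size=(1, 16))   # readout weights, one column per neuron

def prune_neurons(W_in, W_out, keep_ratio=0.75):
    """Drop the hidden neurons whose outgoing weight magnitude is smallest."""
    importance = np.abs(W_out).sum(axis=0)          # one score per neuron
    k = max(1, int(keep_ratio * W_in.shape[0]))
    keep = np.argsort(importance)[-k:]              # indices of the k largest
    return W_in[keep, :], W_out[:, keep]

def custom_l1_penalty(W_out, coeff_per_neuron):
    """L1 penalty with a separate coefficient per hidden neuron, so neurons
    contributing most to complexity can be penalized more heavily."""
    return float(np.sum(coeff_per_neuron * np.abs(W_out).sum(axis=0)))

W_in_p, W_out_p = prune_neurons(W_in, W_out)
penalty = custom_l1_penalty(W_out, coeff_per_neuron=np.linspace(0.0, 1.0, 16))
print(W_in_p.shape, W_out_p.shape, round(penalty, 3))
```

In a training loop, the penalty term would simply be added to the loss before back-propagation; the pruning step would be applied periodically or after training.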
Tables
  • Table 1: Model structure of the DNNs in our experiments
  • Table 2: Complexity measure and number of linear regions on MOON
  • Table 3: Comparison of approximation errors on the training and test datasets
Related work
  • The study of model complexity dates back several decades. In this section, we review related work on the model complexity of neural networks from two aspects: model structures and model parameters.

    2.1 Model Structures

    Model structure can have a strong influence on model complexity; relevant aspects include layer width, network depth, and layer type.

    The power of layer width in shallow neural networks was investigated decades ago [1, 7, 19, 26]. Hornik et al. [19] proposed the universal approximation theorem, which states that a single-layer feedforward network with a finite number of neurons can approximate any continuous function under mild assumptions. Later studies [1, 7, 26] further strengthened this theorem. However, even granting the universal approximation theorem, the required layer width can be exponentially large. Lu et al. [25] extend the universal approximation theorem to deep networks with bounded layer width (a toy width experiment is sketched below).
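The width effect discussed above can be seen in a small experiment. The following sketch is our own illustration, not taken from the paper or from [19, 25]: a single tanh hidden layer with random input weights is fit to sin(x) by least squares on the output weights, and the sup-norm error typically shrinks as the layer widens.

```python
# Minimal sketch: approximation error of a one-hidden-layer tanh network
# (random features, least-squares readout) as a function of layer width.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-np.pi, np.pi, 400)
y = np.sin(x)                                     # target continuous function

def fit_error(width):
    w = rng.normal(scale=2.0, size=width)         # random input weights
    b = rng.uniform(-np.pi, np.pi, size=width)    # random biases
    H = np.tanh(np.outer(x, w) + b)               # hidden activations (400, width)
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)  # fit output weights
    return np.max(np.abs(H @ coef - y))           # sup-norm approximation error

for width in (4, 16, 64):
    print(width, fit_error(width))
```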
Funding
  • Xia Hu and Jian Pei’s research is supported in part by the NSERC Discovery Grant program
Reference
  • [1] Andrew R Barron. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39, 3 (1993), 930–945.
  • [2] Yoshua Bengio and Olivier Delalleau. 2011. On the expressive power of deep architectures. In International Conference on Algorithmic Learning Theory. Springer, 18–36.
  • [3] Monica Bianchini and Franco Scarselli. 2014. On the complexity of shallow and deep neural network classifiers. In ESANN.
  • [4] Christopher M Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
  • [5] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, et al. 2018. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • [6] Nadav Cohen, Or Sharir, and Amnon Shashua. 2016. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory. 698–728.
  • [7] George Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 4 (1989), 303–314.
  • [8] Olivier Delalleau and Yoshua Bengio. 2011. Shallow vs. deep sum-product networks. In Advances in NIPS. 666–674.
  • [9] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2015. Deep learning for event-driven stock prediction. In Proceedings of the 24th IJCAI.
  • [10] Simon S Du and Jason D Lee. 2018. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206 (2018).
  • [11] Ronen Eldan and Ohad Shamir. 2016. The power of depth for feedforward neural networks. In Conference on Learning Theory. 907–940.
  • [12] Brendan J Frey and Geoffrey E Hinton. 1999. Variational learning in nonlinear Gaussian belief networks. Neural Computation 11, 1 (1999), 193–213.
  • [13] Hongyang Gao and Shuiwang Ji. 2019. Graph representation learning via hard and channel-wise attention networks. In Proceedings of the 25th ACM SIGKDD. 741–749.
  • [14] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th AISTATS. 249–256.
  • [15] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
  • [16] Douglas M Hawkins. 2004. The problem of overfitting. Journal of Chemical Information and Computer Sciences 44, 1 (2004), 1–12.
  • [17] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. 2018. On the selection of initialization and activation function for deep neural networks. arXiv preprint arXiv:1805.08266 (2018).
  • [18] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. stat 1050 (2015), 9.
  • [19] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989), 359–366.
  • [20] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
  • [21] Barry L Kalman and Stan C Kwasny. 1992. Why tanh: choosing a sigmoidal function. In [Proceedings 1992] IJCNN, Vol. 4. IEEE, 578–581.
  • [22] Joe Kilian and Hava T Siegelmann. 1993. On the power of sigmoid neural networks. In Proceedings of the 6th Annual Conference on Computational Learning Theory. 137–143.
  • [23] Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto.
  • [24] Yann LeCun, Léon Bottou, Yoshua Bengio, et al. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
  • [25] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. 2017. The expressive power of neural networks: A view from the width. In Advances in NIPS. 6231–6239.
  • [26] Wolfgang Maass, Georg Schnitger, and Eduardo D Sontag. 1994. A comparison of the computational power of sigmoid and Boolean threshold circuits. In Theoretical Advances in Neural Computation and Learning. Springer, 127–151.
  • [27] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In Advances in NIPS. 2924–2932.
  • [28] Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th ICML. 807–814.
  • [29] Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2018. Sensitivity and Generalization in Neural Networks: an Empirical Study. In ICLR.
  • [30] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. 2018. Activation functions: Comparison of trends in practice and research for deep learning. arXiv preprint arXiv:1811.03378 (2018).
  • [31] Razvan Pascanu, Guido Montufar, and Yoshua Bengio. 2013. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098 (2013).
  • [32] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. 2016. Exponential expressivity in deep neural networks through transient chaos. In Advances in NIPS. 3360–3368.
  • [33] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. 2017. On the expressive power of deep neural networks. In Proceedings of the 34th ICML, Volume 70. JMLR, 2847–2854.
  • [34] Bernard W Silverman. 2018. Density Estimation for Statistics and Data Analysis. Routledge.
  • [35] Wei-Hung Weng, Yu-An Chung, and Peter Szolovits. 2019. Unsupervised Clinical Language Translation. In Proceedings of the 25th ACM SIGKDD.