# Measuring Model Complexity of Neural Networks with Curve Activation Functions

KDD '20: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, July 2020, pp. 1521–1531.

Abstract:

It is fundamental to measure the model complexity of deep neural networks. A good model complexity measure can help tackle many challenging problems, such as overfitting detection, model selection, and performance improvement. The existing literature on model complexity mainly focuses on neural networks with piecewise linear activation functions…


Introduction

- Deep neural networks have gained great popularity in tackling various real-world applications, such as machine translation [35], speech recognition [5] and computer vision [13].
- The influences of model structure on complexity have been investigated, including layer width, network depth, and layer type.
- With the exploration of deep network structures, some recent studies pay attention to the effectiveness of deep architectures in increasing model complexity, known as depth efficiency [2, 6, 11, 25].
- Bounds on the model complexity of specific model structures have been derived, from sum-product networks [8] to piecewise linear neural networks [27, 31]

Highlights

- Deep neural networks have gained great popularity in tackling various real-world applications, such as machine translation [35], speech recognition [5] and computer vision [13]
- The recent progress in model complexity measure directly facilitates the advances of many directions of deep neural networks, such as model architecture design, model selection, performance improvement [17], and overfitting detection [16]
- We develop a complexity measure for deep fully-connected neural networks with curve activation functions
- We provide an upper bound on the number of linear regions formed by the linear approximation neural network (LANN), and define the complexity measure based on this upper bound
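The complexity measure in the highlights above counts linear regions. As a standalone illustration of that idea (not the paper's LANN construction), the sketch below empirically lower-bounds the number of linear regions a small random ReLU network cuts along a 1-D input line by counting distinct activation patterns; the architecture, weights, and sampling range are all arbitrary choices for the example:

```python
import random

random.seed(0)

# Tiny fully-connected ReLU net, 1 -> 8 -> 8, with arbitrary random weights.
W1 = [random.gauss(0, 1) for _ in range(8)]
b1 = [random.gauss(0, 1) for _ in range(8)]
W2 = [[random.gauss(0, 1) for _ in range(8)] for _ in range(8)]
b2 = [random.gauss(0, 1) for _ in range(8)]

def activation_pattern(x):
    h1 = [max(0.0, W1[i] * x + b1[i]) for i in range(8)]
    h2 = [max(0.0, sum(W2[i][j] * h1[j] for j in range(8)) + b2[i])
          for i in range(8)]
    # Each distinct on/off pattern of the ReLUs corresponds to one
    # linear piece of the network function along this input line.
    return tuple(v > 0 for v in h1) + tuple(v > 0 for v in h2)

# Dense sampling of the line [-5, 5]; the number of distinct patterns is
# an empirical lower bound on the number of linear regions crossed.
patterns = {activation_pattern(k / 100.0) for k in range(-500, 501)}
print(len(patterns))
```

For curve activation functions such as tanh there are no exact linear regions, which is why the paper approximates the network piecewise linearly before counting.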

Conclusion

- The authors develop a complexity measure for deep neural networks with curve activation functions.
- After providing an upper bound on the number of linear regions formed by LANNs, the authors define the complexity measure based on this upper bound.
- In view of the complexity measure, further analysis revealed that L1 and L2 regularization suppress the increase of model complexity.
- Based on this discovery, the authors proposed two approaches to prevent overfitting by directly constraining model complexity: neuron pruning and customized L1 regularization.
- There are several future directions, including generalizing the proposed linear approximation neural network to other network architectures (e.g., CNNs and RNNs)
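As a toy illustration of how an L1 penalty constrains effective complexity (a generic subgradient sketch, not the paper's customized L1 regularization), the following fits a two-weight linear model where only the first feature matters; the L1 term drives the irrelevant weight toward zero. The data, learning rate, and penalty strength are all assumed values for the example:

```python
import random

random.seed(1)

# Toy data: the target depends only on x1; x2 is an irrelevant feature.
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
samples = [(x1, x2, 3.0 * x1) for x1, x2 in data]

w = [0.0, 0.0]
lr, lam = 0.05, 0.1  # learning rate and L1 strength (assumed values)
for _ in range(500):
    g = [0.0, 0.0]
    for x1, x2, y in samples:
        err = w[0] * x1 + w[1] * x2 - y
        g[0] += err * x1 / len(samples)
        g[1] += err * x2 / len(samples)
    # Subgradient of the L1 penalty shrinks weights toward zero,
    # keeping unneeded ones pinned near it.
    w = [wi - lr * (gi + lam * (1 if wi > 0 else -1 if wi < 0 else 0))
         for wi, gi in zip(w, g)]

print(w[0], abs(w[1]))  # weight on the relevant vs. irrelevant feature
```

The relevant weight converges close to its true value (slightly shrunk by the penalty), while the irrelevant weight hovers near zero: fewer effective parameters, hence lower model complexity.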

Summary


- Table 1: Model structure of DNNs in our experiments
- Table 2: Complexity measure and number of linear regions on MOON
- Table 3: Comparison of approximation errors on the training and test datasets

Related work

- The study of model complexity dates back several decades. In this section, we review related work on the model complexity of neural networks from two aspects: model structures and model parameters.

2.1 Model Structures

Model structures may have strong influence on model complexity, such as width, layer depth, and layer type.

The power of layer width in shallow neural networks was investigated decades ago [1, 7, 19, 26]. Hornik et al. [19] propose the universal approximation theorem, which states that a single-hidden-layer feedforward network with a finite number of neurons can approximate any continuous function under mild assumptions. Later studies [1, 7, 26] further strengthen this theorem. However, even with the universal approximation theorem, the required layer width can be exponentially large. Lu et al. [25] extend the universal approximation theorem to deep networks with bounded layer width.

Funding

- Xia Hu and Jian Pei’s research is supported in part by the NSERC Discovery Grant program

References

- Andrew R Barron. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory 39, 3 (1993), 930–945.
- Yoshua Bengio and Olivier Delalleau. 2011. On the expressive power of deep architectures. In International Conference on Algorithmic Learning Theory. Springer, 18–36.
- Monica Bianchini and Franco Scarselli. 2014. On the complexity of shallow and deep neural network classifiers.. In ESANN.
- Christopher M Bishop. 2006. Pattern recognition and machine learning. springer.
- Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, et al. 2018. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Nadav Cohen, Or Sharir, and Amnon Shashua. 2016. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory. 698–728.
- George Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 2, 4 (1989), 303–314.
- Olivier Delalleau and Yoshua Bengio. 2011. Shallow vs. deep sum-product networks. In Advances in NIPS. 666–674.
- Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2015. Deep learning for event-driven stock prediction. In Proceeding of the 24th IJCAI.
- Simon S Du and Jason D Lee. 2018. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206 (2018).
- Ronen Eldan and Ohad Shamir. 2016. The power of depth for feedforward neural networks. In Conference on learning theory. 907–940.
- Brendan J Frey and Geoffrey E Hinton. 1999. Variational learning in nonlinear Gaussian belief networks. Neural Computation 11, 1 (1999), 193–213.
- Hongyang Gao and Shuiwang Ji. 2019. Graph representation learning via hard and channel-wise attention networks. In Proceedings of the 25th ACM SIGKDD. 741–749.
- Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th AISTATS. 249–256.
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT press.
- Douglas M Hawkins. 2004. The problem of overfitting. Journal of chemical information and computer sciences 44, 1 (2004), 1–12.
- Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. 2018. On the selection of initialization and activation function for deep neural networks. arXiv preprint arXiv:1805.08266 (2018).
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. stat 1050 (2015), 9.
- Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural networks 2, 5 (1989), 359–366.
- Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
- Barry L Kalman and Stan C Kwasny. 1992. Why tanh: choosing a sigmoidal function. In [Proceedings 1992] IJCNN, Vol. 4. IEEE, 578–581.
- Joe Kilian and Hava T Siegelmann. 1993. On the power of sigmoid neural networks. In Proceedings of the 6th annual conference on Computational learning theory. 137–143.
- Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto (2009).
- Yann LeCun, Léon Bottou, Yoshua Bengio, et al. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
- Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. 2017. The expressive power of neural networks: A view from the width. In Advances in NIPS. 6231–6239.
- Wolfgang Maass, Georg Schnitger, and Eduardo D Sontag. 1994. A comparison of the computational power of sigmoid and Boolean threshold circuits. In Theoretical Advances in Neural Computation and Learning. Springer, 127–151.
- Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In Advances in NIPS. 2924–2932.
- Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th ICML. 807–814.
- Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2018. Sensitivity and Generalization in Neural Networks: an Empirical Study. In ICLR.
- Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. 2018. Activation functions: Comparison of trends in practice and research for deep learning. arXiv preprint arXiv:1811.03378 (2018).
- Razvan Pascanu, Guido Montufar, and Yoshua Bengio. 2013. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098 (2013).
- Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. 2016. Exponential expressivity in deep neural networks through transient chaos. In Advances in NIPS. 3360–3368.
- Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. 2017. On the expressive power of deep neural networks. In Proceedings of the 34th ICML-Volume 70. JMLR, 2847–2854.
- [34] Bernard W Silverman. 2018. Density estimation for statistics and data analysis. Routledge.
- [35] Wei-Hung Weng, Yu-An Chung, and Peter Szolovits. 2019. Unsupervised Clinical Language Translation. Proceedings of the 25th ACM SIGKDD (2019).
