Maximum-and-Concatenation Networks
ICML 2020.
Abstract:
While successful in many fields, deep neural networks (DNNs) still suffer from some open problems such as bad local minima and unsatisfactory generalization performance. In this work, we propose a novel architecture called Maximum-and-Concatenation Networks (MCN) to try eliminating bad local minima and improving generalization ability [...]
Introduction
- Deep neural networks (DNNs) have been showing superior performance in various fields such as computer vision, speech recognition, natural language processing, and so on.
- Some recent theories (Zhang et al., 2017; Wei & Ma, 2019; Cao & Gu, 2019; Li & Liang, 2018; Allen-Zhu et al., 2018; Arora et al., 2019c) have revealed that, whenever a local minimum yields only a small training error, DNNs are likely to generalize well at that local minimum.
- While impressive, existing studies are still unsatisfactory in some aspects:
Highlights
- Deep neural networks (DNNs) have been showing superior performance in various fields such as computer vision, speech recognition, natural language processing, and so on
- Unlike the previous analyses in (Liang et al., 2018a;b; Kawaguchi & Kaelbling, 2019), which focus on eliminating bad local minima but ignore generalization performance, we provide a rigorous analysis that guarantees the generalization ability of Maximum-and-Concatenation Networks (MCN) under certain conditions (Theorem 3 and Corollary 3.1)
- We propose a novel multi-layer DNN structure termed Maximum-and-Concatenation Networks (MCN), which can approximate a certain class of continuous functions arbitrarily well even with highly sparse connections
- We prove that the global minima of an l-layer MCN can be attained, or even outperformed, by increasing the network depth
- MCN can be appended to many existing DNNs, and the augmented DNN shares the same properties as MCN (a hedged sketch of such an augmentation follows this list)
- We analyze the generalization ability of MCN and reveal that depth is more important than width for generalization; this supports the mechanism of deep learning
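As a concrete illustration of the "append to an existing DNN" highlight above, here is a minimal, hypothetical PyTorch sketch that wraps an arbitrary feature extractor with an MCN-style block before the classifier. The names `MCNBlock` and `append_mcn` and the block's internals (element-wise max of a nonlinear and a linear branch, followed by concatenation) are assumptions based only on the network's name, not the paper's exact formulation.

```python
# Hypothetical sketch: appending an MCN-style block to an existing DNN backbone.
import torch
import torch.nn as nn


class MCNBlock(nn.Module):
    """Assumed structure: element-wise max of a nonlinear and a linear branch,
    then concatenation with the incoming features."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.nonlinear = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        m = torch.maximum(self.nonlinear(x), self.linear(x))  # "maximum"
        return torch.cat([m, x], dim=-1)                      # "concatenation"


def append_mcn(backbone, feat_dim, hidden_dim, num_classes):
    """Augment any existing feature extractor with an MCN-style head."""
    return nn.Sequential(
        backbone,                                   # existing DNN, unchanged
        MCNBlock(feat_dim, hidden_dim),             # appended MCN-style block
        nn.Linear(feat_dim + hidden_dim, num_classes),
    )


# Usage with a toy backbone standing in for any existing DNN.
toy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
model = append_mcn(toy_backbone, feat_dim=256, hidden_dim=128, num_classes=10)
logits = model(torch.randn(8, 3, 32, 32))  # -> shape (8, 10)
```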
Methods
- The authors first construct a baseline network with 6 weighted layers, including five convolutional layers and one fully-connected layer.
- The authors add convolutional layers to make the network deeper.
- The network contains five max-pooling layers in total.
- For the MCN variant, the authors replace the convolutional layers after the third max-pooling layer with the MCN block (a hedged sketch of the two architectures follows this list).
- To make a fair comparison, both networks have the same number of layers and parameters and share the same random seed and learning rate.
- For detailed experimental settings and model configurations, please refer to the supplementary material.
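To make the setup concrete, below is a minimal, hypothetical PyTorch sketch of the kind of comparison described above: a small convolutional baseline with five max-pooling layers, and an MCN variant in which the convolutional layers after the third pooling layer are replaced by MCN-style blocks. All channel widths, kernel sizes, and the internals of `MCNBlock` are illustrative assumptions; the authors' actual configurations are given in their supplementary material.

```python
# Hypothetical sketch (not the authors' exact configuration): a small CNN
# baseline and an MCN-augmented variant of comparable depth, for 32x32 inputs.
import torch
import torch.nn as nn


class MCNBlock(nn.Module):
    """Assumed 'maximum-and-concatenation' block on feature maps: element-wise
    max of a nonlinear and a linear branch, concatenated with the input."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.nonlinear = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU()
        )
        self.linear = nn.Conv2d(in_ch, out_ch, 1)  # linear skip branch

    def forward(self, x):
        m = torch.maximum(self.nonlinear(x), self.linear(x))  # "maximum"
        return torch.cat([m, x], dim=1)                       # "concatenation"


def make_baseline(num_classes=10):
    # Baseline: five conv layers + one fully-connected layer,
    # with five max-pooling layers interleaved.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(128, num_classes),
    )


def make_mcn_variant(num_classes=10):
    # MCN variant: the conv layers after the third max pooling are replaced
    # by MCN-style blocks; depth is kept comparable to the baseline.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        MCNBlock(128, 128), nn.MaxPool2d(2),   # 128 + 128 = 256 channels out
        MCNBlock(256, 128), nn.MaxPool2d(2),   # 128 + 256 = 384 channels out
        nn.Flatten(), nn.Linear(384, num_classes),
    )
```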
Results
- The authors present the main results of the paper, including several theorems regarding optimality, fitting ability, and generalization performance.
- All the detailed proofs of these theorems are provided in the supplementary material.
- First note that an (l + 1)-layer MCN is obtained by adding one layer to the network consisting of its first l layers, i.e., θ_{l+1} = {θ_l, θ(L_{l+1}, W_{l+1}, A_{l+1}, A_{l+1})} (a hedged sketch of this construction follows this list).
- Theorem 1 (Effects of Depth).
- Let the activation function γ(·) be the element-wise exp(·).
- Suppose that the loss function ℓ(·) in Eq. (3) is differentiable and convex.
- Denote by θ_{l+1} any local minimum of an (l + 1)-layer MCN.
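To make the "adding one layer" construction behind Theorem 1 concrete, the following is a minimal, hypothetical PyTorch sketch in which the parameters of an (l + 1)-layer model are exactly those of the l-layer model plus one new layer, mirroring θ_{l+1} = {θ_l, ...}. The internal structure of `MCNLayer` and the roles of its weight matrices are illustrative assumptions; only the use of the element-wise exp(·) activation follows the theorem's stated setting.

```python
# Hypothetical sketch: growing an MCN-style stack by one layer while reusing
# the first l layers unchanged, so the deeper model can reproduce any output
# of the shallower one (the intuition behind "effects of depth").
import copy
import torch
import torch.nn as nn


class MCNLayer(nn.Module):
    """Assumed MCN-style layer: element-wise max of a linear branch and a
    nonlinear branch with gamma(.) = exp(.) (the activation in Theorem 1),
    followed by concatenation with the incoming features."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)  # weights of the nonlinear branch
        self.A = nn.Linear(in_dim, out_dim)  # weights of the linear branch

    def forward(self, x):
        m = torch.maximum(self.A(x), torch.exp(self.W(x)))
        return torch.cat([m, x], dim=-1)


def grow_by_one_layer(theta_l, new_in_dim, new_out_dim):
    """theta_{l+1} = {theta_l, (parameters of one new layer)}: keep the first
    l layers unchanged and append a freshly parameterized (l+1)-th layer."""
    theta_lp1 = copy.deepcopy(theta_l)
    theta_lp1.append(MCNLayer(new_in_dim, new_out_dim))
    return theta_lp1


# Usage: grow a 2-layer stack to 3 layers; the first two layers keep their
# parameters, so the deeper model can do at least as well on the training set.
shallow = nn.ModuleList([MCNLayer(16, 16), MCNLayer(32, 16)])
deep = grow_by_one_layer(shallow, new_in_dim=48, new_out_dim=16)
h = torch.randn(4, 16)
for layer in deep:
    h = layer(h)
print(h.shape)  # torch.Size([4, 64])
```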
Conclusion
- The authors propose a novel multi-layer DNN structure termed MCN, which can approximate a certain class of continuous functions arbitrarily well even with highly sparse connections.
- The authors prove that the global minima of an l-layer MCN can be attained, or even outperformed, by increasing the network depth.
- The authors analyze the generalization ability of MCN and reveal that depth is more important than width for generalization; this supports the mechanism of deep learning.
- This study takes a step towards the ultimate goal of deep learning theory: to understand why DNNs can work well in a wide variety of applications.
Tables
- Table 1: Training error (Err.) and test accuracy (Acc.) of different models on the CIFAR-10 dataset. We denote by C the two added convolutional layers and by M the appended MCN blocks.
- Table 2: Training error (Err.) and test accuracy (Acc.) of different models on the CIFAR-100 dataset. We denote by C the two added convolutional layers and by M the appended MCN blocks.
Funding
- This work is supported in part by the New Generation AI Major Project of the Ministry of Science and Technology of China (grant no. 2018AAA0102501), in part by NSF China (grant nos. 61625301 and 61731018), in part by the Major Scientific Research Project of Zhejiang Lab (grant nos. 2019KB0AC01 and 2019KB0AB02), in part by the Fundamental Research Funds of Shandong University, in part by the Beijing Academy of Artificial Intelligence, in part by Qualcomm, and in part by the SenseTime Research Fund.
References
- Adcock, B. Multivariate modified Fourier series and application to boundary value problems. Numerische Mathematik, 115(4):511–552, 2010.
- Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
- Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252, 2019.
- Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. In International Conference on Machine Learning, pp. 146–155, 2017.
- Arora, S., Cohen, N., Golowich, N., and Hu, W. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.
- Arora, S., Cohen, N., Hu, W., and Luo, Y. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pp. 7411–7422, 2019a.
- Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019b.
- Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019c.
- Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
- Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018a.
- Belkin, M., Hsu, D. J., and Mitra, P. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems, pp. 2300–2311, 2018b.
- Belkin, M., Rakhlin, A., and Tsybakov, A. B. Does data interpolation contradict statistical optimality? In International Conference on Artificial Intelligence and Statistics, pp. 1611–1619, 2019.
- Cao, Y. and Gu, Q. A generalization theory of gradient descent for learning over-parameterized deep relu networks. arXiv preprint arXiv:1902.01384, 2019.
- Dou, X. and Liang, T. Training neural networks as learning data-adaptive kernels: Provable representation and approximation benefits. arXiv preprint arXiv:1901.07114, 2019.
- Du, S. S., Wang, Y., Zhai, X., Balakrishnan, S., Salakhutdinov, R. R., and Singh, A. How many samples are needed to estimate a convolutional neural network? In Advances in Neural Information Processing Systems, pp. 373–383, 2018.
- Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, 2019a.
- Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019b.
- E, W., Ma, C., Wu, L., et al. On the generalization properties of minimum-norm solutions for over-parameterized neural network models. arXiv preprint arXiv:1912.06987, 2019.
- Gasca, M. and Sauer, T. Polynomial interpolation in several variables. Advances in Computational Mathematics, 12 (4):377, 2000.
- Giné, E. and Nickl, R. Mathematical foundations of infinite-dimensional statistical models, volume 40. Cambridge University Press, 2016.
- Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. In International Conference on Machine Learning, pp. 1319–1327, 2013.
- Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
- Hardt, M. and Ma, T. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- Hoffer, E., Hubara, I., and Soudry, D. Fix your classifier: the marginal value of training the last weight layer. arXiv preprint arXiv:1801.04540, 2018.
- Horn, R. A. and Johnson, C. R. Topics in matrix analysis. Cambridge University Press, 1991.
- Huybrechs, D., Iserles, A., et al. From high oscillation to rapid approximation IV: Accelerating convergence. IMA Journal of Numerical Analysis, 31(2):442–468, 2011.
- Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Kawaguchi, K. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.
- Kawaguchi, K. and Kaelbling, L. P. Elimination of all bad local minima in deep learning. arXiv preprint arXiv:1901.00279, 2019.
- Kawaguchi, K., Huang, J., and Kaelbling, L. P. Effect of depth and width on local minima in deep learning. Neural Computation, 2019.
- Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157–8166, 2018.
- Liang, S., Sun, R., Lee, J. D., and Srikant, R. Adding one neuron can eliminate all bad local minima. In Advances in Neural Information Processing Systems, pp. 4350–4360, 2018a.
- Liang, S., Sun, R., Li, Y., and Srikant, R. Understanding the loss surface of neural networks for binary classification. arXiv preprint arXiv:1803.00909, 2018b.
- Liang, S., Sun, R., and Srikant, R. Revisiting landscape analysis in deep neural networks: Eliminating decreasing paths to infinity. arXiv preprint arXiv:1912.13472, 2019.
- Liang, T., Rakhlin, A., and Zhai, X. On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. arXiv preprint arXiv:1908.10292 [cs, math, stat], 2020.
- Liu, J., Chen, X., Wang, Z., and Yin, W. ALISTA: Analytic weights are as good as learned weights in LISTA. In International Conference on Learning Representations, 2019.
- Lu, J., Shen, Z., Yang, H., and Zhang, S. Deep network approximation for smooth functions. arXiv preprint arXiv:2001.03040, 2020.
- Luxburg, U. v. and Bousquet, O. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5(Jun):669–695, 2004.
- Ma, C., Wu, L., et al. A priori estimates of the generalization error for two-layer neural networks. arXiv preprint arXiv:1810.06397, 2018.
- Maillard, O. and Munos, R. Compressed least-squares regression. In Advances in Neural Information Processing Systems, pp. 1213–1221, 2009.
- Olver, S. On the convergence rate of a modified fourier series. Mathematics of Computation, 78(267):1629–1645, 2009.
- Rakhlin, A., Sridharan, K., Tsybakov, A. B., et al. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23 (2):789–824, 2017.
- Rumelhart, D. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
- Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. Annals of Statistics, 2019.
- Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
- Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Sohl-Dickstein, J. and Kawaguchi, K. Eliminating all bad local minima from loss landscapes without even adding an extra unit. arXiv preprint arXiv:1901.03909, 2019.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Telgarsky, M. Benefits of depth in neural networks. In Conference on Learning Theory, pp. 1517–1539, 2016.
- Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
- Wei, C. and Ma, T. Data-dependent sample complexity of deep neural networks via lipschitz augmentation. arXiv preprint arXiv:1905.03684, 2019.
- Wu, Y. and He, K. Group normalization. In European Conference on Computer Vision, pp. 3–19, 2018.
- Xie, B., Liang, Y., and Song, L. Diverse neural network learns true target functions. In International Conference on Artificial Intelligence and Statistics, pp. 1216–1224, 2017.
- Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5987–5995, 2017.
- Xie, X., Wu, J., Zhong, Z., Liu, G., and Lin, Z. Differentiable linearized ADMM. In International Conference on Machine Learning, 2019.
- Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
- Yarotsky, D. Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pp. 639–649, 2018.
- Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Machine Learning, 2017.
- Zhang, X., Ling, C., and Qi, L. The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM Journal on Matrix Analysis and Applications, 33(3):806–821, 2012.