# dS^2LBI: Exploring Structural Sparsity on Deep Network via Differential Inclusion Paths

International Conference on Machine Learning (ICML), 2020.

Abstract:

Over-parameterization is ubiquitous nowadays in training neural networks, benefiting both optimization, in seeking global optima, and generalization, in reducing prediction error. However, compressive networks are desired in many real-world applications, and direct training of small networks may be trapped in local optima. In this paper, inste...

Introduction

- The expressive power of deep neural networks comes from their millions of parameters, which are optimized by Stochastic Gradient Descent (SGD) (Bottou, 2010) and variants such as Adam (Kingma & Ba, 2015).
- Although ℓ1 regularization has been applied to deep learning to enforce sparsity on the weights and obtain compact, memory-efficient networks, it sacrifices some prediction performance (Collins & Kohli, 2014).
- This is because the weights learned in neural networks are highly correlated, and ℓ1 regularization on such weights violates the incoherence or irrepresentable conditions needed for sparse model selection (Donoho & Huo, 2001; Tropp, 2004; Zhao & Yu, 2006), leading to spurious selections with poor generalization.
- Group sparsity regularization (Yuan & Lin, 2006) has been applied to neural networks, for example to find the optimal number of neuron groups (Alvarez & Salzmann, 2016) and to obtain good data locality with structured sparsity (Wen et al., 2016; Yoon & Hwang, 2017).

Highlights

- The expressive power of deep neural networks comes from their millions of parameters, which are optimized by Stochastic Gradient Descent (SGD) (Bottou, 2010) and variants such as Adam (Kingma & Ba, 2015).
- The weights learned in neural networks are highly correlated, and ℓ1 regularization on such weights violates the incoherence or irrepresentable conditions needed for sparse model selection (Donoho & Huo, 2001; Tropp, 2004; Zhao & Yu, 2006), leading to spurious selections with poor generalization.
- Existing work on SplitLBI is restricted to convex problems in generalized linear models. It remains unknown whether the algorithm can exploit structural sparsity in highly non-convex deep networks. To fill this gap, the paper proposes the deep Structural Splitting Linearized Bregman Iteration, which simultaneously explores over-parameterized networks and the structural sparsity of the weights of their fully connected and convolutional layers. This generates an iterative solution path of deep models whose important sparse architectures are unveiled by early stopping.
- This paper presents a novel algorithm, DessiLBI, for exploring the structural sparsity of deep networks.
- Extensive experiments reveal the effectiveness of the algorithm in training over-parameterized models and in exploring effective sparse architectures of deep models.

Methods

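- Supervised learning learns a map $\Phi_W : \mathcal{X} \to \mathcal{Y}$ from the input space $\mathcal{X}$ to the output space $\mathcal{Y}$, with a parameter $W$ such as the weights of a neural network, by minimizing a loss function on the training samples: $\widehat{L}_n(W) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \Phi_W(x_i))$.
- An $l$-layer neural network is defined as $\Phi_W(x) = \sigma_l\left(W^l \sigma_{l-1}\left(W^{l-1} \cdots \sigma_1\left(W^1 x\right)\right)\right)$, where $W = \{W^i\}_{i=1}^{l}$ and $\sigma_i$ is the nonlinear activation function of the $i$-th layer.
- $\dot{W}_t / \kappa = -\nabla_W \bar{L}(W_t, \Gamma_t)$ (3a)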
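To make Eq. (3a) more concrete, the following is a minimal sketch of one way to discretize it with a forward Euler step of size `alpha`. It assumes that the augmented loss has the form $\bar{L}(W, \Gamma) = \widehat{L}_n(W) + \frac{1}{2\nu}\|W - \Gamma\|^2$, which is not stated in this summary, and it holds $\Gamma$ fixed because the remaining equations of the system are not reproduced here. The function name `euler_step_w` and the toy quadratic loss are hypothetical and chosen only for illustration.

```python
def euler_step_w(w, gamma, grad_loss, kappa, alpha, nu):
    """One forward Euler step of Eq. (3a): W <- W - kappa * alpha * grad_W Lbar.

    Assumption: Lbar(W, Gamma) = L_n(W) + ||W - Gamma||^2 / (2 * nu).
    """
    g = grad_loss(w)
    return [
        wi - kappa * alpha * (gi + (wi - ci) / nu)
        for wi, gi, ci in zip(w, g, gamma)
    ]


# Toy loss L_n(W) = 0.5 * ||W - target||^2, so grad L_n(W) = W - target.
target = [1.0, -2.0]


def grad_loss(w):
    return [wi - ti for wi, ti in zip(w, target)]


w = [0.0, 0.0]
gamma = [0.0, 0.0]  # held fixed in this sketch
for _ in range(2000):
    w = euler_step_w(w, gamma, grad_loss, kappa=1.0, alpha=0.1, nu=100.0)
```

With $\Gamma$ fixed at zero, the iterates settle at the minimizer of the assumed augmented loss, `target * nu / (nu + 1)`, which shows how a large $\nu$ only weakly couples $W$ to $\Gamma$.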

Conclusion

- This paper presents a novel algorithm, DessiLBI, for exploring the structural sparsity of deep networks.
- It is derived from differential inclusions of inverse scale space, with proven global convergence to KKT points from arbitrary initializations.
- Extensive experiments reveal the effectiveness of the algorithm in training over-parameterized models and in exploring effective sparse architectures of deep models.


- Table 1: Top-1/Top-5 accuracy (%) on ImageNet-2012. Marked results are from the official PyTorch website. We use the official PyTorch code to run the competitors. For more results on MNIST/Cifar-10, please refer to Tab. 2 in the supplementary.
- Table 2: Top-1/Top-5 accuracy (%) on ImageNet-2012 and test accuracy on MNIST/Cifar-10. Marked results are from the official PyTorch website. We use the official PyTorch code to run the competitors. All models are trained for 100 epochs. We ran all experiments in this table ourselves, except for SGD Mom-Wd on ImageNet, which is reported at https://pytorch.org/docs/stable/torchvision/models.html
- Table 3: Results for different κ; all results are the best test accuracy. We test two widely used models, VGG16 and ResNet56, on Cifar10, keeping ν = 100. Full means that the trained model weights are used directly; Sparse means that the model weights are combined with the mask generated by the support of Γ. The Sparse result involves no fine-tuning and is comparable to its Full counterpart. For this experiment, we suggest that κ = 1 is a good choice. All models are trained for 160 epochs with an initial learning rate (lr) of 0.1, decreased by a factor of 0.1 at epochs 80 and 120.
- Table 4: Sparsity rate and validation accuracy for different κ at different epochs. We report the test accuracy at the specified epoch, keeping ν = 100. Epochs 20, 40, 80 and 160 are chosen to show the growth of sparsity and of sparse model accuracy. Sparsity is defined in Sec. 5, and Acc means the test accuracy of the sparse model. A sparse model is the model at the designated epoch t combined with the mask given by the support of Γt.
- Table 5: Results for different ν; all results are the best test accuracy. We test two widely used models, VGG16 and ResNet56, on Cifar10, keeping κ = 1. Full means that the trained model weights are used directly; Sparse means that the model weights are combined with the mask generated by the support of Γ. The Sparse result involves no fine-tuning and is comparable to its Full counterpart. All models are trained for 160 epochs with an initial learning rate (lr) of 0.1, decreased by a factor of 0.1 at epochs 80 and 120.
- Table 6: Sparsity rate and validation accuracy for different ν at different epochs. We report the test accuracy at the specified epoch, keeping κ = 1. Epochs 20, 40, 80 and 160 are chosen to show the growth of sparsity and of sparse model accuracy. Sparsity is defined in Sec. 5 as the percentage of nonzero parameters, and Acc means the test accuracy of the sparse model. A sparse model is the model at the designated epoch t combined with the mask given by the support of Γt.
- Table 7: Computational and memory costs.
- Table 8: Sparsity for every layer of Lenet-3. Sparsity is defined in Sec. 5; number of weights denotes the total number of parameters in the designated layer. Interestingly, Γ tends to put lower sparsity on layers with more parameters.
- Table 9: Sparsity for every layer of Conv-2. Sparsity is defined in Sec. 5; number of weights denotes the total number of parameters in the designated layer. The sparsity is more significant in fully connected (FC) layers than in convolutional layers.
- Table 10: Sparsity for every layer of Conv-4. Sparsity is defined in Sec. 5; number of weights denotes the total number of parameters in the designated layer. Most of the convolutional layers are kept, while the FC layers are very sparse.
- Table 11: Hyperparameter settings for the experiments in Figure 5.
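The captions above describe a Sparse model as the trained weights combined with a mask given by the support of Γ, and define sparsity as the percentage of nonzero parameters. A minimal sketch of that bookkeeping on flat lists of numbers is shown below; the helper names `apply_support_mask` and `sparsity_percent` are hypothetical and not taken from the paper's code.

```python
def apply_support_mask(weights, gamma):
    """Keep a weight only where the matching entry of Gamma is nonzero."""
    return [w if g != 0.0 else 0.0 for w, g in zip(weights, gamma)]


def sparsity_percent(gamma):
    """Percentage of nonzero entries in Gamma."""
    nonzero = sum(1 for g in gamma if g != 0.0)
    return 100.0 * nonzero / len(gamma)


# Toy layer with four weights; Gamma selects the first and last.
weights = [0.5, -1.2, 0.3, 2.0]
gamma = [0.4, 0.0, 0.0, 1.9]

sparse_weights = apply_support_mask(weights, gamma)
rate = sparsity_percent(gamma)
```

Here `sparse_weights` keeps the original values of the selected weights (no fine-tuning), and `rate` is 50.0 because two of the four entries of Γ are nonzero.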

Related work

- Mirror Descent Algorithm (MDA), first proposed by Nemirovski & Yudin (1983) to solve the constrained convex optimization problem $L^\star := \min_{W \in K} L(W)$ (where $K$ is convex and compact), can be understood as a generalized projected gradient descent (Beck & Teboulle, 2003) with respect to the Bregman distance $B_\Omega(u, v) := \Omega(u) - \Omega(v) - \langle \nabla\Omega(v), u - v \rangle$ induced by a convex and differentiable function $\Omega(\cdot)$:

$$Z_{k+1} = Z_k - \alpha \nabla L(W_k) \tag{1a}$$

$$W_{k+1} = \nabla\Omega^\star(Z_{k+1}) \tag{1b}$$

where the conjugate function of $\Omega(\cdot)$ is $\Omega^\star(Z) := \sup_W \langle W, Z \rangle - \Omega(W)$. Equation (1) optimizes $W_{k+1} = \arg\min_z \langle z, \alpha \nabla L(W_k) \rangle + B_\Omega(z, W_k)$ (Nemirovski) in two steps: Eq. (1a) performs gradient descent on $Z$, an element of the dual space with $Z_k = \nabla\Omega(W_k)$; and Eq. (1b) maps it back to the primal space. As the step size $\alpha \to 0$, with time $t = k\alpha$, MDA has the following limit dynamics as an ordinary differential equation (ODE) (Nemirovski & Yudin, 1983):

$$\dot{Z}_t = -\nabla L(W_t) \tag{2a}$$

$$W_t = \nabla\Omega^\star(Z_t) \tag{2b}$$
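As an illustration of the two-step update in Eq. (1), here is a minimal sketch of mirror descent on the probability simplex. It assumes the negative entropy $\Omega(W) = \sum_i W_i \log W_i$, a standard choice that is not used in the source text, for which $\nabla\Omega^\star$ is the softmax map. The function names and the linear toy loss are hypothetical.

```python
import math


def softmax(z):
    """Gradient of the conjugate of negative entropy on the simplex."""
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]


def mirror_descent(grad, z0, alpha, steps):
    """Iterate Eq. (1a) in the dual space and Eq. (1b) back to the primal."""
    z = list(z0)
    w = softmax(z)
    for _ in range(steps):
        g = grad(w)
        z = [zi - alpha * gi for zi, gi in zip(z, g)]  # (1a): dual gradient step
        w = softmax(z)                                 # (1b): map back to primal
    return w


# Toy problem: minimize the linear loss L(W) = <c, W> over the simplex.
c = [3.0, 1.0, 2.0]
w_final = mirror_descent(lambda w: c, z0=[0.0, 0.0, 0.0], alpha=0.5, steps=100)
```

The iterate stays on the simplex without an explicit projection and concentrates its mass on the coordinate with the smallest cost, which is the sense in which MDA generalizes projected gradient descent.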

Convergence analysis with rates has been well studied for convex losses, and has been extended to stochastic versions (Ghadimi & Lan, 2012; Nedic & Lee, 2014) and to Nesterov acceleration schemes (Su et al., 2016; Krichene et al., 2015). For the highly non-convex losses met in deep learning, Azizan et al. (2019) established convergence to global optima for overparameterized networks, provided that (i) the initial point is close enough to the manifold of global optima; and (ii) $\Omega(\cdot)$ is strongly convex and differentiable. For non-differentiable $\Omega$ such as the Elastic Net penalty in compressed sensing and high dimensional statistics ($\Omega(W) = W \ldots$

Funding

- This work was supported in part by NSFC Projects (61977038), Science and Technology Commission of Shanghai Municipality Projects (19511120700, 19ZR1471800), and Shanghai Research and Innovation Functional Program (17DZ2260900)
- Dr. Zeng is supported by the Two Thousand Talents Plan of Jiangxi Province
- The research of Yuan Yao was supported in part by Hong Kong Research Grant Council (HKRGC) grant 16303817, ITF UIM/390, as well as awards from Tencent AI Lab, Si Family Foundation, and Microsoft Research-Asia

Reference

- Abbasi-Asl, R. and Yu, B. Structural compression of convolutional neural networks based on greedy filter pruning. arXiv preprint arXiv:1705.07356, 2017. 1, 5.2
- Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. 2018. arXiv:1811.03961
- Alvarez, J. M. and Salzmann, M. Learning the number of neurons in deep networks. In NIPS, 2016. 1
- Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018. 1
- Attouch, H., Bolte, J., and Svaiter, B. F. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137:91–129, 2013. 1, A.1, A.3
- Azizan, N., Lale, S., and Hassibi, B. Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv preprint arXiv:1906.03830, 2019. 2
- Bartlett, P., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. In The 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA. 2017
- Bartlett, P. L. For valid generalization the size of the weights is more important than the size of the network. In Mozer, M. C., Jordan, M. I., and Petsche, T. (eds.), Advances in Neural Information Processing Systems 9, pp. 134–140. MIT Press, 1997. 1
- Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003. 2
- Benning, M., Betcke, M. M., Ehrhardt, M. J., and Schönlieb, C.-B. Choose your path wisely: gradient descent in a bregman distance framework. arXiv preprint arXiv:1712.04045, 2017. A.3
- Bochnak, J., Coste, M., and Roy, M.-F. Real algebraic geometry, volume 3. Ergeb. Math. Grenzgeb. Springer-Verlag, Berlin, 1998. 3, A.1
- Bolte, J., Daniilidis, A., and Lewis, A. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17:1205–1223, 2007a. A.1, A.1
- Bolte, J., Daniilidis, A., Lewis, A., and Shiota, M. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18:556–572, 2007b. A.1, A.1, A.2, A.2
- Bottou, L. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, 2010. 1
- Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011. 2
- Burger, M., Gilboa, G., Osher, S., and Xu, J. Nonlinear inverse scale space methods. Communications in Mathematical Sciences, 4(1):179–212, 2006. 3
- Cai, J.-F., Osher, S., and Shen, Z. Convergence of the linearized bregman iteration for l1-norm minimization. Mathematics of Computation, 2009. 2
- Collins, M. and Kohli, P. Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442, 2014. 1
- Coste, M. An introduction to o-minimal geometry. RAAG Notes, 81 pages, Institut de Recherche Mathematiques de Rennes, 1999. A.2
- Donoho, D. L. and Huo, X. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862, 2001. 1
- Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. 2018. arXiv:1811.03804. 1
- Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, Technical Report, 1341, 2009. 1
- Franca, G., Robinson, D. P., and Vidal, R. Admm and accelerated admm as continuous dynamical systems. arXiv preprint arXiv:1805.06579, 2018. 2
- Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. International Conference on Learning Representations (ICLR), 2019. arXiv preprint arXiv:1803.03635. 1, 5.4, D
- Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. The lottery ticket hypothesis at scale. arXiv preprint arXiv:1903.01611, 2019. D
- Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. International Conference on Learning Representations (ICLR), 2019. arXiv preprint arXiv:1811.12231. 5.2
- Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012. 2
- Golowich, N., Rakhlin, A., and Shamir, O. Size-independent sample complexity of neural networks. Conference on Learning Theory (COLT), 2018. arXiv preprint arXiv:1712.06541. 1
- Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015. 1, 5, 5.4
- He, B. and Yuan, X. On the o(1/n) convergence rate of the douglas–rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012. 2
- He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015. 5
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016. 5
- Huang, C. and Yao, Y. A unified dynamic approach to sparse model selection. In The 21st International Conference on Artificial Intelligence and Statistics (AISTATS), Lanzarote, Spain, 2018. 2, 4
- Huang, C., Sun, X., Xiong, J., and Yao, Y. Split lbi: An iterative regularization path with structural sparsity. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems (NIPS) 29, pp. 3369–3377. 2016. 2, 3
- Huang, C., Sun, X., Xiong, J., and Yao, Y. Boosting with structural sparsity: A differential inclusion approach. Applied and Computational Harmonic Analysis, 2018. arXiv preprint arXiv:1704.04833. 2
- Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014. 1
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015. 1
- Krantz, S. and Parks, H. R. A primer of real analytic functions. Birkhauser, second edition, 2002. 2, A.2
- Krichene, W., Bayen, A., and Bartlett, P. L. Accelerated mirror descent in continuous and discrete time. In Advances in neural information processing systems, pp. 2845–2853, 2015. 2
- Kurdyka, K. On gradients of functions definable in o-minimal structures. Annales de l’institut Fourier, 48:769–783, 1998. A.1, A.1, A.2, A.2
- Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. In ICLR, 2017. 1
- Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning efficient convolutional networks through network slimming. In ICCV, 2017. 5.4
- Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the value of network pruning. In ICLR, 2019. 1, 5, 5.4
- Łojasiewicz, S. Une propriété topologique des sous-ensembles analytiques réels. In: Les Équations aux dérivées partielles. Éditions du Centre National de la Recherche Scientifique, Paris, pp. 87–89, 1963. 4, A.1
- Łojasiewicz, S. Ensembles semi-analytiques. Institut des Hautes Études Scientifiques, 1965. A.1
- Łojasiewicz, S. Sur la géométrie semi- et sous-analytique. Annales de l’institut Fourier, 43:1575–1595, 1993. A.1
- Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. International Conference on Learning Representations (ICLR), 2019. arXiv preprint arXiv:1711.05101. 1
- Mei, S., Montanari, A., and Nguyen, P.-M. A mean field view of the landscape of two-layers neural network. Proceedings of the National Academy of Sciences (PNAS), 2018. 1
- Mei, S., Misiakiewicz, T., and Montanari, A. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. Conference on Learning Theory (COLT), 2019. 1
- Mordukhovich, B. S. Variational analysis and generalized differentiation I: Basic Theory. Springer, 2006. A.1
- Nedic, A. and Lee, S. On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM Journal on Optimization, 24(1):84–107, 2014. 2
- Nemirovski, A. and Yudin, D. Problem complexity and Method Efficiency in Optimization. New York: Wiley, 1983. Nauka Publishers, Moscow (in Russian), 1978. 2, 2
- Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations (ICLR), New Orleans, Louisiana, USA. 2019. 1
- Osher, S., Burger, M., Goldfarb, D., Xu, J., and Yin, W. An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation, 4(2):460–489, 2005. 2
- Osher, S., Ruan, F., Xiong, J., Yao, Y., and Yin, W. Sparse recovery via differential inclusions. Applied and Computational Harmonic Analysis, 2016. 2, 3
- Rockafellar, R. T. and Wets, R. J.-B. Variational analysis. Grundlehren Math. Wiss. 317, Springer-Verlag, New York, 1998. A.1
- Shiota, M. Geometry of subanalytic and semialgebraic sets, volume 150 of Progress in Mathematics. Birkhauser, Boston, 1997. A.1
- Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014. 2, 6
- Su, W., Boyd, S., and Candes, E. J. A differential equation for modeling nesterov’s accelerated gradient method: theory and insights. The Journal of Machine Learning Research, 17(1):5312–5354, 2016. 2
- Tropp, J. A. Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inform. Theory, 50(10):2231–2242, 2004. 1, 2
- van den Dries, L. A generalization of the tarski-seidenberg theorem and some nondefinability results. Bull. Amer. Math. Soc. (N.S.), 15:189–193, 1986. A.2
- van den Dries, L. and Miller, C. Geometric categories and o-minimal structures. Duke Mathematical Journal, 84:497–540, 1996. A.2
- Venturi, L., Bandeira, A. S., and Bruna, J. Spurious valleys in two-layer neural network optimization landscapes. 2018. arXiv:1802.06384. 1
- Wahlberg, B., Boyd, S., Annergren, M., and Wang, Y. An admm algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes, 45(16): 83–88, 2012. 2
- Wang, H. and Banerjee, A. Online alternating direction method (longer version). arXiv preprint arXiv:1306.3721, 2013. 2
- Wang, H. and Banerjee, A. Bregman alternating direction method of multipliers. In Advances in Neural Information Processing Systems, pp. 2816–2824, 2014. 2
- Wang, Y., Yin, W., and Zeng, J. Global convergence of admm in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1):29–63, 2019. A.1
- Wei, Y., Yang, F., and Wainwright, M. J. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. The 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 2017. 1
- Zhu, W., Huang, Y., and Yao, Y. On breiman’s dilemma in neural networks: Phase transitions of margin dynamics. arXiv:1810.03389, 2018. 1
- Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In NIPS, 2016. 1
- Xue, F. and Xin, J. Convergence of a relaxed variable splitting method for learning sparse neural networks via ℓ1, ℓ0, and transformed-ℓ1 penalties. arXiv:1812.05719v2, 2018. URL http://arxiv.org/abs/1812.05719. 4
- Yang, H., Kang, G., Dong, X., Fu, Y., and Yang, Y. Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI 2018, 2018. 5.4
- Yao, Y., Rosasco, L., and Caponnetto, A. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007. 1
- Yin, W., Osher, S., Darbon, J., and Goldfarb, D. Bregman iterative algorithms for compressed sensing and related problems. SIAM Journal on Imaging Sciences, 1(1):143–168, 2008. 2
- Yoon, J. and Hwang, S. J. Combined group and exclusive sparsity for deep neural networks. In ICML, 2017. 1
- Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006. 1
- Zeng, J., Lau, T. T.-K., Lin, S.-B., and Yao, Y. Global convergence of block coordinate descent in deep learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, 2019a. URL https://arxiv.org/abs/1803.00225. 4, A.2
- Zeng, J., Lin, S.-B., and Yao, Y. A convergence analysis of nonlinearly constrained admm in deep learning. arXiv preprint arXiv:1902.02060, 2019b. 2
- Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (ICLR), 2017. arXiv:1611.03530. 1
- Zhao, P. and Yu, B. On model selection consistency of lasso. J. Machine Learning Research, 7:2541–2567, 2006. 1, 2
- Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression, 2017. 5, 5.4
