Discrete Model Compression With Resource Constraint for Deep Neural Networks

CVPR, pp. 1896-1905, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.00197

Abstract:

In this paper, we address the problem of compression and acceleration of Convolutional Neural Networks (CNNs). Specifically, we propose a novel structural pruning method to obtain a compact CNN with strong discriminative power. To find such networks, we propose an efficient discrete optimization method to directly optimize chann…

Introduction
  • Convolutional Neural Networks (CNNs) have achieved great success in computer vision tasks [23, 39, 40, 42, 2].
  • With increasingly sophisticated GPU support for CNNs, network complexity has grown dramatically from several layers [23, 43] to hundreds of layers [9, 17]
  • Although these complex CNNs achieve strong performance on vision tasks, they bring an unavoidable growth in computational cost and model parameters.
  • Many efforts [8, 7] have been devoted to obtaining compact sub-networks from the original, computationally heavy model
Highlights
  • Convolutional Neural Networks (CNNs) have achieved great success in computer vision tasks [23, 39, 40, 42, 2]
  • We have introduced the core idea of our method, and the discrete model compression (DMC) algorithm is presented in Algorithm 1
  • For ResNet-34, our method can prune 43.4% of the floating-point operations (FLOPs) while causing only 0.73% and 0.31% drops in Top-1 and Top-5 accuracy, respectively
  • Although FPGM prunes slightly fewer FLOPs than our method (41.1% vs. 43.3%), it causes larger damage to the final performance (0.56% worse on Top-1 accuracy)
  • We proposed an effective discrete model compression method to prune Convolutional Neural Networks given certain resource constraints
  • To further enlarge the space, we introduced symmetric weight decay on the gate parameters, inspired by the fact that a regularization loss can be regarded as weight decay; a minimal sketch of such a gate is given after this list
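    As referenced above, the central ingredient of DMC is a discrete (0/1) channel gate whose on/off decision is sampled rather than fixed, with gradients passed back through a straight-through estimator [1]. The sketch below is a minimal illustration of that idea only; the class and parameter names (StochasticChannelGate, scores) are our own, and it omits the resource constraint, the symmetric weight decay on the gate parameters, and the other steps of the authors' Algorithm 1.

```python
import torch
import torch.nn as nn

class StochasticChannelGate(nn.Module):
    """Illustrative stochastic discrete gate: one learnable score per channel is
    mapped to a keep-probability, a hard 0/1 decision is sampled from it, and the
    straight-through estimator [1] lets gradients flow back to the scores.
    Names and details are assumptions, not the authors' released implementation."""

    def __init__(self, num_channels: int):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_channels))  # gate parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        keep_prob = torch.sigmoid(self.scores)        # in (0, 1)
        hard = torch.bernoulli(keep_prob).detach()    # discrete 0/1 sample per channel
        # Straight-through estimator: forward pass uses the hard 0/1 sample,
        # backward pass uses the gradient of keep_prob.
        gate = hard + keep_prob - keep_prob.detach()
        return x * gate.view(1, -1, 1, 1)             # mask channels of an NCHW feature map
```

    Because the gate output is exactly 0 or 1, a gated forward pass equals the forward pass of the corresponding pruned sub-network, which is the "exact estimation of sub-networks' outputs" mentioned in the conclusion.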
Methods
  • The authors' method outperforms DCP [49] by 0.38% on ∆-Acc with the same pruned FLOPs, and it outperforms DCP-adapt by 0.06% given similar pruned FLOPs. Collaborative channel pruning [37] is one of the most recent works on channel pruning; it considers the correlation between different weights when applying a Taylor expansion to the loss function (a generic sketch of such a Taylor-based importance score is given after this list).
  • For MobileNetV2, the method outperforms DCP by 0.04% on ∆-Acc while pruning 14% more FLOPs than DCP
  • This shows that a global discrimination-aware criterion works better than a local one.
  • Compared methods: IE [34], FPGM [12], GAL [26], DCP [49], CCP [37], DMC (this paper), and Rethinking [46]
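    Several of the compared baselines (e.g., IE [34] and CCP [37]) score channels with a Taylor expansion of the loss, as mentioned above. The snippet below is a generic first-order version of such a score for a per-channel scaling factor; it is only a rough illustration, and the exact criteria in [34] and [37] differ (CCP in particular also models correlations between channels).

```python
import torch

def first_order_taylor_importance(gamma: torch.Tensor, loss: torch.Tensor) -> torch.Tensor:
    """Generic first-order Taylor importance for per-channel scaling factors `gamma`
    (e.g., BatchNorm weights): |dL/dgamma * gamma| approximates the loss change when
    a channel is zeroed out. Illustrative only; not the exact criterion of [34]/[37]."""
    (grad,) = torch.autograd.grad(loss, gamma, retain_graph=True)
    return (grad * gamma).abs()
```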
Results
  • Results on ImageNet

    In Tab. 2, the authors present all the comparison results on ImageNet. All results are taken from the original papers except for ThiNet on MobileNetV2.
  • DCP [49], CCP [37], IE [34], FPGM [12] and GAL [26] are from this category.
  • Such high-quality baselines help to better understand the benefit of the discrete channel setting.
  • Although FPGM prunes slightly fewer FLOPs than the method (41.1% vs. 43.3%), it causes larger damage to the final performance (0.56% worse on Top-1 accuracy)
Conclusion
  • The authors proposed an effective discrete model compression method to prune CNNs given certain resource constraints (a FLOPs-accounting sketch for such a constraint is given after this list).
  • By turning the deterministic discrete gate into a stochastic discrete gate, the method can explore a larger search space of sub-networks.
  • To further enlarge the space, the authors introduced symmetric weight decay on the gate parameters, inspired by the fact that a regularization loss can be regarded as weight decay.
  • The authors' method benefits from an exact estimation of sub-networks' outputs thanks to the combination of the precise placement of gates and the discrete setting.
  • Extensive experimental results on ImageNet and CIFAR-10 show that the method outperforms state-of-the-art methods
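    The resource constraint in the conclusion is typically expressed as a FLOPs budget. Since removing output channels of one convolution also shrinks the input channels of the next, the FLOPs of a candidate sub-network follow directly from the kept-channel counts; the short sketch below uses the standard convolution FLOPs formula with illustrative layer shapes (not shapes from the paper).

```python
def conv_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    # Multiply-accumulate count of a standard k x k convolution.
    return c_in * c_out * k * k * h_out * w_out

# Illustrative two-layer example: keeping only 64 of 128 output channels in the
# first layer also removes half of the second layer's input channels.
full = conv_flops(64, 128, 3, 32, 32) + conv_flops(128, 256, 3, 16, 16)
pruned = conv_flops(64, 64, 3, 32, 32) + conv_flops(64, 256, 3, 16, 16)
print(f"FLOPs removed: {1.0 - pruned / full:.1%}")  # 50.0%
```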
Tables
  • Table 1: Comparison results on the CIFAR-10 dataset with ResNet-56 and MobileNetV2. ∆-Acc represents the performance change before and after model pruning; +/- indicates an increase or decrease compared to the baseline results (a worked example is given after this list). WM represents the width multiplier used in the original design of MobileNetV2; this result is from the DCP [49] paper
  • Table 2: Comparison results on the ImageNet dataset with ResNet-34, ResNet-50, ResNet-101 and MobileNetV2. ∆-Acc represents the performance change before and after model pruning; +/- indicates an increase or decrease compared to the baseline results. The ThiNet on MobileNetV2 results are from the DCP [49] paper
  • Table3: Performance of pruned models given different gate settings on ImageNet
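    As a worked example of the ∆-Acc column (with illustrative numbers, not values from the tables): if a baseline model reaches 93.5% Top-1 accuracy and the pruned model reaches 93.2%, then ∆-Acc = 93.2% − 93.5% = −0.3%; a pruned model reaching 93.8% would instead give ∆-Acc = +0.3%.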
Related work
  • Model compression has recently drawn a lot of attention from the computer vision community. In general, current model compression methods can be separated into the following four categories: weight pruning, structural pruning, weight quantization, and knowledge distillation [14].

    Weight pruning eliminates model parameters without any assumption on the structure of the weights. One of the early works [8] uses L1 or L2 magnitude as the criterion to remove weights: parameters whose magnitude falls below a certain threshold are removed, since weights with small magnitude are considered unimportant. A systematic DNN weight pruning framework [48] has been proposed using the alternating direction method of multipliers (ADMM) [3, 15, 16]. Different from the aforementioned works, SNIP [24] determines the importance of weights by backpropagating gradients from the loss function. The lottery ticket hypothesis [6] states that a randomly initialized dense network contains sparse sub-networks that, when trained in isolation from their original initialization, can match the accuracy of the full network.
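    As a concrete illustration of the magnitude-based criterion described above, the sketch below zeroes out the smallest-magnitude weights of a tensor. The function name and threshold choice are our own; practical frameworks such as [8] and [48] additionally retrain the remaining weights and, in the latter case, enforce the sparsity pattern through ADMM.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out roughly `sparsity` of the entries of `weight`, keeping the
    largest-magnitude ones (the basic L1-magnitude criterion). Illustrative sketch."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask

# Example: prune 80% of a convolution weight tensor by magnitude.
w = torch.randn(128, 64, 3, 3)
w_pruned = magnitude_prune(w, 0.8)
print((w_pruned == 0).float().mean())  # ≈ 0.80
```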
Funding
  • This work was partially supported by U.S. NSF IIS 1836945, IIS 1836938, IIS 1845666, IIS 1852606, IIS 1838627, and IIS 1837956
Reference
  • Yoshua Bengio, Nicholas Leonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
  • Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
  • Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
  • Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
  • Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2234–2240, 2018.
  • Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784– 800, 2018.
  • Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4340–4349, 2019.
  • Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Feihu Huang, Songcan Chen, and Heng Huang. Faster stochastic alternating direction method of multipliers for nonconvex optimization. In International Conference on Machine Learning, pages 2839–2848, 2019.
  • Feihu Huang, Shangqian Gao, Jian Pei, and Heng Huang. Nonconvex zeroth-order stochastic admm methods with lower function query complexity. arXiv preprint arXiv:1907.13463, 2019.
  • Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 304–320, 2018.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 448–456. JMLR.org, 2015.
  • Jaedeok Kim, Chiyoun Park, Hyun-Joo Jung, and Yoonsuck Choe. Plug-in, trainable gate for streamlining arbitrary neural networks. CoRR, abs/1904.10921, 2019.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. ICLR, 2019.
  • Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. ICLR, 2017.
  • Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, and David Doermann. Towards optimal structured cnn pruning via generative adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2790–2799, 2019.
  • Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.
  • Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
  • Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019.
  • Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through l0 regularization. In International Conference on Learning Representations, 2018.
  • Jian-Hao Luo and Jianxin Wu. Autopruner: An end-to-end trainable filter pruning method for efficient deep model inference. arXiv preprint arXiv:1805.08941, 2018.
  • Jian-Hao Luo, Hao Zhang, Hong-Yu Zhou, Chen-Wei Xie, Jianxin Wu, and Weiyao Lin. Thinet: pruning cnn filters for a thinner net. IEEE transactions on pattern analysis and machine intelligence, 2018.
  • Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2498–2507. JMLR. org, 2017.
  • Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019.
  • Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P Vetrov. Structured bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems, pages 6775–6784, 2017.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • Hanyu Peng, Jiaxiang Wu, Shifeng Chen, and Junzhou Huang. Collaborative channel pruning for deep networks. In International Conference on Machine Learning, pages 5113–5122, 2019.
  • Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
  • Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  • Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Chaoqi Wang, Roger B. Grosse, Sanja Fidler, and Guodong Zhang. Eigendamage: Structured pruning in the Kronecker-factored eigenbasis. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pages 6566–6575, 2019.
  • Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pages 2074–2082, 2016.
  • Jianbo Ye, Xin Lu, Zhe Lin, and James Z Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations, 2018.
  • Dejiao Zhang, Haozhu Wang, Mario Figueiredo, and Laura Balzano. Learning to share: Simultaneous parameter tying and sparsification in deep learning. 2018.
  • Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 184–199, 2018.
  • Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pages 875–886, 2018.
  • Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. 2017.