# Provable Filter Pruning for Efficient Neural Networks

ICLR, 2020.

Abstract:

We present a provable, sampling-based approach for generating compact Convolutional Neural Networks (CNNs) by identifying and removing redundant filters from an over-parameterized network. Our algorithm uses a small batch of input data points to assign a saliency score to each filter and constructs an importance sampling distribution [...]
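The sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that per-filter saliency scores are already available (the values below are made up, and the way the paper derives scores from the input batch is not reproduced here). The function name `sample_filters` and the fixed seed are illustrative choices, not part of the paper.

```python
import random

def sample_filters(scores, num_samples, seed=0):
    """Sample filter indices with probability proportional to their score.

    `scores` are hypothetical non-negative saliency values, one per filter.
    Returns the sorted list of distinct indices that were drawn.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    # Normalizing the scores gives the importance sampling distribution.
    probs = [s / total for s in scores]
    rng = random.Random(seed)
    drawn = rng.choices(range(len(scores)), weights=probs, k=num_samples)
    # Filters that were never drawn are the ones that would be removed.
    return sorted(set(drawn))

scores = [0.05, 0.40, 0.0, 0.30, 0.25]  # assumed scores, for illustration only
kept = sample_filters(scores, num_samples=3)
```

Sampling with replacement means the number of distinct filters kept can be smaller than `num_samples`; a filter with score zero is never drawn.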

Introduction

- Modern networks with millions of parameters require excessive amounts of memory and computational resources to store and conduct inference.
- A common practice to obtain small, efficient network architectures is to train an over-parameterized network, prune it by removing the least significant weights, and re-train the pruned network (Gale et al, 2019; Frankle & Carbin, 2019; Han et al, 2015; Baykal et al, 2019b).
- This prune-retrain cycle is often repeated iteratively until the network cannot be pruned any further without incurring a significant loss in predictive accuracy relative to that of the original model.
- A diverse set of smart pruning strategies has been proposed in order to generate compact, accurate neural network models in a computationally efficient way.
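The iterative prune-retrain cycle described above can be sketched as a simple loop. This is a generic illustration under stated assumptions, not the authors' implementation: `prune_step`, `retrain`, and `evaluate` are hypothetical callables supplied by the caller, and the toy model at the bottom is just a list of numbers with an artificial error function.

```python
def iterative_prune_retrain(model, prune_step, retrain, evaluate,
                            max_err_increase, max_rounds=10):
    """Repeat prune -> retrain until the error grows too much.

    All callables are placeholders supplied by the caller:
      prune_step(model) -> smaller model
      retrain(model)    -> fine-tuned model
      evaluate(model)   -> test error
    """
    baseline_err = evaluate(model)
    best = model
    for _ in range(max_rounds):
        candidate = retrain(prune_step(best))
        if evaluate(candidate) - baseline_err > max_err_increase:
            break  # further pruning costs too much accuracy
        best = candidate
    return best

# Toy stand-ins (assumptions for illustration only).
weights = [0.1 * (i + 1) * (-1) ** i for i in range(16)]

def drop_smallest_quarter(m):
    # Remove roughly the 25% of weights with the smallest magnitude.
    n_drop = max(1, len(m) // 4)
    return sorted(m, key=abs, reverse=True)[:len(m) - n_drop]

def toy_error(m):
    # Artificial error that grows as the model shrinks.
    return 1.0 / len(m)

pruned = iterative_prune_retrain(weights, drop_smallest_quarter,
                                 lambda m: m, toy_error,
                                 max_err_increase=0.05)
```

In this toy run the model shrinks from 16 to 12 to 9 weights; the next round (7 weights) would exceed the allowed error increase, so the loop stops.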

Highlights

- Despite widespread empirical success, modern networks with millions of parameters require excessive amounts of memory and computational resources to store and conduct inference.
- We evaluate and compare the effectiveness of our approach in pruning a diverse set of network architectures trained on real-world data sets.
- We evaluate and compare our algorithm’s performance to that of state-of-the-art pruning schemes in generating compact networks that retain the predictive accuracy of the original model.
- Our evaluations show that our approach generates significantly smaller and more efficient models compared to those generated by competing methods.
- Our results demonstrate the practicality and widespread applicability of our proposed approach: across all of our experiments, our algorithm took on the order of a minute to prune a given network, required no manual tuning of its hyperparameters, and performed consistently well across a diverse set of pruning scenarios.
- We presented, to the best of our knowledge, the first filter pruning algorithm that generates a pruned network with theoretical guarantees on the size and performance of the generated network.

Methods

- The authors consider pruning the networks using the standard iterative prune-retrain procedure as before, with only a limited number of iterations (2-3 iterations per reported experiment).
- The results of the evaluations are reported in Table 6 with respect to the following metrics: the resulting error of the pruned network (Pruned Err.), the difference in model classification error (Err. Diff), the percentage of parameters pruned (PR), and the FLOP reduction (FR).
- The comparison tables list the authors' method ("Ours") alongside SoftNet (He et al, 2018), ThiNet (Luo et al, 2017), FT (Li et al, 2016), He et al (2017), He et al (2019), Dong et al (2017), Luo & Wu (2018), Huang et al (2018), Ye et al (2018), Li et al (2019), Liu et al (2019a), and Lin et al (2020); one column reports the original Top-5 error (%).
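The reporting metrics named above can be computed with a few lines of arithmetic. The sketch below is an assumption about the conventions (Err. Diff taken as pruned error minus original error, PR and FR as percentage reductions); the numbers are invented for illustration and are not results from the paper.

```python
def pruning_metrics(orig_err, pruned_err,
                    orig_params, pruned_params,
                    orig_flops, pruned_flops):
    """Return Err. Diff (percentage points), PR (%) and FR (%)."""
    return {
        # Assumed sign convention: positive means the pruned model is worse.
        "err_diff": pruned_err - orig_err,
        # Share of parameters removed by pruning.
        "pr": 100.0 * (1.0 - pruned_params / orig_params),
        # Share of FLOPs removed by pruning.
        "fr": 100.0 * (1.0 - pruned_flops / orig_flops),
    }

# Made-up example values.
m = pruning_metrics(orig_err=8.0, pruned_err=8.5,
                    orig_params=1_000_000, pruned_params=250_000,
                    orig_flops=2_000_000, pruned_flops=1_000_000)
```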

Results

- The authors evaluate and compare the algorithm’s performance to that of state-of-the-art pruning schemes in generating compact networks that retain the predictive accuracy of the original model.
- The authors' evaluations show that the approach generates significantly smaller and more efficient models compared to those generated by competing methods.
- The authors' results demonstrate the practicality and widespread applicability of the proposed approach: across all of the experiments, the algorithm took on the order of a minute to prune a given network, required no manual tuning of its hyperparameters, and performed consistently well across a diverse set of pruning scenarios.
- Additional results, comparisons, and experimental details can be found in Sec. E of the appendix.

Conclusion

- In addition to the favorable empirical results of the algorithm, the approach exhibits various advantages over competing methods that manifest themselves in the empirical evaluations.
- The authors' algorithm does not require any additional hyper-parameters other than the pruning ratio and the desired failure probability.
- Given these sole two parameters, the approach automatically allocates the number of filters to sample for each layer.
- The authors' approach can be broadly applied to varying network architectures and data sets with minimal hyper-parameter tuning necessary.
- The authors envision that besides its immediate use for pruning state-of-the-art models, the approach can be used as a sub-procedure in other deep learning applications, e.g., for identifying winning lottery tickets (Frankle & Carbin, 2019) and for efficient architecture search (Liu et al, 2019b).

- Table 1: The prune ratio (PR) and the corresponding test error (Err.) of the sparsest [...]
- Table 2: Overview of the pruning performance of each algorithm for various CNN architectures. For each algorithm and network architecture, the table reports the prune ratio (PR, %) and pruned FLOPs ratio (FR, %) of pruned models when achieving test accuracy within 0.5% of the original network’s test accuracy (or the closest result when the desired test accuracy was not achieved for the range of tested PRs). Our results indicate that our pruning algorithm generates smaller and more efficient networks with minimal loss in accuracy, when compared to competing approaches.
- Table 3: We report the hyperparameters used during MNIST training, pruning, and fine-tuning for the LeNet architectures. LR denotes the learning rate and LR decay denotes the learning rate decay that we deploy after a certain number of epochs. During fine-tuning we used the same hyperparameters except for the ones indicated in the lower part of the table.
- Table 4: We report the hyperparameters used during training, pruning, and fine-tuning for various convolutional architectures on CIFAR-10. LR denotes the learning rate and LR decay denotes the learning rate decay that we deploy after a certain number of epochs. During fine-tuning we used the same hyperparameters except for the ones indicated in the lower part of the table. {30, . . .} denotes that the learning rate is decayed every 30 epochs.
- Table 5: The hyper-parameters used for training and pruning residual networks trained on the ImageNet data set.
- Table 6: Comparisons of the performance of various pruning algorithms on ResNets trained on ImageNet (Russakovsky et al, 2015). The reported results for the competing algorithms were taken directly from the corresponding papers. For each network architecture, the best performing algorithm for each evaluation metric, i.e., Pruned Err., Err. Diff, PR, and FR, is shown in bold.
- Table 7: We report the hyperparameters used for training and pruning the driving network of Amini et al (2018).
- Table 8: The performance of our algorithm and that of state-of-the-art filter pruning algorithms on modern CNN architectures trained on CIFAR-10. The reported results for the competing algorithms were taken directly from the corresponding papers. For each network architecture, the best performing algorithm for each evaluation metric, i.e., Pruned Err., Err. Diff, PR, and FR, is shown in bold. The results show that our algorithm consistently outperforms state-of-the-art pruning approaches in nearly all of the relevant pruning metrics.

Related work

- General network compression. The need to tame the excessive storage requirements and costly inference associated with large, over-parameterized networks has led to a rich body of work in network pruning and compression. These approaches range from those inspired by classical tensor decompositions (Yu et al, 2017b; Jaderberg et al, 2014; Denton et al, 2014) and random projections and hashing (Arora et al, 2018; Ullrich et al, 2017; Chen et al, 2015; Weinberger et al, 2009; Shi et al, 2009), which compress a pre-trained network, to those that enable sparsity by embedding it as an objective directly in the training process (Ioannou et al, 2015; Alvarez & Salzmann, 2017) or exploit tensor structure to induce sparsity (Choromanska et al, 2016; Zhao et al, 2017). Overall, the predominant drawback of these methods is that they require laborious hyperparameter tuning, lack rigorous theoretical guarantees on the size and performance of the resulting compressed network, and/or conduct compression in a data-oblivious way.

- Weight-based pruning. A large subset of modern pruning algorithms falls under the general approach of pruning individual weights of the network by assigning each weight a saliency score, e.g., its magnitude (Han et al, 2015), and subsequently inducing sparsity by deterministically removing those weights below a certain saliency score threshold (Guo et al, 2016; Han et al, 2015; Lee et al, 2019; LeCun et al, 1990). These approaches are heuristics that do not provide any theoretical performance guarantees and generally require, with the exception of Lee et al (2019), computationally expensive train-prune-retrain cycles and tedious hyper-parameter tuning. Unlike our approach, which enables accelerated inference (i.e., a reduction in FLOPs) on any hardware and with any deep learning library by generating a smaller subnetwork, weight-based pruning generates a model with non-structured sparsity that requires specialized hardware and sparse linear algebra libraries in order to speed up inference.
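Magnitude-based weight pruning, as described in the paragraph above, can be sketched in a few lines. This is a generic illustration of thresholding by absolute value, not code from any of the cited works; the matrix and threshold are made-up values.

```python
def magnitude_prune(weights, threshold):
    """Zero out every weight whose magnitude is below `threshold`.

    `weights` is a dense matrix given as a list of rows. The shape is
    unchanged; only individual entries become zero, so the resulting
    sparsity is unstructured.
    """
    return [[w if abs(w) >= threshold else 0.0 for w in row]
            for row in weights]

W = [[0.8, -0.05, 0.3],
     [0.02, -0.6, 0.1]]
W_sparse = magnitude_prune(W, threshold=0.2)
```

The pruned matrix keeps its original 2x3 shape, which is why a dense matrix multiply does no less work after pruning and a speed-up depends on sparse kernels or hardware support.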

Funding

- This research was supported in part by the U.S. National Science Foundation (NSF) under Awards 1723943 and 1526815, Office of Naval Research (ONR) Grant N00014-18-1-2830, Microsoft, and JP Morgan Chase.

References

- Dimitris Achlioptas, Zohar Karnin, and Edo Liberty. Matrix entry-wise sampling: Simple is best. Submitted to KDD, 2013(1.1):1–4, 2013.
- Jose M Alvarez and Mathieu Salzmann. Compression-aware training of deep networks. In Advances in Neural Information Processing Systems, pp. 856–867, 2017.
- Alexander Amini, Liam Paull, Thomas Balch, Sertac Karaman, and Daniela Rus. Learning steering bounds for parallel autonomous systems. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. IEEE, 2018.
- Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pp. 254–263, 2018.
- Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
- Cenk Baykal, Lucas Liebenwein, Igor Gilitschenski, Dan Feldman, and Daniela Rus. Data-dependent coresets for compressing neural networks with applications to generalization bounds. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=HJfwJ2A5KX.
- Cenk Baykal, Lucas Liebenwein, Igor Gilitschenski, Dan Feldman, and Daniela Rus. Sipping neural networks: Sensitivity-informed provable pruning of neural networks. arXiv preprint arXiv:1910.05422, 2019b.
- Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889, 2016.
- Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International conference on machine learning, pp. 2285–2294, 2015.
- Anna Choromanska, Krzysztof Choromanski, Mariusz Bojarski, Tony Jebara, Sanjiv Kumar, and Yann LeCun. Binary embeddings with structured hashed projections. In International Conference on Machine Learning, pp. 344–353, 2016.
- Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pp. 1269–1277, 2014.
- Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. More is less: A more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5840–5848, 2017.
- Petros Drineas and Anastasios Zouzias. A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. Information Processing Letters, 111(8):385–389, 2011.
- Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pp. 569–578. ACM, 2011.
- Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
- Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
- Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pp. 1379–1387, 2016.
- Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015. URL http://arxiv.org/abs/1510.00149.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2234–2240. AAAI Press, 2018.
- Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349, 2019.
- Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397, 2017.
- Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
- Qiangui Huang, Kevin Zhou, Suya You, and Ulrich Neumann. Learning to prune filters in convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 709–718. IEEE, 2018.
- Yani Ioannou, Duncan Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training cnns with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744, 2015.
- Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
- Abhisek Kundu and Petros Drineas. A note on randomized element-wise matrix sparsification. arXiv preprint arXiv:1404.0320, 2014.
- Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pp. 598–605, 1990.
- Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1VZqjAcYX.
- Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
- Yawei Li, Shuhang Gu, Luc Van Gool, and Radu Timofte. Learning filter basis for convolutional neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5623–5632, 2019.
- Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. Dynamic model pruning with feedback. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJem8lSFwB.
- Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting Cheng, and Jian Sun. Metapruning: Meta learning for automatic neural network channel pruning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3296–3305, 2019a.
- Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=rJlnB3C5Ym.
- Jian-Hao Luo and Jianxin Wu. Autopruner: An end-to-end trainable filter pruning method for efficient deep model inference. arXiv preprint arXiv:1805.08941, 2018.
- Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066, 2017.
- Shannon McCurdy. Ridge regression and provable deterministic ridge leverage score sampling. In Advances in Neural Information Processing Systems, pp. 2463–2472, 2018.
- Dimitris Papailiopoulos, Anastasios Kyrillidis, and Christos Boutsidis. Provable deterministic leverage score sampling. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 997–1006. ACM, 2014.
- Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
- Konstantinos Pitas, Mike Davies, and Pierre Vandergheynst. Revisiting hard thresholding for dnn pruning. arXiv preprint arXiv:1905.08793, 2019.
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115 (3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan. Hash kernels for structured data. Journal of Machine Learning Research, 10(Nov):2615–2637, 2009.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
- Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.
- Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.
- Ramon van Handel. Probability in high dimension. Technical report, PRINCETON UNIV NJ, 2014.
- Roman Vershynin. High-dimensional probability. An Introduction with Applications, 2016.
- Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th annual international conference on machine learning, pp. 1113–1120, 2009.
- Jianbo Ye, Xin Lu, Zhe Lin, and James Z. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJ94fqApW.
- Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. Nisp: Pruning networks using neuron importance score propagation. Preprint at https://arxiv.org/abs/1711.05908, 2017a.
- Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. Nisp: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203, 2018.
- Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7370–7379, 2017b.
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- Liang Zhao, Siyu Liao, Yanzhi Wang, Zhe Li, Jian Tang, and Bo Yuan. Theoretical properties for neural networks with weight matrices of low displacement rank. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 4082–4090. JMLR. org, 2017.
- Neuron pruning. Pruning entire neurons in FNNs and filters in CNNs is particularly appealing as it shrinks the network into its slimmer counterpart, which leads to alleviated storage requirements and improved inference-time performance on any hardware. Similar to the weight-based approaches, approaches in this domain assign an importance score to each neuron or filter and remove those with a score below a certain threshold (He et al., 2018; Li et al., 2016; Yu et al., 2017a). These approaches generally take the $\ell_p$ norm of the filters (with $p \in \{1, 2\}$ as popular choices) to assign filter importance and subsequently prune unimportant filters. These methods are data-oblivious heuristics that heavily rely on the assumption that filters with large weight magnitudes are more important, which may not hold in general (Ye et al., 2018).
- Our work is most similar to that of (Baykal et al., 2019a;b), which proposed a weight pruning algorithm with provable guarantees that samples weights of the network in accordance with an empirical notion of parameter importance. The main drawbacks of their approach are its limited applicability to only fully-connected networks and the lack of inference-time acceleration due to the non-structured sparsity caused by removing individual weights. Our method is also sampling-based and relies on a data-informed notion of importance; however, unlike (Baykal et al., 2019a;b), our approach can be applied to both FNNs and CNNs and generates sparse, efficient subnetworks that accelerate inference.
- Definition 1 (Edge Sensitivity (Baykal et al., 2019a)). Fixing a layer $\ell \in [L]$, let $w_{ij}^{\ell+1}$ be the weight of edge $(j, i) \in [\eta^{\ell}] \times [\eta^{\ell+1}]$. The empirical sensitivity of weight entry $w_{ij}^{\ell+1}$ with respect to input $x \in \mathcal{X}$ is defined to be $g_{ij}^{\ell+1}(x)$
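The norm-based filter scoring described under neuron pruning above (the data-oblivious heuristic the paper contrasts itself with, not the paper's own sampling method) can be sketched as follows. Each filter is represented as a flat list of weights, which is a simplification; the function names and the threshold are illustrative assumptions.

```python
def filter_norms(filters, p=2):
    """l_p norm of each filter; a filter is a flat list of its weights."""
    return [sum(abs(w) ** p for w in f) ** (1.0 / p) for f in filters]

def prune_filters_by_norm(filters, threshold, p=2):
    """Keep only the filters whose l_p norm is at least `threshold`."""
    norms = filter_norms(filters, p)
    return [f for f, n in zip(filters, norms) if n >= threshold]

# Made-up filters for illustration.
filters = [[3.0, 4.0], [0.1, 0.2], [-1.0, 0.0]]
kept = prune_filters_by_norm(filters, threshold=0.5)
```

Because whole filters are dropped, the layer itself becomes smaller, so the saving carries over to standard dense computation. The score ignores the input data entirely, which is the limitation noted in the text.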
