WoodFisher: Efficient Second-Order Approximation for Neural Network Compression
NeurIPS 2020
Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems. Recently, there has been significant interest in utilizing this information in the context of deep neural networks; however, relatively little is known about the quality of existing approximations ...
- The recent success of deep learning, e.g. [1, 2], has brought about significant accuracy improvement in areas such as computer vision [3, 4] or natural language processing [5, 6].
- Central to this performance progression has been the size of the underlying models, with millions or even billions of trainable parameters [4, 5], a trend which seems likely to continue for the foreseeable future.
- This is often referred to as the local quadratic model for the loss and is given by L(w + δw) ≈ L(w) + ∇wL⊤ δw + ½ δw⊤ H δw, where H = ∇²wL denotes the Hessian
- We apply WoodFisher to compress commonly used convolutional neural networks (CNNs) for image classification. We consider both the one-shot pruning case, and the gradual case, as well as investigate how the block-wise assumptions and the number of samples used for estimating Fisher affect the quality of the approximation, and whether this can lead to more accurate pruned models
- We find that the global magnitude pruning is worse than WoodFisher-independent until about 60% sparsity, beyond which it is likely that adjusting layer-wise sparsity is essential
- Similar to results reported in past work, we find that WoodTaylor is significantly better than magnitude- or diagonal-Fisher-based pruning, as well as the global version of magnitude pruning
- Some of the many interesting directions in which to apply WoodFisher include structured pruning, which can be facilitated by the OBD framework; compressing popular NLP models such as Transformers; and providing efficient Inverse-Hessian-Vector Product (IHVP) estimates for influence functions
- Our work revisits the theoretical underpinnings of neural network pruning, and shows that foundational work can be successfully extended to large-scale settings, yielding state-of-the-art results
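The bullets above center on WoodFisher's core primitive: an efficient estimate of the inverse empirical Fisher, built from per-sample gradients via rank-one Woodbury (Sherman-Morrison) updates rather than an explicit matrix inversion. The following is a minimal NumPy sketch of that recursion on a small dense matrix; the paper applies it block-wise at far larger scale, and the function name here is illustrative, not the paper's API:

```python
import numpy as np

def woodfisher_inverse(grads, damp=1e-3):
    """Inverse of the damped empirical Fisher F = damp*I + (1/N) sum_i g_i g_i^T,
    built by N rank-one Sherman-Morrison updates in O(N d^2), without ever
    forming or factorizing F itself.

    grads: (N, d) array of per-sample gradients g_i.
    """
    N, d = grads.shape
    F_inv = np.eye(d) / damp            # inverse of the damping term alone
    for g in grads:
        Fg = F_inv @ g                  # O(d^2) per rank-one update
        # (A + g g^T / N)^{-1} = A^{-1} - A^{-1} g g^T A^{-1} / (N + g^T A^{-1} g)
        F_inv -= np.outer(Fg, Fg) / (N + g @ Fg)
    return F_inv
```

The recursion returns exactly the inverse of the damped empirical Fisher, which is why it can be checked against a direct `np.linalg.inv` on toy sizes.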
- The authors apply WoodFisher to compress commonly used CNNs for image classification. The authors consider both the one-shot pruning case, and the gradual case, as well as investigate how the block-wise assumptions and the number of samples used for estimating Fisher affect the quality of the approximation, and whether this can lead to more accurate pruned models.
5.1 One-shot pruning
Assume that the authors are given a pre-trained neural network which they would like to sparsify in a single step, without any re-training.
- A dampening of 1e−1 was used for WoodTaylor; Figures S13 and S14 in Appendix S8.1 present the relevant ablation studies
- The decision to remove parameters based on their pruning statistic can be made either independently for every layer, or jointly across the whole network.
- The latter option allows them to automatically adjust the sparsity distribution across the various layers given a global sparsity target for the network.
- The authors hope that the findings can provide further momentum to the investigation of second-order properties of neural networks, and be extended to applications beyond compression
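The "independent" versus "joint" decision described above can be made concrete with the classic Optimal Brain Surgeon pruning statistic ρ_q = w_q² / (2 [F⁻¹]_qq), which scores the predicted loss increase of removing weight q. The sketch below uses that statistic with illustrative helper names (`obs_statistics` and `select_joint` are ours, not the paper's API); the joint mode ranks weights globally, which is what lets the per-layer sparsity distribution adjust automatically:

```python
import numpy as np

def obs_statistics(w, F_inv):
    """Per-weight pruning statistic rho_q = w_q^2 / (2 [F^{-1}]_qq):
    the predicted increase in loss from zeroing weight q (with the
    compensating OBS update applied to the remaining weights)."""
    return w**2 / (2.0 * np.diag(F_inv))

def select_joint(stats_per_layer, sparsity):
    """'Joint' (global) mode: rank all weights across layers by their
    statistic and prune the globally cheapest fraction, so the sparsity
    distribution across layers is chosen automatically."""
    flat = np.concatenate(stats_per_layer)
    k = int(sparsity * flat.size)
    thresh = np.partition(flat, k)[k]            # k-th smallest statistic
    return [s < thresh for s in stats_per_layer]  # boolean prune masks
```

In "independent" mode one would instead apply the same thresholding within each layer separately, at a fixed per-layer sparsity.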
- Table1: Comparison of WoodFisher gradual pruning results with the state-of-the-art approaches. WoodFisher and Global Magnitude results are averaged over two runs. For their best scores, please refer to Table S4. LS denotes label smoothing, and ERK denotes the Erdos-Renyi Kernel
- Table2: Comparison with state-of-the-art DPF [24] in a more commensurate setting by starting from a similarly trained dense baseline. The numbers for Incremental & SNFS are taken from [24]
- Table3: Comparison of WoodFisher, Global Magnitude and Magnitude (GMP) pruning for RESNET-50 on IMAGENET. Namely, this involves skipping the first (input) convolutional layer, and pruning the last fully connected layer to 80%. The rest of the layers are pruned to an overall target of 90%. So, here GMP prunes all of them uniformly to 90%, while WoodFisher and Global Magnitude use their obtained sparsity distributions for intermediate layers at the same overall target of 90%
- Table4: Comparison of WoodFisher gradual pruning results for MobileNetV1 on ImageNet in the 75% and 90% sparsity regimes. (α) next to Incremental [19] highlights that the first convolutional and all depthwise convolutional layers are kept dense, unlike the other shown methods. The obtained sparsity distribution and other details can be found in Section S6.2
- Table5: WoodFisher and Global Magnitude gradual pruning results, reported with mean and standard deviation across 5 runs, for MobileNetV1 on ImageNet in the 75% and 90% sparsity regimes. We also run a two-sided Student's t-test to check if WoodFisher significantly outperforms Global Magnitude, which we find to be true at a significance level of α = 0.05
- Table6: Comparison of WoodFisher (WF) and STR across theoretical FLOP counts for RESNET50 and MOBILENETV1 on IMAGENET in 90% and 75% sparsity regime respectively
- Table7: Comparison of inference times at batch sizes 1 and 64 for various sparse models, executed on the framework of [<a class="ref-link" id="c29" href="#r29">29</a>], on an 18-core Haswell CPU. The table also contains the Top-1 Accuracy for the model on the ILSVRC validation set
- Table8: Comparison of one-shot pruning performance of WoodFisher, when the considered Fisher matrix is empirical Fisher or one-sample approximation to true Fisher, for RESNET-50 on IMAGENET. The results are averaged over three seeds
- Table9: Detailed hyperparameters for the gradual pruning results presented in Tables 1, 4
- Table10: Effect of Fisher subsample size and Fisher mini-batch size on one-shot pruning performance of WoodFisher, for RESNET-50 on IMAGENET. A chunk size of 1000 was used for this experiment. The results are averaged over three seeds
- Table11: Effect of Fisher subsample size and Fisher mini-batch size on one-shot pruning performance of WoodFisher, for MOBILENETV1 on IMAGENET. A chunk size of 10,000 was used for this experiment. The results are averaged over three seeds
- Table12: At best runs: Comparison of WoodFisher gradual pruning results with the state-of-the-art approaches. LS denotes label smoothing, and ERK denotes the Erdos-Renyi Kernel
- Table13: At best runs: Comparison of WoodFisher and magnitude pruning with the same layer-wise sparsity targets as used in [18] for RESNET-50 on IMAGENET. Namely, this involves skipping the first convolutional layer, pruning the last fully connected layer to 80% and the rest of the layers equally to 90%
- Table14: The obtained distribution of sparsity across the layers by WoodFisher and Global Magnitude when sparsifying RESNET-50 to 80%, 90%, 95%, 98% levels on IMAGENET
- Table15: The obtained distribution of sparsity across the layers by WoodFisher and Global Magnitude when sparsifying MOBILENETV1 to 75%, 89% levels on IMAGENET
- Table16: The obtained distribution of sparsity across the layers by WoodFisher when sparsifying MOBILENETV1 to 75.28% sparsity level with FLOPs-aware hyperparameter β = 0.00, 0.30, 0.35 on IMAGENET
- This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 805223 ScaleML)
Study subjects and analysis
However, this rests on the assumption that each individual gradient vanishes for well-optimized networks, which we did not find to hold in our experiments. Further, they argue that a large number of samples are needed for the empirical Fisher to serve as a good approximation—in our experiments, we find that a few hundred samples suffice for our applications (e.g. Figure 1).
We choose three layers from different stages of a pre-trained RESNET-20 on CIFAR10, and Figure 2 presents these results. In all three cases, the local quadratic model using WoodFisher predicts an accurate approximation to the actual underlying loss. Further, it is possible to use the dampening λ to control whether a more conservative or relaxed estimate is needed
4 Model Compression
Finally, diagonal-Fisher performs worse than magnitude pruning for sparsity levels higher than 30%. This finding was consistent, and so we omit it in the sections ahead. (We used 16,000 samples to estimate the diagonal Fisher, whereas WoodFisher performs well even with 1,000 samples.)
EFFECT OF CHUNK SIZE
Figure 5b illustrates these results for the ‘joint’ pruning mode (similar results can be observed in the ‘independent’ mode as well). The number of samples used for estimating the inverse is the same across K-FAC and WoodFisher (i.e., 50,000 samples)². This highlights the better approximation quality provided by WoodFisher, which unlike K-FAC does not make major assumptions
² For the sake of efficiency, in the case of WoodFisher, we utilize 1,000 averaged gradients over a mini-batch size of 50. But even if no mini-batching is used and, say, 1,000 samples are considered for both, we notice similar gains over K-FAC.
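The chunk-size and mini-batching choices discussed above (chunk sizes of 1,000–10,000, and e.g. 1,000 gradients averaged over mini-batches of 50) can be sketched as follows. This is an illustrative block-diagonal variant of the Woodbury recursion under those assumptions, not the paper's exact implementation, and the function names are ours:

```python
import numpy as np

def chunked_woodfisher_inverse(grads, chunk_size, damp=1e-2):
    """Block-diagonal ('chunked') inverse Fisher: each run of `chunk_size`
    consecutive coordinates is inverted independently, reducing cost from
    O(N d^2) to O(N d * chunk_size). Returns the per-chunk inverse blocks."""
    N, d = grads.shape
    blocks = []
    for start in range(0, d, chunk_size):
        gc = grads[:, start:start + chunk_size]
        B_inv = np.eye(gc.shape[1]) / damp
        for g in gc:                         # rank-one Woodbury updates per chunk
            Bg = B_inv @ g
            B_inv -= np.outer(Bg, Bg) / (N + g @ Bg)
        blocks.append(B_inv)
    return blocks

def minibatch_gradients(per_sample_grads, batch_size):
    """Average per-sample gradients over mini-batches, as in the footnote
    (e.g. 1,000 averaged gradients over mini-batches of size 50)."""
    N, d = per_sample_grads.shape
    n_full = (N // batch_size) * batch_size  # drop any ragged tail
    return per_sample_grads[:n_full].reshape(-1, batch_size, d).mean(axis=1)
```

Each returned block equals the dense Woodbury inverse restricted to that chunk's coordinates, so larger chunks trade compute for a tighter approximation.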
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
- Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. doi: 10.1109/cvpr. 2016.90. URL http://dx.doi.org/10.1109/CVPR.2016.90.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.
- Alec Radford. Improving language understanding by generative pre-training. 2018.
- Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. ArXiv, October 2019.
-  Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan-Kaufmann, 1990. URL http://papers.nips.cc/paper/250-optimal-brain-damage.pdf.
-  Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in neural information processing systems, pages 107–115, 1989.
-  Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 164–171. Morgan-Kaufmann, 1993. URL http://papers.nips.cc/paper/647-second-order-derivatives-for-network-pruning-optimal-brain-surgeon.pdf.
-  Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033, 2020.
-  Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
-  John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, 2010.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
-  James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature, 2015.
-  Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions, 2017.
-  James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017. ISSN 0027-8424. doi: 10.1073/pnas.1611835114. URL https://www.pnas.org/content/114/13/3521.
-  Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks, 2019.
-  Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression, 2017.
-  Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with dense networks and fisher pruning, 2018.
-  Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. Eigendamage: Structured pruning in the kronecker-factored eigenbasis, 2019.
-  Wenyuan Zeng and Raquel Urtasun. MLPrune: Multi-layer pruning for automated neural network compression, 2019. URL https://openreview.net/forum?id=r1g5b2RcKm.
-  Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance, 2019.
-  Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. Dynamic model pruning with feedback. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJem8lSFwB.
-  Max A. Woodbury. Inverting modified matrices. SRG Memorandum report; 42. Princeton, NJ: Department of Statistics, Princeton University, 1950.
-  Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. ArXiv, abs/1704.04861, 2017.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham M. Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. ArXiv, abs/2002.03231, 2020.
-  Neural Magic Inc. Early access signup for the sparse inference engine, 2020. URL https://neuralmagic.com/earlyaccess/.
-  James Martens. New insights and perspectives on the natural gradient method, 2014.
-  Frederik Kunstner, Lukas Balles, and Philipp Hennig. Limitations of the empirical fisher approximation for natural gradient descent, 2019.
-  Sidak Pal Singh. Efficient second-order methods for model compression. EPFL Master Thesis, 2020. URL http://infoscience.epfl.ch/record/277227.
-  Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998. doi: 10.1162/089976698300017746. URL https://doi.org/10.1162/089976698300017746.
-  Nicol N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002. doi: 10.1162/08997660260028683. URL https://doi.org/10.1162/08997660260028683.
-  Valentin Thomas, Fabian Pedregosa, Bart van Merriënboer, Pierre-Antoine Manzagol, Yoshua Bengio, and Nicolas Le Roux. On the interplay between noise and curvature and its effect on optimization and generalization, 2019.
-  Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In NIPS 2007, January 2008. URL https://www.microsoft.com/en-us/research/publication/topmoumoute-online-natural-gradient-algorithm/.
-  Xin Dong, Shangyu Chen, and Sinno Jialin Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon, 2017.
-  James Martens. Deep learning via hessian-free optimization. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, page 735–742, Madison, WI, USA, 2010. Omnipress. ISBN 9781605589077.
-  Shankar Krishnan, Ying Xiao, and Rif. A. Saurous. Neumann optimizer: A practical optimization algorithm for deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkLyJl-0-.
-  Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time, 2016.
-  Tom Heskes. On natural learning and pruning in multilayered perceptrons. Neural Computation, 12: 881–901, 2000.
-  Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using kronecker-factored approximations. 2016.
-  Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Large-scale distributed second-order optimization using kronecker-factored approximate curvature for deep convolutional neural networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019. doi: 10.1109/cvpr.2019.01264. URL http://dx.doi.org/10.1109/CVPR.2019.01264.
-  Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers, 2016.
-  James Martens, Jimmy Ba, and Matt Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyMTkQZAb.
-  César Laurent, Thomas George, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. An evaluation of fisher approximations beyond kronecker factorization, 2018. URL https://openreview.net/forum?id=ryVC6tkwG.
-  Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, 2015.
-  Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. ArXiv, abs/1911.11134, 2019.
-  Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks, 2017.
-  Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through l0 regularization, 2017.
-  Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks, 2018.
-  Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization, 2019.
-  Mitchell Wortsman, Ali Farhadi, and Mohammad Rastegari. Discovering neural wirings, 2019.
-  Microsoft Corporation. The ONNX Runtime, 2019.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.
-  Barak A. Pearlmutter. Fast exact multiplication by the hessian. Neural Comput., 6(1):147–160, January 1994. ISSN 0899-7667. doi: 10.1162/neco.1994.6.1.147. URL https://doi.org/10.1162/neco.1994.6.1.147.
-  Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. Second-order optimization for deep reinforcement learning using kronecker-factored approximation. In NIPS, pages 5285–5294, 2017. URL http://papers.nips.cc/paper/7112-second-order-optimization-for-deep-reinforcement-learning-using-kronecker-factored-approximation.