WoodFisher: Efficient Second-Order Approximation for Neural Network Compression

NeurIPS 2020 (2020)


Abstract

Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems. Recently, there has been significant interest in utilizing this information in the context of deep neural networks; however, relatively little is known about the quality of existing approximations ...
Introduction
  • The recent success of deep learning, e.g. [1, 2], has brought about significant accuracy improvement in areas such as computer vision [3, 4] or natural language processing [5, 6].
  • Central to this performance progression has been the size of the underlying models, with millions or even billions of trainable parameters [4, 5], a trend which seems likely to continue for the foreseeable future [7].
  • This is often referred to as the local quadratic model for the loss and is given by L(w + δw) ≈ L(w) + ∇wL⊤ δw + ½ δw⊤ H δw, where H = ∇²wL is the Hessian of the loss at w (the pruning statistic that follows from this model is sketched below).
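The pruning framework the paper builds on (Optimal Brain Damage/Surgeon [8, 10]) follows directly from this quadratic model; the block below is a compact restatement in our own notation (e_q is the q-th canonical basis vector), given only as an illustration of where the statistic comes from, not as a derivation taken verbatim from the paper.

```latex
% At a well-trained point the gradient term is dropped, leaving
%   \delta L \approx \tfrac{1}{2}\, \delta w^\top H\, \delta w .
% Pruning weight w_q imposes the constraint  e_q^\top \delta w + w_q = 0.
% Minimizing \delta L under this constraint gives the compensating update
% and the per-weight saliency used for ranking:
\[
  \delta w^{*} = -\,\frac{w_q}{[H^{-1}]_{qq}}\, H^{-1} e_q ,
  \qquad
  \rho_q = \frac{w_q^{2}}{2\,[H^{-1}]_{qq}} .
\]
% WoodFisher substitutes the inverse of a dampened empirical Fisher for H^{-1}.
```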
Highlights
  • The recent success of deep learning, e.g. [1, 2], has brought about significant accuracy improvement in areas such as computer vision [3, 4] or natural language processing [5, 6]
  • We apply WoodFisher to compress commonly used convolutional neural networks (CNNs) for image classification. We consider both the one-shot and the gradual pruning case, and investigate how the block-wise assumptions and the number of samples used for estimating the Fisher affect the quality of the approximation, and whether this can lead to more accurate pruned models (a sketch of the underlying Fisher-inverse estimator is given after this list)
  • We find that global magnitude pruning is worse than WoodFisher-independent until about 60% sparsity, beyond which it is likely that adjusting the layer-wise sparsity is essential
  • Consistent with the earlier results, we find that WoodTaylor is significantly better than magnitude- or diagonal-Fisher-based pruning, as well as the global version of magnitude pruning
  • Some of the many interesting directions in which to apply WoodFisher include structured pruning (which can be facilitated by the OBD framework), compressing popular NLP models such as Transformers [55], and providing efficient Inverse-Hessian-Vector Product (IHVP) estimates for influence functions [16]
  • Our work revisits the theoretical underpinnings of neural network pruning, and shows that foundational work can be successfully extended to large-scale settings, yielding state-of-the-art results
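To make the estimator referenced above concrete, here is a minimal NumPy sketch of the core WoodFisher idea: the inverse of a dampened empirical Fisher, F = λI + (1/N) Σᵢ gᵢgᵢᵀ, is built up one rank-one term at a time via the Sherman–Morrison (Woodbury) identity, so F itself is never formed. Function and variable names are ours, and this is not the authors' reference implementation; the paper additionally uses block-wise ("chunked") versions of this recursion to keep memory manageable for large layers.

```python
import numpy as np

def woodfisher_inverse(grads, damp=1e-5):
    """Inverse of the dampened empirical Fisher F = damp*I + (1/N) * sum_i g_i g_i^T,
    computed by applying the Sherman-Morrison identity once per gradient.
    grads: (N, d) array, one (possibly mini-batch-averaged) gradient per row.
    Illustrative sketch only."""
    N, d = grads.shape
    F_inv = np.eye(d) / damp                 # inverse of the dampened initialization
    for g in grads:
        Fg = F_inv @ g                       # O(d^2) per rank-one update
        F_inv -= np.outer(Fg, Fg) / (N + g @ Fg)
    return F_inv
```

Only the entries of F_inv that the pruning statistic and update actually touch are needed, which is why the block-wise (chunked) variant suffices in practice.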
Methods
  • Methods compared in the gradual pruning experiments (Table 1): DSR [52], Incremental [19], DPF [24], GMP + LS [18], Variational Dropout [49], RIGL + ERK [48], SNFS + LS [23], STR [28], Global Magnitude, DNW [53], and WoodFisher.
Results
  • The authors apply WoodFisher to compress commonly used CNNs for image classification. They consider both the one-shot and the gradual pruning case, and investigate how the block-wise assumptions and the number of samples used for estimating the Fisher affect the quality of the approximation, and whether this can lead to more accurate pruned models.

    5.1 One-shot pruning

    Assume that the authors are given a pre-trained neural network that they would like to sparsify in a single step, without any re-training (a sketch of this one-shot step is given after this list).
  • A dampening of 1e-1 was used for WoodTaylor; Figures S13 and S14 in Appendix S8.1 present the relevant ablation studies
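As a companion to the one-shot setting described above, the sketch below shows how the pruning statistic and the OBS-style compensating update could be applied once the Fisher inverse is available; the dampening enters through the `damp` argument of the earlier sketch. All names are ours, and applying the update sequentially per pruned weight is a simplification of how the paper handles removing many weights.

```python
import numpy as np

def one_shot_prune(w, F_inv, sparsity=0.5):
    """Remove the weights with the smallest statistic rho_q = w_q^2 / (2*[F^-1]_qq)
    and apply the corresponding compensating update to the remaining weights.
    Illustrative sketch; the paper prunes either per layer or globally."""
    diag = np.diag(F_inv)
    rho = w**2 / (2.0 * diag)                 # per-weight pruning statistic
    k = int(sparsity * w.size)
    prune_idx = np.argsort(rho)[:k]           # weights with least predicted loss increase
    new_w = w.copy()
    for q in prune_idx:
        # OBS-style update: delta_w = -(w_q / [F^-1]_qq) * F^-1 e_q, which zeroes w_q
        new_w -= (new_w[q] / diag[q]) * F_inv[:, q]
    new_w[prune_idx] = 0.0                    # guard against numerical residue
    return new_w
```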
Conclusion
  • The decision to remove parameters based on their pruning statistic can be made either independently for every layer, or jointly across the whole network.
  • The latter option automatically adjusts the sparsity distribution across the various layers, given a global sparsity target for the network (see the sketch after this list).
  • The authors hope that the findings can provide further momentum to the investigation of second-order properties of neural networks, and be extended to applications beyond compression
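The difference between the two options can be made concrete with a short sketch: given a per-weight statistic for every layer, "independent" thresholds each layer at the target sparsity on its own, while "joint" (global) thresholds all statistics together, which is what lets the per-layer sparsities differ. The helper below is an illustration under that reading, not the authors' code.

```python
import numpy as np

def select_weights_to_prune(stats, sparsity=0.9, mode="joint"):
    """stats: dict mapping layer name -> 1-D array of pruning statistics.
    Returns a dict of boolean masks (True = prune). Illustrative only."""
    if mode == "independent":
        # Each layer is pruned to the target sparsity separately.
        return {name: s <= np.quantile(s, sparsity) for name, s in stats.items()}
    # "joint": one global threshold over all layers; the per-layer sparsity
    # distribution then falls out of where each layer's statistics lie.
    all_stats = np.concatenate([s.ravel() for s in stats.values()])
    threshold = np.quantile(all_stats, sparsity)
    return {name: s <= threshold for name, s in stats.items()}
```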
Tables
  • Table1: Comparison of WoodFisher gradual pruning results with the state-of-the-art approaches. WoodFisher and Global Magnitude results are averaged over two runs. For their best scores, please refer to Table S4. LS denotes label smoothing, and ERK denotes the Erdos-Renyi Kernel
  • Table2: Comparison with the state-of-the-art DPF [24] in a more commensurate setting, starting from a similarly trained dense baseline. The numbers for Incremental & SNFS are taken from [24]
  • Table3: Comparison of WoodFisher, Global Magnitude, and Magnitude (GMP) pruning for RESNET-50 on IMAGENET. This setup involves skipping the first (input) convolutional layer and pruning the last fully connected layer to 80%; the rest of the layers are pruned to an overall target of 90%. Here, GMP prunes all of them uniformly to 90%, while WoodFisher and Global Magnitude use their obtained sparsity distributions for the intermediate layers at the same overall target of 90%
  • Table4: Comparison of WoodFisher gradual pruning results for MobileNetV1 on ImageNet in the 75% and 90% sparsity regimes. The (α) next to Incremental [19] highlights that the first convolutional and all depthwise convolutional layers are kept dense, unlike in the other shown methods. The obtained sparsity distribution and other details can be found in Section S6.2
  • Table5: WoodFisher and Global Magnitude gradual pruning results, reported with mean and standard deviation across 5 runs, for MobileNetV1 on ImageNet in the 75% and 90% sparsity regimes. We also run a two-sided Student's t-test to check whether WoodFisher significantly outperforms Global Magnitude, which we find to be true at a significance level of α = 0.05
  • Table6: Comparison of WoodFisher (WF) and STR across theoretical FLOP counts for RESNET-50 and MOBILENETV1 on IMAGENET in the 90% and 75% sparsity regimes, respectively
  • Table7: Comparison of inference times at batch sizes 1 and 64 for various sparse models, executed on the framework of [29], on an 18-core Haswell CPU. The table also contains the Top-1 accuracy for the model on the ILSVRC validation set
  • Table8: Comparison of one-shot pruning performance of WoodFisher, when the considered Fisher matrix is empirical Fisher or one-sample approximation to true Fisher, for RESNET-50 on IMAGENET. The results are averaged over three seeds
  • Table9: Detailed hyperparameters for the gradual pruning results presented in Tables 1, 4
  • Table10: Effect of the Fisher subsample size and Fisher mini-batch size on the one-shot pruning performance of WoodFisher, for RESNET-50 on IMAGENET. A chunk size of 1000 was used for this experiment. The results are averaged over three seeds
  • Table11: Effect of the Fisher subsample size and Fisher mini-batch size on the one-shot pruning performance of WoodFisher, for MOBILENETV1 on IMAGENET. A chunk size of 10,000 was used for this experiment. The results are averaged over three seeds
  • Table12: At best runs: Comparison of WoodFisher gradual pruning results with the state-of-the-art approaches. LS denotes label smoothing, and ERK denotes the Erdos-Renyi Kernel
  • Table13: At best runs: comparison of WoodFisher and magnitude pruning with the same layer-wise sparsity targets as used in [18] for RESNET-50 on IMAGENET. Namely, this involves skipping the first convolutional layer, pruning the last fully connected layer to 80%, and the rest of the layers equally to 90%
  • Table14: The obtained distribution of sparsity across the layers by WoodFisher and Global Magnitude when sparsifying RESNET-50 to 80%, 90%, 95%, 98% levels on IMAGENET
  • Table15: The obtained distribution of sparsity across the layers by WoodFisher and Global Magnitude when sparsifying MOBILENETV1 to 75%, 89% levels on IMAGENET
  • Table16: The obtained distribution of sparsity across the layers by WoodFisher when sparsifying MOBILENETV1 to 75.28% sparsity level with FLOPs-aware hyperparameter β = 0.00, 0.30, 0.35 on IMAGENET
Funding
  • This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 805223 ScaleML)
Study subjects and analysis
samples: 100
However, this rests on the assumption that each individual gradient vanishes for well-optimized networks, which we did not find to hold in our experiments. Further, they argue that a large number of samples are needed for the empirical Fisher to serve as a good approximation; in our experiments, we find that a few hundred samples suffice for our applications (e.g. Figure 1).

cases: 3
We choose three layers from different stages of a pre-trained RESNET-20 on CIFAR10, and Figure 2 presents these results. In all three cases, the local quadratic model using WoodFisher predicts an accurate approximation to the actual underlying loss. Further, it is possible to use the dampening λ to control whether a more conservative or relaxed estimate is needed

samples: 16000
Finally, diagonal-Fisher performs worse than magnitude pruning for sparsity levels higher than 30%. This finding was consistent, and so we omit it in the sections ahead. (We used 16,000 samples to estimate the diagonal Fisher, whereas WoodFisher performs well even with 1,000 samples.)

samples: 50000
Figure 5b illustrates these results for the ‘joint’ pruning mode (similar results can be observed in the ‘independent’ mode as well). The number of samples used for estimating the inverse is the same across K-FAC and WoodFisher (i.e., 50,000 samples). This highlights the better approximation quality provided by WoodFisher, which, unlike K-FAC, does not make major assumptions

samples: 1000
For the sake of efficiency, in the case of WoodFisher we utilize 1,000 averaged gradients, each computed over a mini-batch of size 50. But even if no mini-batching is considered and, say, 1,000 individual samples are used for both methods, we notice similar gains over K-FAC (this gradient averaging is sketched below).
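A small sketch of that efficiency trick, under our reading of the text: instead of feeding every per-sample gradient to the Woodbury recursion, gradients are first averaged over mini-batches, so the number of rank-one updates equals the Fisher subsample size (e.g. 1,000) rather than the total number of samples (e.g. 50,000). Names and the gradient source are placeholders.

```python
import numpy as np

def minibatch_averaged_gradients(per_sample_grads, batch_size=50):
    """Average per-sample gradients in groups of `batch_size`, so the
    Fisher-inverse recursion performs one rank-one update per group.
    per_sample_grads: (num_samples, d) array (placeholder for backprop output)."""
    n, d = per_sample_grads.shape
    n_batches = n // batch_size
    trimmed = per_sample_grads[: n_batches * batch_size]
    return trimmed.reshape(n_batches, batch_size, d).mean(axis=1)

# e.g. 50,000 per-sample gradients -> 1,000 averaged gradients for woodfisher_inverse(...)
```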

References
  • [1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [2] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
  • [3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf.
  • [4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. doi: 10.1109/cvpr.2016.90. URL http://dx.doi.org/10.1109/CVPR.2016.90.
  • [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.
  • [6] Alec Radford. Improving language understanding by generative pre-training. 2018.
  • [7] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ArXiv, October 2019. URL https://www.microsoft.com/en-us/research/publication/
  • [8] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan-Kaufmann, 1990. URL http://papers.nips.cc/paper/250-optimal-brain-damage.pdf.
  • [9] Michael C. Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, pages 107–115, 1989.
  • [10] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 164–171. Morgan-Kaufmann, 1993. URL http://papers.nips.cc/paper/647-second-order-derivatives-for-network-pruning-optimal-brain-surgeon.pdf.
  • [11] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033, 2020.
  • [12] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
  • [13] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, 2010.
  • [14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
  • [15] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature, 2015.
  • [16] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions, 2017.
  • [17] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017. ISSN 0027-8424. doi: 10.1073/pnas.1611835114. URL https://www.pnas.org/content/114/13/3521.
  • [18] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks, 2019.
  • [19] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression, 2017.
  • [20] Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with dense networks and Fisher pruning, 2018.
  • [21] Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. EigenDamage: Structured pruning in the Kronecker-factored eigenbasis, 2019.
  • [22] Wenyuan Zeng and Raquel Urtasun. MLPrune: Multi-layer pruning for automated neural network compression, 2019. URL https://openreview.net/forum?id=r1g5b2RcKm.
  • [23] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance, 2019.
  • [24] Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. Dynamic model pruning with feedback. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJem8lSFwB.
  • [25] Max A. Woodbury. Inverting modified matrices. SRG Memorandum Report 42, Princeton, NJ: Department of Statistics, Princeton University, 1950.
  • [26] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. ArXiv, abs/1704.04861, 2017.
  • [27] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [28] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham M. Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. ArXiv, abs/2002.03231, 2020.
  • [29] Neural Magic Inc. Early access signup for the sparse inference engine, 2020. URL https://neuralmagic.com/earlyaccess/.
  • [30] James Martens. New insights and perspectives on the natural gradient method, 2014.
  • [31] Frederik Kunstner, Lukas Balles, and Philipp Hennig. Limitations of the empirical Fisher approximation for natural gradient descent, 2019.
  • [32] Sidak Pal Singh. Efficient second-order methods for model compression. EPFL Master Thesis, 2020. URL http://infoscience.epfl.ch/record/277227.
  • [33] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998. doi: 10.1162/089976698300017746. URL https://doi.org/10.1162/089976698300017746.
  • [34] Nicol N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002. doi: 10.1162/08997660260028683. URL https://doi.org/10.1162/08997660260028683.
  • [35] Valentin Thomas, Fabian Pedregosa, Bart van Merriënboer, Pierre-Antoine Manzagol, Yoshua Bengio, and Nicolas Le Roux. On the interplay between noise and curvature and its effect on optimization and generalization, 2019.
  • [36] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In NIPS 2007, January 2008. URL https://www.microsoft.com/en-us/research/publication/topmoumoute-online-natural-gradient-algorithm/.
  • [37] Xin Dong, Shangyu Chen, and Sinno Jialin Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon, 2017.
  • [38] James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pages 735–742, Madison, WI, USA, 2010. Omnipress. ISBN 9781605589077.
  • [39] Shankar Krishnan, Ying Xiao, and Rif A. Saurous. Neumann optimizer: A practical optimization algorithm for deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkLyJl-0-.
  • [40] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time, 2016.
  • [41] Tom Heskes. On natural learning and pruning in multilayered perceptrons. Neural Computation, 12:881–901, 2000.
  • [42] Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using Kronecker-factored approximations. 2016.
  • [43] Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019. doi: 10.1109/cvpr.2019.01264. URL http://dx.doi.org/10.1109/CVPR.2019.01264.
  • [44] Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers, 2016.
  • [45] James Martens, Jimmy Ba, and Matt Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyMTkQZAb.
  • [46] César Laurent, Thomas George, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. An evaluation of Fisher approximations beyond Kronecker factorization, 2018. URL https://openreview.net/forum?id=ryVC6tkwG.
  • [47] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, 2015.
  • [48] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. ArXiv, abs/1911.11134, 2019.
  • [49] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks, 2017.
  • [50] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization, 2017.
  • [51] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks, 2018.
  • [52] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization, 2019.
  • [53] Mitchell Wortsman, Ali Farhadi, and Mohammad Rastegari. Discovering neural wirings, 2019.
  • [54] Microsoft Corporation. The ONNX Runtime, 2002.
  • [55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.
  • [56] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Comput., 6(1):147–160, January 1994. ISSN 0899-7667. doi: 10.1162/neco.1994.6.1.147. URL https://doi.org/10.1162/neco.1994.6.1.147.
  • [57] Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. Second-order optimization for deep reinforcement learning using Kronecker-factored approximation. In NIPS, pages 5285–5294, 2017. URL http://papers.nips.cc/paper/7112-second-order-optimization-for-deep-reinforcement-learning-using-kronecker-factored-approximation.