Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot

NeurIPS 2020.

TL;DR
We propose several sanity check methods on unstructured pruning methods that test whether the data used in the pruning step and the architecture of the pruned subnetwork are essential for the final performance.

Abstract

Network pruning is a method for reducing test-time computational resource requirements with minimal performance degradation. Conventional wisdom of pruning algorithms suggests that: (1) Pruning methods exploit information from training data to find good subnetworks; (2) The architecture of the pruned network is crucial for good performance. […]
Introduction
  • Deep neural networks have achieved great success in the overparameterized regime [1, 6, 7, 32, 39].
  • These subnetworks can be trained to the same performance as the initial tickets even when they are obtained using corrupted data in the pruning step or by rearranging the initial tickets layerwise.
  • Inspired by the results of these sanity checks, the authors propose to choose a series of simple, data-independent pruning ratios for each layer and to randomly prune each layer to obtain the subnetworks at initialization (a minimal sketch of this idea follows below).
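As a rough illustration of this random-tickets idea, here is a minimal PyTorch sketch, not the authors' implementation; the function name `random_ticket_masks`, the dictionary of per-layer keep-ratios, and the toy usage are assumptions for illustration.

```python
import torch
import torch.nn as nn


def random_ticket_masks(model: nn.Module, keep_ratios: dict) -> dict:
    """Randomly prune each listed layer of a freshly initialized network.

    `keep_ratios` maps layer names to data-independent keep-ratios in (0, 1];
    no training data is used at any point.
    """
    masks = {}
    for name, module in model.named_modules():
        if name not in keep_ratios:
            continue
        weight = module.weight
        n_keep = int(round(keep_ratios[name] * weight.numel()))
        # Choose n_keep positions uniformly at random within this layer.
        perm = torch.randperm(weight.numel(), device=weight.device)
        mask = torch.zeros(weight.numel(), device=weight.device)
        mask[perm[:n_keep]] = 1.0
        masks[name] = mask.view_as(weight)
        with torch.no_grad():
            # Zero out the pruned weights at initialization.
            weight.mul_(masks[name])
    return masks


# Hypothetical usage: prune the two linear layers of a toy network.
net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
masks = random_ticket_masks(net, {"0": 0.3, "2": 0.5})
```

During retraining one would reapply these masks after every update so pruned weights stay at zero; that bookkeeping, and the choice of keep-ratios (e.g., the smart ratio described later), is omitted here.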
Highlights
  • Deep neural networks have achieved great success in the overparameterized regime [1, 6, 7, 32, 39]
  • Experimental results show that our zero-shot random tickets outperform or attain performance similar to all existing “initial tickets”
  • We propose several sanity check methods (Section 3) on unstructured pruning methods that test whether the data used in the pruning step and the architecture of the pruned subnetwork are essential for the final performance
  • We find that one class of pruning methods, those producing “initial tickets” (Section 2.3), hardly exploits any information from the data, because randomly rearranging the preserved weights of their subnetworks within each layer does not affect the final performance
  • These findings inspire us to design a zero-shot data-independent pruning method called “random tickets” which outperforms or attains similar performance compared to initial tickets
  • We identify one existing pruning method that passes our sanity checks, and hybridize the random tickets with this method to propose a new method called “hybrid tickets”, which achieves further improvement
Results
  • Initial tickets: these methods aim to find a subnetwork of the randomly initialized network, which the authors call an “initial ticket”, that can be trained to reach test performance similar to that of the original network.
  • The authors propose a novel layerwise rearranging strategy that keeps the number of preserved weights in each layer but completely destroys, within every layer, the architecture found by the pruning method (see the sketch after this list).
  • The authors' results suggest that the final performance of the retrained “initial tickets” does not drop when corrupted data, including random labels and random pixels, is used in the pruning step.
  • The authors report results for two weaker attacks, layerwise weight shuffling and pruning with half of the dataset: if learning rate rewinding passes these checks, i.e., its performance degrades under these weaker attacks, it will naturally pass the earlier sanity checks with the stronger attacks used in Section 4.1.
  • As the sanity checks on learning rate rewinding suggest, this method truly encodes information from the data into the weights, and the architectures of its pruned subnetworks cannot be randomly changed without a performance drop.
  • The authors propose several sanity check methods (Section 3) on unstructured pruning methods that test whether the data used in the pruning step and the architecture of the pruned subnetwork are essential for the final performance.
  • The authors find that one class of pruning methods, those producing “initial tickets” (Section 2.3), hardly exploits any information from the data, because randomly rearranging the preserved weights of their subnetworks within each layer does not affect the final performance.
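The layerwise rearranging check can be sketched as follows (an illustrative snippet, not the authors' code; `layerwise_rearrange` and the mask dictionary format are assumed for illustration): randomly permute each layer's binary mask so the per-layer count of preserved weights is unchanged while the specific surviving connections are destroyed.

```python
import torch


def layerwise_rearrange(masks: dict) -> dict:
    """Sanity-check attack: shuffle every layer's pruning mask.

    Each layer keeps exactly as many surviving weights as before, but which
    positions survive is re-drawn uniformly at random, destroying the
    within-layer structure found by the pruning method.
    """
    rearranged = {}
    for name, mask in masks.items():
        flat = mask.flatten()
        perm = torch.randperm(flat.numel(), device=flat.device)
        rearranged[name] = flat[perm].view_as(mask)
    return rearranged
```

If retraining with the rearranged masks matches the original accuracy, the within-layer architecture found by the pruning method was not essential, which is what the paper reports for initial tickets.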
Conclusion
  • These findings inspire them to design a zero-shot data-independent pruning method called “random tickets” which outperforms or attains similar performance compared to initial tickets.
  • The authors identify one existing pruning method that passes the sanity checks, and hybridize the random tickets with this method to propose a new method called “hybrid tickets”, which achieves further improvement.
  • The authors' findings bring new insights into rethinking the key factors behind the success of pruning algorithms.
Tables
  • Table 1: Test accuracy of pruned VGG19 and ResNet32 on the CIFAR-10 and CIFAR-100 datasets. In the full paper, bold numbers indicate that the average accuracy lies within the confidence interval of the best result.
  • Table 2: Test accuracy of pruned VGG19 and ResNet32 on the Tiny-ImageNet dataset.
  • Table 3: Sanity checks on partially-trained tickets on the CIFAR-10 dataset.
  • Table 4: Test accuracy of partially-trained tickets and our hybrid tickets for VGG19 and ResNet32 on the CIFAR-10 and CIFAR-100 datasets.
  • Table 5: Ablation study of different keep-ratios on the CIFAR-10 dataset.
  • Table 6: Test accuracy of pruned VGG11/16 and ResNet20/56 on the CIFAR-10 and CIFAR-100 datasets.
  • Table 7: Test accuracy of partially-trained tickets and our hybrid tickets for VGG11/16 and ResNet20/56 on the CIFAR-10 and CIFAR-100 datasets.
Study Subjects and Analysis
Recent papers analyzed: 4
We refer the readers to Figure 1 for an illustration of where we deploy our sanity checks. We then apply our sanity checks to the pruning methods in four recent papers from ICLR 2019 and 2020 [8, 21, 34, 37]. We first classify the subnetworks found by these methods into “initial tickets”, i.e., the weights before retraining are set to the weights at initialization (the same concept as the “winning tickets” in [8]), and “partially-trained tickets”, i.e., the weights before retraining are set to weights taken from the middle of the pretraining process (see Section 4.1 for the formal definition and Figure 2 for an illustration).

References
  • Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 322–332, 2019.
  • Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033, 2020.
  • Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019.
  • Miguel A. Carreira-Perpinán and Yerlan Idelbayev. “Learning-compression” algorithms for neural net pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8532–8541, 2018.
  • Xin Dong, Shangyu Chen, and Sinno Jialin Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. pages 4857–4867, 2017.
  • Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685, 2019.
  • Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.
  • Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
  • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. arXiv preprint arXiv:1912.05671, 2019.
  • Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
  • Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
  • Babak Hassibi, David G. Stork, and Gregory Wolff. Optimal brain surgeon and general network pruning. pages 293–299, 1993.
  • Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 164–171, 1993.
  • Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. Pruning untrained neural networks: Principles and analysis. arXiv preprint arXiv:2002.08797, 2020.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
  • Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 304–320, 2018.
  • Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
  • Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.
  • Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.
  • Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pages 2736–2744, 2017.
  • Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019.
  • Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.
  • Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5058–5066, 2017.
  • Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. arXiv preprint arXiv:2002.00585, 2020.
  • Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference.
  • Ari Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: Generalizing lottery ticket initializations across datasets and optimizers. In Advances in Neural Information Processing Systems, pages 4933–4943, 2019.
  • Michael C. Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, pages 107–115, 1989.
  • Ben Mussay, Margarita Osadchy, Vladimir Braverman, Samson Zhou, and Dan Feldman. Data-independent neural pruning via coresets. In International Conference on Learning Representations, 2020.
  • Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint, 2018.
  • Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, pages 4095–4104, 2018.
  • Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations, 2020.
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Suraj Srinivas and R. Venkatesh Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
  • Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2020.
  • Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. Drawing early-bird tickets: Toward more efficient training of deep networks. In International Conference on Learning Representations, 2020.
  • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  • Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. In Advances in Neural Information Processing Systems, pages 3592–3602, 2019.
  • Michael Zhu and Suyog Gupta. To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.
  • Barret Zoph and Quoc Le. Neural architecture search with reinforcement learning. 2016.
  • Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.

(Tiny-ImageNet can be downloaded from http://cs231n.stanford.edu/tiny-imagenet-200.zip.)
Keep-ratio schemes

  • Smart ratio: set the keep-ratio of layer l proportional to (L − l + 1)² + (L − l + 1), where L is the number of layers, then linearly scale the keep-ratio of each layer so that the total number of retained weights equals a (1 − p) fraction of the original network, where p is the target sparsity (a minimal sketch follows below).
  • Ablation baselines for the keep-ratio:
    1. Ascending keep-ratio: reverse the smart ratio.
    2. Balanced keep-ratio: set the keep-ratio of each layer to 1 − p.
    3. Linear decay: set the keep-ratio of the l-th convolutional layer proportional to L − l + 1.
    4. Cubic decay: set the keep-ratio of the l-th convolutional layer proportional to (L − l + 1)³.
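The smart ratio above lends itself to a short sketch (an illustration only; the function name `smart_ratio` and the capping of ratios at one are assumptions, not necessarily the authors' exact rule):

```python
def smart_ratio(num_params, p):
    """Per-layer keep-ratios for the smart-ratio scheme described above.

    num_params: number of weights in each of the L layers (l = 1 .. L).
    p:          target sparsity; the overall kept fraction should be 1 - p.
    """
    L = len(num_params)
    # Data-independent raw score that decays with depth.
    raw = [(L - l + 1) ** 2 + (L - l + 1) for l in range(1, L + 1)]
    # Linear scaling so the total number of retained weights is (1 - p) of all weights.
    total = sum(num_params)
    kept = sum(r * n for r, n in zip(raw, num_params))
    scale = (1 - p) * total / kept
    # Capping at 1.0 is a simplification; a keep-ratio cannot exceed one.
    return [min(1.0, r * scale) for r in raw]


# Example: five equally sized layers at 90% sparsity.
print(smart_ratio([1000] * 5, p=0.9))  # ~[0.214, 0.143, 0.086, 0.043, 0.014]
```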