# Optimal Lottery Tickets via Subset Sum: Logarithmic Over-Parameterization is Sufficient

NeurIPS 2020

Abstract

The strong lottery ticket hypothesis (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. [<a class="ref-link" id="c1" href="#r1">1</a>] establishes the first theoretical analysis for the strong LTH: one can provably approximate a target network by pruning a polynomially over-parameterized random network. This paper tightens that requirement, showing that a logarithmic factor of over-parameterization suffices.

Introduction

- Many of the recent unprecedented successes of machine learning can be partially attributed to state-of-the-art neural network architectures that come with up to tens of billions of trainable parameters.
- While test accuracy is one of the gold standards for choosing among these architectures, in many applications a “compressed” model is of practical interest, due to its typically reduced energy, memory, and computational footprint [2,3,4,5,6,7,8,9,10,11,12,13,14,15].
- Such a compressed form can be achieved either by modifying the architecture to be leaner in terms of the number of weights, or by starting with a high-accuracy network and pruning it down to one that is sparse in some representation domain, while not sacrificing much of the original network’s accuracy.
- Several of these pruning methods require many rounds of pruning and retraining, resulting in a time-consuming and hard-to-tune iterative meta-algorithm.
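One-shot magnitude pruning is a common instance of the pruning described above. The sketch below is a minimal illustration of the idea (not any specific method from the cited works); the function name and dimensions are made up:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
W_pruned = magnitude_prune(W, 0.75)    # keep only the largest 25% of weights
print(np.count_nonzero(W_pruned))
```

In the iterative meta-algorithms mentioned above, a step like this alternates with retraining; the strong LTH setting studied in this paper instead asks what pruning alone can achieve.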

Highlights

- Many of the recent unprecedented successes of machine learning can be partially attributed to state-of-the-art neural network architectures that come with up to tens of billions of trainable parameters.
- We provide a lower bound for 2-layer networks that matches the upper bound of Theorem 1 up to logarithmic terms in the width. Theorem 2: There exists a 2-layer neural network of width d that cannot be approximated to within error ε by pruning a randomly initialized 2-layer network, unless the random network has width at least Ω(d log(1/ε)).
- We present our results for approximating a target network by pruning a sufficiently over-parameterized neural network.
- We verify our results empirically by approximating a target network via SUBSETSUM in Experiment 1, and by pruning a sufficiently over-parameterized neural network that implements the structures in Figures 1b and 1c in Experiment 2.
- It would be interesting to extend the results to convolutional neural networks.
- As remarked in Malach et al. [1], the strong lottery ticket hypothesis (LTH) implies that pruning an over-parameterized network to obtain good accuracy is NP-hard in the worst case.
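The logarithmic bound rests on a classical subset-sum approximation result due to Lueker [31]. The following paraphrase conveys the idea; the constants and the exact probability bound are illustrative, not the paper's precise statement:

```latex
\begin{lemma}[Subset-sum approximation, paraphrased]
Let $X_1, \dots, X_n$ be i.i.d.\ uniform on $[-1, 1]$ with
$n \ge C \log(1/\epsilon)$ for a universal constant $C$. Then, with high
probability, for every $z \in [-1/2, 1/2]$ there exists a subset
$S \subseteq \{1, \dots, n\}$ such that
\[
  \Bigl|\, z - \sum_{i \in S} X_i \,\Bigr| \le \epsilon .
\]
\end{lemma}
```

Applied weight-by-weight, this is why only logarithmically many random candidate weights are needed per target weight.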

Methods

- The authors verify the results empirically by approximating a target network via SUBSETSUM in Experiment 1, and by pruning a sufficiently over-parameterized neural network that implements the structures in Figures 1b and 1c in Experiment 2.
- The 397,000 weights in the target network were approximated with 3,725,871 coefficients in 21.5 hours on 36 cores of a c5.18xlarge AWS EC2 instance.
- Such a running time is attributed to solving many instances of this nontrivial combinatorial problem.
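The SUBSETSUM approximation in Experiment 1 replaces each target weight with a sum over a chosen subset of random coefficients. The paper solves these instances with Gurobi; the toy brute-force sketch below (all names illustrative) shows the core combinatorial problem on a single weight:

```python
import itertools
import numpy as np

def best_subset_sum(samples, target):
    """Exhaustively search all subsets for the one whose sum is closest to target."""
    best_err, best_subset = abs(target), ()
    for r in range(1, len(samples) + 1):
        for subset in itertools.combinations(samples, r):
            err = abs(target - sum(subset))
            if err < best_err:
                best_err, best_subset = err, subset
    return best_subset, best_err

rng = np.random.default_rng(1)
samples = rng.uniform(-1, 1, size=16).tolist()   # 2^16 subsets: fine for a demo
subset, err = best_subset_sum(samples, target=0.3)
print(f"approximation error: {err:.2e}")
```

Exhaustive search is exponential in the number of coefficients, which is consistent with the long running times reported above; the paper's experiments use an integer-programming solver rather than brute force.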

Results

- The authors approximate a two-layer target network with 500 hidden nodes, obtaining a final test set accuracy of 97.19%.

Conclusion

- In this paper, the authors establish a tight version of the strong lottery ticket hypothesis: there always exist subnetworks of randomly initialized over-parameterized networks that come close to the accuracy of a target network; further, this can be achieved by random networks that are only a logarithmic factor wider than the original network.
- Other interesting structures that arise in neural networks are sparsity and low-rank weight matrices.
- This leads to the question of whether the additional structure in the target network can be leveraged to improve the results.
- An interesting question from a computational point of view is whether the analysis gives insights to improve existing pruning algorithms [26].
- It is an interesting future direction to find efficient pruning algorithms that provably work under mild assumptions on the data.

Summary

## Objectives:

A polynomial over-parameterization requirement does not reflect the findings in Ramanujan et al. [26], which only seem to require constant-factor over-parameterization: e.g., a randomly initialized Wide ResNet-50 can be pruned to a model that has the accuracy of a fully trained ResNet-34.
- The authors' goal is to approximate a target network f(x) by pruning a larger network g(x), where x ∈ R^d0.
- A pruned version of g is obtained by eliminating weights only.
- The authors' objective is to obtain a good approximation while controlling the size of g(·), i.e., the widths of the matrices Mi.
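The pruning objective above can be made concrete with binary masks applied to the random weight matrices. A minimal two-layer sketch follows, with illustrative dimensions and a random mask standing in for an actual pruning procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
d0, width = 8, 32                        # illustrative dimensions

# Randomly initialized over-parameterized network g(x) = W2 @ relu(W1 @ x)
W1 = rng.uniform(-1, 1, size=(width, d0))
W2 = rng.uniform(-1, 1, size=(1, width))

# Pruning = choosing binary masks M1, M2 (random here, for illustration only)
M1 = rng.integers(0, 2, size=W1.shape)
M2 = rng.integers(0, 2, size=W2.shape)

def g_pruned(x: np.ndarray) -> np.ndarray:
    """Forward pass of the pruned network: weights are kept or zeroed,
    never retrained."""
    h = np.maximum((M1 * W1) @ x, 0.0)   # ReLU hidden layer
    return (M2 * W2) @ h

x = rng.uniform(-1, 1, size=d0)
y = g_pruned(x)
print(y.shape)
```

The question studied in the paper is how large `width` must be so that some choice of masks makes g_pruned close to a given target network f.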

Funding

- DP wants to thank Costis Daskalakis and Alex Dimakis for early discussions of the problem over a nice sushi lunch during NeurIPS 2019 in Vancouver, BC.
- This research is supported by an NSF CAREER Award #1844951, a Sony Faculty Innovation Award, an AFOSR & AFRL Center of Excellence Award FA9550-18-1-0166, and an NSF TRIPODS Award #1740707.

References

- Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. Proving the Lottery Ticket Hypothesis: Pruning is All You Need. February 2020.
- Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 2020.
- Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
- Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149 [cs], February 2016.
- Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
- Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pages 2074–2082, 2016.
- Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.
- Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
- Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
- Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.
- Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
- Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.
- Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.
- Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv:1710.09282 [cs], September 2019.
- Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? In Proceedings of Machine Learning and Systems 2020, pages 129–146. 2020.
- Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal Brain Surgeon. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 164–171. Morgan-Kaufmann, 1993.
- Asriel U. Levin, Todd K. Leen, and John E. Moody. Fast Pruning Using Principal Components. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 35–42. Morgan-Kaufmann, 1994.
- Michael C Mozer and Paul Smolensky. Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 107–115. Morgan-Kaufmann, 1989.
- Yann LeCun, John S. Denker, and Sara A. Solla. Optimal Brain Damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan-Kaufmann, 1990.
- Jonathan Frankle and Michael Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In International Conference on Learning Representations, September 2018.
- Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask. arXiv:1905.01067 [cs, stat], March 2020.
- Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear Mode Connectivity and the Lottery Ticket Hypothesis. arXiv:1912.05671 [cs, stat], February 2020.
- Justin Cosentino, Federico Zaiter, Dan Pei, and Jun Zhu. The search for sparse, robust neural networks, 2019.
- R. V. Soelen and J. W. Sheppard. Using winning lottery tickets in transfer learning for convolutional neural networks. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2019.
- Matthia Sabatelli, Mike Kestemont, and Pierre Geurts. On the transferability of winning tickets in non-natural image datasets. arXiv preprint arXiv:2005.05232, 2020.
- Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What’s Hidden in a Randomly Weighted Neural Network? arXiv:1911.13299 [cs], March 2020.
- Yulong Wang, Xiaolu Zhang, Lingxi Xie, Jun Zhou, Hang Su, Bo Zhang, and Xiaolin Hu. Pruning from Scratch. arXiv:1909.12579 [cs], September 2019.
- Richard M. Karp. Reducibility among Combinatorial Problems. In Raymond E. Miller, James W. Thatcher, and Jean D. Bohlinger, editors, Complexity of Computer Computations: Proceedings of a Symposium on the Complexity of Computer Computations, Held March 20–22, 1972., The IBM Research Symposia Series, pages 85–103. Springer US, Boston, MA, 1972.
- Narendra Karmarkar, Richard M. Karp, George S. Lueker, and Andrew M. Odlyzko. Probabilistic Analysis of Optimum Partitioning. Journal of Applied Probability, 23(3):626–645, 1986.
- George S. Lueker. On the Average Difference between the Solutions to Linear and Integer Knapsack Problems. In Ralph L. Disney and Teunis J. Ott, editors, Applied Probability-Computer Science: The Interface, Volume 1, Progress in Computer Science, pages 489–504. Birkhäuser, Boston, MA, 1982.
- George S. Lueker. Exponentially small bounds on the expected optimum of the partition and subset sum problems. Random Structures & Algorithms, 12(1):51–62, 1998.
- Laurent Orseau, Marcus Hutter, and Omar Rivasplata. Logarithmic pruning is all you need, 2020.
- Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. CoRR, abs/1608.03983, 2016.
- Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2020.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.
- Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, 2019.
- Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
