Movement Pruning: Adaptive Sparsity by Fine-Tuning

NeurIPS 2020.


Abstract:

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning ...
Introduction
  • Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art performance in applications in natural language processing and related fields
  • In this setup, a large model pretrained on a massive generic dataset is fine-tuned on a smaller annotated dataset to perform a specific end-task.
  • Movement pruning differs from magnitude pruning in that weights with both low and high values can be pruned if they shrink toward zero during fine-tuning
  • This strategy moves the selection criterion from the 0th order to the 1st order and facilitates greater pruning based on the fine-tuning objective (a minimal comparison of the two criteria is sketched after this list).
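For intuition, here is a minimal PyTorch sketch contrasting the two criteria: magnitude pruning keeps the weights with the largest absolute value, while movement pruning keeps the weights whose accumulated first-order scores indicate they moved away from zero during fine-tuning. This is an illustration under assumed function names, not the authors' released code.

```python
# Sketch contrasting zeroth-order (magnitude) and first-order (movement) selection.
# Assumption: illustrative names; not the authors' exact implementation.
import torch


def magnitude_mask(weight: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the weights with the largest absolute value (0th-order criterion)."""
    k = max(1, int(keep_ratio * weight.numel()))
    threshold = weight.abs().flatten().topk(k).values.min()
    return (weight.abs() >= threshold).float()


def movement_mask(movement_scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the weights with the largest accumulated movement scores (1st-order).

    The scores would be accumulated during fine-tuning, e.g.
        movement_scores -= lr_scores * weight.grad * weight
    so a weight moving away from zero gains score and a shrinking weight loses it.
    """
    k = max(1, int(keep_ratio * movement_scores.numel()))
    threshold = movement_scores.flatten().topk(k).values.min()
    return (movement_scores >= threshold).float()
```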
Highlights
  • Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art performance in applications in natural language processing and related fields
  • Because weight values in transfer learning are mostly predetermined by the pretrained model and only adjusted slightly during fine-tuning, magnitude-based criteria cannot learn to prune based on the fine-tuning step, or “fine-pruning.” In this work, we argue that to effectively reduce the size of models for transfer learning, one should instead use movement pruning, i.e., pruning approaches that consider the changes in weights during fine-tuning
  • Magnitude pruning on SQuAD achieves 54.5 F1 with 3% of the weights compared to 73.6 F1 with L0 regularization, 76.3 F1 for movement pruning, and 79.9 F1 with soft movement pruning
  • We show that a simple method for weight pruning based on straight-through gradients is effective for this task and that it adapts using a first-order importance score
  • We apply this movement pruning to a transformer-based architecture and empirically show that our method consistently yields strong improvements over existing methods in high-sparsity regimes
  • First-order methods show strong performance with less than 15% of remaining weights
  • The analysis demonstrates how this approach adapts to the fine-tuning regime in a way that magnitude pruning cannot
Methods
  • For the masked output a = (W ⊙ M) x, the gradient of L with respect to W_{i,j} is given by the standard gradient derivation: ∂L/∂W_{i,j} = (∂L/∂a_i) · M_{i,j} · x_j (a runnable sketch of such a masked layer follows)
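As a concrete illustration of this masked layer and the straight-through top-k masking it relies on, here is a minimal PyTorch sketch. `TopKMask` and `MaskedLinear` are illustrative names, and this simplifies the method rather than reproducing the authors' implementation.

```python
# Minimal masked linear layer with straight-through top-k masking.
# Assumption: a simplified sketch of the approach, not the authors' exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMask(torch.autograd.Function):
    """Keep the top fraction of scores; pass gradients straight through to S."""

    @staticmethod
    def forward(ctx, scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
        k = max(1, int(keep_ratio * scores.numel()))
        threshold = scores.flatten().topk(k).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # Straight-through estimator: ignore the non-differentiable top-k
        # selection and hand the mask's gradient directly to the scores.
        return grad_output, None


class MaskedLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, keep_ratio: float = 0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)
        self.scores = nn.Parameter(torch.zeros(out_features, in_features))
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = TopKMask.apply(self.scores, self.keep_ratio)
        # a = (W ⊙ M) x, so ∂L/∂W_ij = (∂L/∂a_i) M_ij x_j, and via the
        # straight-through estimator ∂L/∂S_ij = (∂L/∂a_i) W_ij x_j.
        return F.linear(x, self.weight * mask)
```

Both the weights and the scores are learned during fine-tuning, so the importance of a connection is decided by how it moves under the fine-tuning objective rather than by its pretrained magnitude.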
Results
  • Magnitude pruning on SQuAD achieves 54.5 F1 with 3% of the weights compared to 73.6 F1 with L0 regularization, 76.3 F1 for movement pruning, and 79.9 F1 with soft movement pruning.
  • These experiments indicate that in high sparsity regimes, importance scores derived from the movement accumulated during fine-tuning induce significantly better pruned models compared to absolute values
Conclusion
  • The authors consider the case of pruning of pretrained models for task-specific fine-tuning and compare zeroth- and first-order pruning methods.
  • The authors show that a simple method for weight pruning based on straight-through gradients is effective for this task and that it adapts using a first-order importance score
  • The authors apply this movement pruning to a transformer-based architecture and empirically show that the method consistently yields strong improvements over existing methods in high-sparsity regimes.
  • It would be interesting to leverage group-sparsity inducing penalties [Bach et al, 2011] to remove entire columns or filters
  • In this setup, the authors would associate a score to a group of weights.
  • It would give a systematic way to perform feature selection and remove entire columns of the embedding matrix (a hypothetical sketch of such column-level scores follows)
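A hypothetical sketch of this structured variant is below: assigning one score per input column means a single pruning decision removes a whole column at once. The `ColumnMaskedLinear` name and every detail here are assumptions, since the paper only suggests this direction as future work.

```python
# Hypothetical column-level variant: one learned score per input column.
# Assumption: illustrative only; this extension is not implemented in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ColumnMaskedLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, keep_ratio: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)
        # One importance score per input column instead of one per weight.
        self.col_scores = nn.Parameter(torch.zeros(in_features))
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = max(1, int(self.keep_ratio * self.col_scores.numel()))
        threshold = self.col_scores.topk(k).values.min()
        hard = (self.col_scores >= threshold).float()
        # Straight-through: the forward pass uses the hard 0/1 column mask, while
        # the backward pass routes the mask's gradient to col_scores.
        col_mask = hard + self.col_scores - self.col_scores.detach()
        return F.linear(x, self.weight * col_mask)
```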
Tables
  • Table1: Summary of the pruning methods considered in this work and their specificities. The expression of f of L0 regularization is detailed in Eq (3)
  • Table2: Performance at high sparsity levels. (Soft) movement pruning outperforms current state-of-the-art pruning methods at different high sparsity levels
  • Table3: Distillation-augmented performances for selected high sparsity levels. All pruning methods benefit from the distillation signal, further improving the performance-versus-model-size trade-off
Related work
  • In addition to magnitude pruning, there are many other approaches for generic model weight pruning. Most similar to our approach are methods that use parallel score matrices to augment the weight matrices [Mallya and Lazebnik, 2018, Ramanujan et al, 2020], which have been applied to convolutional networks. Unlike our method, these approaches keep the weights of the model fixed (either from a randomly initialized network or a pre-trained network) and update only the scores to find a good sparse subnetwork.

    Many previous works have also explored using higher-order information to select prunable weights. LeCun et al [1989] and Hassibi et al [1993] leverage the Hessian of the loss to select weights for deletion. Our method does not require the (possibly costly) computation of second-order derivatives, since the importance scores are obtained simply as a by-product of standard fine-tuning. Theis et al [2018] and Ding et al [2019] use the absolute value or the square of the gradient. In contrast, we found it useful to preserve the direction of movement in our algorithm; the accumulated score is written out below.
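To make the direction of movement concrete, the importance score accumulated by the straight-through training over T gradient steps has (up to the exact constants used in the paper) the form

$$ S_{i,j}^{(T)} \approx -\alpha_S \sum_{t<T} \left(\frac{\partial \mathcal{L}}{\partial W_{i,j}}\right)^{(t)} W_{i,j}^{(t)}, $$

so the score grows only when the gradient and the weight have opposite signs, i.e., when fine-tuning pushes W_{i,j} away from zero; taking the absolute value or the square of the gradient would discard exactly this sign information.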
Findings
  • In highly sparse regimes (less than 15% of remaining weights), we observe significant improvements over magnitude pruning and other 1st-order methods such as L0 regularization [Louizos et al, 2017]
  • The comparison between magnitude and movement pruning is consistent: at low sparsity (more than 70% of remaining weights), magnitude pruning outperforms all methods with little or no loss with respect to the dense model, whereas the performance of movement pruning methods decreases quickly even at low sparsity levels
  • First-order methods show strong performance with less than 15% of remaining weights
  • Despite being able to find a global sparsity structure, we found that global selection did not significantly outperform local (per-matrix) selection, except in high sparsity regimes (a 2.3 F1 point difference at 3% of remaining weights for movement pruning); a sketch of the two selection schemes follows this list
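The local-versus-global distinction is simply where the top-k threshold is computed: per weight matrix, or over all scores at once. A minimal sketch under assumed helper names (not the authors' code):

```python
# Sketch of local vs. global top-k selection over a dict of score matrices.
# Assumption: illustrative helpers, not the authors' implementation.
from typing import Dict

import torch


def local_masks(scores: Dict[str, torch.Tensor], keep_ratio: float) -> Dict[str, torch.Tensor]:
    """Local selection: keep the same fraction of weights inside each matrix."""
    masks = {}
    for name, s in scores.items():
        k = max(1, int(keep_ratio * s.numel()))
        threshold = s.flatten().topk(k).values.min()
        masks[name] = (s >= threshold).float()
    return masks


def global_masks(scores: Dict[str, torch.Tensor], keep_ratio: float) -> Dict[str, torch.Tensor]:
    """Global selection: rank all scores together, so sparsity can vary per matrix."""
    all_scores = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(keep_ratio * all_scores.numel()))
    threshold = all_scores.topk(k).values.min()
    return {name: (s >= threshold).float() for name, s in scores.items()}
```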
Reference
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683, 2019.
  • Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. In ACL, 2019.
  • Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
  • Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In ICLR, 2016.
  • Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In NIPS, 2016.
  • Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. ArXiv, abs/1902.09574, 2019.
  • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. The lottery ticket hypothesis at scale. ArXiv, abs/1903.01611, 2019.
  • Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. ArXiv, abs/1308.3432, 2013.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. In ICLR, 2017.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL, 2018.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
  • Arun Mallya and Svetlana Lazebnik. Piggyback: Adding multiple tasks to a single, fixed network by learning to mask. ArXiv, abs/1801.06519, 2018.
  • Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What’s hidden in a randomly weighted neural network? In CVPR, 2020.
  • Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In NIPS, 1989.
  • Babak Hassibi, David G. Stork, and Gregory J. Wolff. Optimal brain surgeon: Extensions and performance comparisons. In NIPS, 1993.
  • Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with dense networks and Fisher pruning. ArXiv, abs/1801.05787, 2018.
  • Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Ji Liu, and Jungong Han. Global sparse momentum SGD for pruning very deep neural networks. In NeurIPS, 2019.
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In NeurIPS EMC2 Workshop, 2019.
  • Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling task-specific knowledge from BERT into simple neural networks. ArXiv, abs/1903.12136, 2019.
  • Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. In ICLR, 2020a.
  • Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In NeurIPS, 2019.
  • Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S. Morcos. Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP. In ICLR, 2020.
  • Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin. Training with quantization noise for extreme model compression. ArXiv, abs/2004.07320, 2020b.
  • Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8BERT: Quantized 8bit BERT. ArXiv, abs/1910.06188, 2019.
  • Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev. Compressing deep convolutional networks using vector quantization. ArXiv, abs/1412.6115, 2014.
  • Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. Train large, then compress: Rethinking model size for efficient training and inference of transformers. ArXiv, abs/2002.11794, 2020.
  • Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. In ICLR, 2018.
  • Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. Compressing BERT: Studying the effects of weight pruning on transfer learning. ArXiv, abs/2002.08307, 2020.
  • Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in natural language processing. In NAACL, 2019.
  • Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019.
  • Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First Quora dataset release: Question pairs, 2017. URL https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
  • Fu-Ming Guo, Sijia Liu, Finlay S. Mungall, Xue Lian Lin, and Yanzhi Wang. Reweighted proximal pruning for large-scale language representation. ArXiv, abs/1909.12486, 2019.
  • Neal Parikh and Stephen P. Boyd. Proximal algorithms. Found. Trends Optim., 1:127–239, 2014.
  • Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: The impact of student initialization on knowledge distillation. ArXiv, abs/1908.08962, 2019.
  • Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006.
  • Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS, 2014.
  • TinyBERT: Distilling BERT for natural language understanding. ArXiv, abs/1909.10351, 2019.
  • Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
  • Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity through convex optimization. Statistical Science, 27, 09 2011. doi: 10.1214/12-STS394.
  • As the scores are updated, the relative order of the importances is likely shuffled, and some connections will be replaced by more important ones. Under certain conditions, we are able to formally prove that as these replacements happen, the training loss is guaranteed to decrease. Our proof is adapted from [Ramanujan et al., 2020] to consider the case of fine-tunable W.
  • We note that this proof is not specific to the TopK masking function. In fact, we can extend the proof using the Threshold masking function M := (S ≥ τ) [Mallya and Lazebnik, 2018]. Inequalities (6) are still valid and the proof stays unchanged (a drop-in sketch of this thresholding variant follows).
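As a drop-in illustration of that Threshold masking alternative (a sketch under the same assumptions as the earlier `MaskedLinear` example; τ is a hyperparameter here, not a value from the paper):

```python
# Threshold masking M := (S >= tau) as a drop-in replacement for top-k selection.
# Assumption: illustrative sketch, not the authors' exact code.
import torch


class ThresholdMask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores: torch.Tensor, tau: float) -> torch.Tensor:
        return (scores >= tau).float()

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # Straight-through: gradients flow to the scores unchanged; tau gets none.
        return grad_output, None
```

Swapping `TopKMask.apply(scores, keep_ratio)` for `ThresholdMask.apply(scores, tau)` in the earlier `MaskedLinear` sketch leaves the rest of the layer unchanged, mirroring the note above that the argument is not specific to the TopK masking function.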