DynaBERT: Dynamic BERT with Adaptive Width and Depth

Abstract

The pre-trained language models like BERT and RoBERTa, though powerful in many natural language processing tasks, are both computationally and memory expensive. To alleviate this problem, one approach is to compress them for specific tasks before deployment. However, recent works on BERT compression usually reduce the large BERT model to …

Introduction
  • Pre-trained language models based on the Transformer [22] structure like BERT [8] and RoBERTa [14] have achieved remarkable results on natural language processing tasks.
  • These models have a large number of parameters, which hinders their deployment on edge devices with limited storage, computation, and energy.
  • Note that unless otherwise specified, the BERT model mentioned in this paper refers to a task-specific BERT rather than the pre-trained model.
Highlights
  • Pre-trained language models based on the Transformer [22] structure like BERT [8] and RoBERTa [14] have achieved remarkable results on natural language processing tasks
  • We propose a novel dynamic BERT, or DynaBERT for short, which can be executed at different widths and depths for specific tasks (see the selection sketch after this list)
  • We extensively evaluated the effectiveness of our proposed dynamic BERT on the General Language Understanding Evaluation (GLUE) benchmark under various efficiency constraints (#parameters, FLOPs, inference speed), using both BERTBASE and RoBERTaBASE as the backbone models
  • We evaluate the efficacy of our proposed DynaBERT and DynaRoBERTa under different efficiency constraints, including #parameters, FLOPs, the latency on NVIDIA K40 GPU and on Kirin 810 A76 ARM CPU
  • Under the same depth and width, sub-networks from DynaRoBERTa perform significantly better than those from DynaBERT most of the time
  • Experiments on various tasks show that under the same efficiency constraint, sub-networks extracted from the proposed DynaBERT consistently achieve better performance than the other BERT compression methods
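
To make "executed at different widths and depths under an efficiency constraint" concrete, below is a minimal sketch, not from the paper, of how a deployed model could pick a sub-network at run time; the helper name `select_subnet`, the `profile` dict, and the latency values are placeholders rather than measured results.

```python
# Minimal sketch (assumed helper, not the paper's code): pick the largest
# (m_w, m_d) sub-network whose measured latency fits a budget.

def select_subnet(profile, budget_ms):
    """profile: {(m_w, m_d): latency_ms measured on the target device};
    returns the feasible configuration with the most computation, or None."""
    feasible = [(cfg, lat) for cfg, lat in profile.items() if lat <= budget_ms]
    if not feasible:
        return None
    # Prefer the configuration with the largest width x depth product.
    return max(feasible, key=lambda item: item[0][0] * item[0][1])[0]

# Example with placeholder measurements (not real benchmark numbers):
profile = {(1.0, 1.0): 40.0, (0.75, 1.0): 31.0, (0.5, 0.75): 18.0, (0.25, 0.5): 8.0}
print(select_subnet(profile, budget_ms=20.0))   # -> (0.5, 0.75)
```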
Methods
  • For DynaBERT_W (Section 3.1), the authors rewire the network only once before training, and then train by alternating over four different width multipliers (a sketch of importance-based rewiring follows this list).
  • Instead of rewiring the network only once before training, “progressive rewiring” progressively rewires the network as more width multipliers are supported throughout the training.
  • For four width multipliers [1.0, 0.75, 0.5, 0.25], progressive rewiring first sorts the attention heads and neurons and rewires the corresponding connections before training to support width multipliers [1.0, 0.75].
  • Comparing with Table 3, progressive rewiring shows no significant gain over rewiring only once
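
The rewiring step is easier to see in code. Below is a minimal sketch, not the authors' released implementation, of reordering one layer's attention heads by an importance score so that keeping the first ⌈m_w · N_H⌉ heads retains the most important ones; the function names and the nn.Linear-style weight layout (out_features × in_features) are assumptions.

```python
import torch

def rewire_heads(wq, wk, wv, wo, head_importance, head_dim):
    """Permute MHA weights so heads are ordered by descending importance.

    wq, wk, wv: (d, d) weights whose output dimension is split into heads.
    wo:         (d, d) output projection whose input dimension is split into heads.
    head_importance: (num_heads,) scores, e.g. an accumulated
    |gradient x activation| proxy. Biases, if present, would be permuted
    with the same index; FFN neurons are reordered analogously.
    """
    order = torch.argsort(head_importance, descending=True)
    # hidden-unit indices grouped per head, listed in the new head order
    idx = torch.cat([torch.arange(h * head_dim, (h + 1) * head_dim)
                     for h in order.tolist()])
    return wq[idx, :], wk[idx, :], wv[idx, :], wo[:, idx]

def slice_width(wq, m_w, num_heads, head_dim):
    """After rewiring, a width-m_w sub-network simply keeps the first heads."""
    kept = int(round(m_w * num_heads)) * head_dim
    return wq[:kept, :]
```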
Results
  • Results on the development set are used for evaluation.
  • The authors evaluate the efficacy of the proposed DynaBERT and DynaRoBERTa under different efficiency constraints, including #parameters, FLOPs, the latency on NVIDIA K40 GPU and on Kirin 810 A76 ARM CPU.
  • If the parameters in layer normalization and the linear-layer biases, which are negligible, are not counted, the Transformer layers are compressed by a rate of nearly m_w × m_d (a worked parameter count follows this list)
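
As a sanity check on the m_w × m_d rate, here is a back-of-the-envelope parameter count (my own arithmetic, not taken from the paper) using the standard BERT_BASE shapes: 12 Transformer layers, hidden size d = 768, and FFN intermediate size 4d.

```python
L, d = 12, 768   # BERT_BASE: 12 Transformer layers, hidden size 768

def transformer_params(m_w=1.0, m_d=1.0):
    # Per layer: 4 projection matrices in MHA (Q, K, V, O) and 2 FFN matrices.
    # The width multiplier m_w shrinks the number of heads and FFN neurons,
    # so one side of every matrix scales by m_w; m_d drops whole layers.
    mha = 4 * d * int(m_w * d)          # four d x (m_w * d) matrices
    ffn = 2 * d * int(m_w * 4 * d)      # d x (m_w * 4d) and (m_w * 4d) x d
    return int(round(m_d * L)) * (mha + ffn)

full = transformer_params()
small = transformer_params(m_w=0.5, m_d=0.5)
print(small / full)   # ~0.25, i.e. close to m_w * m_d = 0.5 * 0.5
```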
Conclusion
  • 5.1 Comparison of Conventional Distillation and Inplace Distillation

    To train width-adaptive CNNs, inplace distillation is used in [27] to boost performance.
  • The authors adapt inplace distillation to train DynaBERT_W and compare it with the conventional distillation used in Section 3.1. For inplace distillation, the loss for each student sub-network is the distillation loss over logits, embeddings, and hidden states with the teacher, here the widest sub-network (see the sketch after this list).
  • The loss of the teacher is the distillation loss over logits, embeddings, and hidden states with a fixed fine-tuned task-specific BERT.
  • Conclusion and Future Work: this work focuses on training a dynamic BERT on specific tasks; in the future, the authors would like to apply the proposed method to the pre-training stage
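
To make the two objectives concrete, here is a minimal sketch of a distillation loss over logits, embeddings, and hidden states; the dictionary layout, the loss weights, and the soft cross-entropy for logits are assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def distillation_loss(student, teacher, w_logit=1.0, w_emb=1.0, w_hid=1.0):
    """Distillation loss over logits, embeddings, and per-layer hidden states.

    `student` and `teacher` are dicts with keys 'logits' (B, C),
    'embedding' (B, T, d), and 'hidden' (list of (B, T, d) tensors);
    teacher tensors are assumed to be detached.
    """
    # soft cross-entropy between student and teacher logits
    logit_loss = -(F.softmax(teacher["logits"], dim=-1)
                   * F.log_softmax(student["logits"], dim=-1)).sum(-1).mean()
    emb_loss = F.mse_loss(student["embedding"], teacher["embedding"])
    hid_loss = sum(F.mse_loss(hs, ht)
                   for hs, ht in zip(student["hidden"], teacher["hidden"]))
    return w_logit * logit_loss + w_emb * emb_loss + w_hid * hid_loss

# Conventional distillation: every sub-network is distilled from a fixed,
# fine-tuned task-specific BERT.
# Inplace distillation: narrower sub-networks are distilled from the widest
# sub-network's detached outputs, while the widest sub-network itself is
# distilled from the fixed fine-tuned BERT.
```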
Tables
  • Table 1: Results on the development set of our proposed DynaBERT and DynaRoBERTa with different width and depth multipliers (m_w, m_d). The highest accuracy among 12 different configurations in each block is highlighted
  • Table 2: Results on the test set of our proposed DynaBERT and DynaRoBERTa. Note that the evaluation metric for QQP and MRPC here is “F1”
  • Table 3: Ablation study in the training of DynaBERT_W. Results on the development set are reported. The highest average accuracy over four width multipliers is highlighted
  • Table 4: Ablation study in the training of DynaBERT. Results on the development set are reported
  • Table 5: Comparison of results on the development set between conventional distillation and inplace distillation. For DynaBERT_W, the average accuracy over four width multipliers is reported. For DynaBERT, the average accuracy over four width multipliers and three depth multipliers is reported. The higher accuracy in each group is highlighted
  • Table 6: Results on the development set of the GLUE benchmark using progressive rewiring in training
  • Table 7: Results on the development set of the GLUE benchmark using universally slimmable training in training DynaBERT_W
  • Table 8: Hyperparameters for different stages in training DynaBERT and DynaRoBERTa on the GLUE benchmark
Related work
  • In this section, we first describe the formulation of the Transformer layer in BERT. Then we briefly review related work on the compression of Transformer-based models.

    2.1 Transformer Layer

    The BERT model is built with Transformer encoder layers [22], which capture long-term dependencies between input tokens via the self-attention mechanism. Specifically, a standard Transformer layer contains a Multi-Head Attention (MHA) layer and a Feed-Forward Network (FFN).

    For the t-th Transformer layer, suppose its input is X ∈ ℝ^{n×d}, where n and d are the sequence length and hidden state size. Suppose there are N_H attention heads in each layer, with head h parameterized by W_h^Q, W_h^K, W_h^V, W_h^O ∈ ℝ^{d×d_h}, where d_h = d/N_H, and the output of head h computed as

    Attn_{W_h^Q, W_h^K, W_h^V, W_h^O}(X) = Softmax(X W_h^Q (X W_h^K)^⊤ / √d_h) X W_h^V (W_h^O)^⊤.
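
For reference, here is a minimal PyTorch sketch of the per-head computation above, together with how a width multiplier m_w would keep only the first ⌈m_w · N_H⌉ (importance-ordered) heads; the function names and data layout are assumptions.

```python
import math
import torch

def attention_head(X, Wq, Wk, Wv, Wo):
    """One head: Softmax(X Wq (X Wk)^T / sqrt(d_h)) X Wv Wo^T.

    X: (n, d) input; Wq, Wk, Wv, Wo: (d, d_h) head parameters.
    Returns an (n, d) tensor; the MHA output is the sum over heads.
    """
    d_h = Wq.shape[1]
    scores = (X @ Wq) @ (X @ Wk).T / math.sqrt(d_h)          # (n, n)
    return torch.softmax(scores, dim=-1) @ (X @ Wv) @ Wo.T   # (n, d)

def mha(X, heads, m_w=1.0):
    """heads: list of (Wq, Wk, Wv, Wo) tuples, assumed ordered by importance."""
    kept = math.ceil(m_w * len(heads))
    return sum(attention_head(X, *h) for h in heads[:kept])
```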
Funding
  • Under the same depth and width, sub-networks from DynaRoBERTa perform significantly better than those from DynaBERT most of the time
  • Comparing with Table 3, there is no significant difference between universally slimmable training and the training scheme used in Algorithm 2
Study subjects and analysis
data sets: 4
The DynaBERT trained without knowledge distillation, data augmentation, and final fine-tuning is called “vanilla DynaBERT”. From Table 4, with knowledge distillation and data augmentation, the average accuracy at smaller depths is significantly improved over the vanilla counterpart on all four data sets. Additional fine-tuning further improves the average accuracy for all three depth multipliers on SST-2 and CoLA and for two of them on RTE, but harms the performance on MRPC

Reference
  • S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, pages 688–699, 2019.
  • A. Bhandare, V. Sripathi, D. Karkada, V. Menon, S. Choi, K. Datta, and V. Saletore. Efficient 8-bit quantization of transformer neural machine language translation model. Preprint arXiv:1906.00532, 2019.
  • A. Bie, B. Venkitesh, J. Monteiro, M. Haidar, M. Rezagholizadeh, et al. Fully quantizing a simplified transformer for end-to-end speech recognition. Preprint arXiv:1911.03604, 2019.
  • H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han. Once for all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations, 2020.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, 2019.
  • B. Cui, Y. Li, M. Chen, and Z. Zhang. Fine-tune BERT with sparse self-attention mechanism. In Conference on Empirical Methods in Natural Language Processing, 2019.
  • M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser. Universal transformers. In International Conference on Learning Representations, 2019.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019.
  • M. Elbayad, J. Gu, E. Grave, and M. Auli. Depth-adaptive transformer. In International Conference on Learning Representations, 2020.
  • A. Fan, E. Grave, and A. Joulin. Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations, 2019.
  • M. A. Gordon, K. Duh, and N. Andrews. Compressing BERT: Studying the effects of weight pruning on transfer learning. Preprint arXiv:2002.08307, 2020.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu. TinyBERT: Distilling BERT for natural language understanding. Preprint arXiv:1909.10351, 2019.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. Preprint arXiv:1907.11692, 2019.
  • X. Ma, P. Zhang, S. Zhang, N. Duan, Y. Hou, D. Song, and M. Zhou. A tensorized transformer for language modeling. In Advances in Neural Information Processing Systems, 2019.
  • J. S. McCarley. Pruning a BERT-based question answering model. Preprint arXiv:1910.06360, 2019.
  • P. Michel, O. Levy, and G. Neubig. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems, pages 14014–14024, 2019.
  • P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, 2017.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Preprint arXiv:1910.01108, 2019.
  • S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI Conference on Artificial Intelligence, 2020.
  • S. Sun, Y. Cheng, Z. Gan, and J. Liu. Patient knowledge distillation for BERT model compression. In Conference on Empirical Methods in Natural Language Processing, pages 4314–4323, 2019.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Annual Conference of the Association for Computational Linguistics, 2019.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019.
  • W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Preprint arXiv:2002.10957, 2020.
  • J. Yu and T. S. Huang. Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers. Preprint arXiv:1903.11728, 2019.
  • J. Yu and T. S. Huang. Universally slimmable networks and improved training techniques. In IEEE International Conference on Computer Vision, pages 1803–1811, 2019.
  • J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang. Slimmable neural networks. In International Conference on Learning Representations, 2018.
  • O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat. Q8BERT: Quantized 8bit BERT. Preprint arXiv:1910.06188, 2019.