TernaryBERT: Distillation-aware Ultra-low Bit BERT

EMNLP 2020, pp. 509–521.


Abstract

Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are both computation and memory expensive, hindering their deployment to resource-constrained devices. In this work, we propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model using both approximation-based and loss-aware ternarization methods, and leverages knowledge distillation to reduce the accuracy drop caused by quantization.

Introduction
  • Transformer-based models have shown great power in various natural language processing (NLP) tasks.
  • The BERT-base model has 109M parameters, which amounts to a model size of over 400 MB in 32-bit floating-point format, making inference expensive in both computation and memory
  • This poses great challenges for these models to run on resource-constrained devices like cellphones.
  • In (Prato et al, 2019; Zafrir et al, 2019), 8-bit quantization is successfully applied to Transformer-based models with performance comparable to the full-precision baseline.
  • Mixed-precision quantization is unfriendly to some hardware, and product quantization (PQ) requires extra clustering operations
Highlights
  • Transformer-based models have shown great power in various natural language processing (NLP) tasks
  • Instead of directly using knowledge distillation to compress a model, we use it to improve the performance of a ternarized student model that has the same size as the teacher model
  • We investigate the ternarization granularity of different parts of the BERT model, and apply various distillation losses to improve the performance of TernaryBERT
  • We proposed to use approximation-based and loss-aware ternarization to ternarize the weights in the BERT model, with different granularities for the word embedding and the weights in the Transformer layers (a minimal code sketch of the approximation-based method follows this list)
  • Distillation is used to reduce the accuracy drop caused by lower capacity due to quantization
  • Empirical experiments show that the proposed TernaryBERT outperforms state-of-the-art BERT quantization methods and even performs comparably to the full-precision BERT
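The approximation-based ternarization referred to above follows the Ternary Weight Networks (TWN) formulation (Li et al, 2016): a weight matrix W is approximated by alpha * b, where alpha is a positive scale and b has entries in {-1, 0, +1}. The PyTorch sketch below illustrates this idea only; it is not the authors' implementation, and the function name and the row-wise option for the word embedding are illustrative assumptions based on the granularities described in the paper.

    import torch

    def twn_ternarize(w: torch.Tensor, dim=None):
        """Approximate w by alpha * b with b in {-1, 0, +1} (TWN-style).

        dim=None -> one scale for the whole matrix (layer-wise, Transformer weights)
        dim=1    -> one scale per row (row-wise, word embedding)
        """
        if dim is None:
            delta = 0.7 * w.abs().mean()                 # threshold from Li et al. (2016)
            mask = (w.abs() > delta).float()
            alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
        else:
            delta = 0.7 * w.abs().mean(dim=dim, keepdim=True)
            mask = (w.abs() > delta).float()
            alpha = (w.abs() * mask).sum(dim=dim, keepdim=True) / \
                    mask.sum(dim=dim, keepdim=True).clamp(min=1.0)
        b = torch.sign(w) * mask                         # zero out weights below the threshold
        return alpha, b

    # Example: layer-wise ternarization of a (hypothetical) Transformer weight matrix
    # and row-wise ternarization of a word-embedding matrix. The quantized forward
    # pass uses alpha * b, while training updates the latent full-precision weights.
    w_ffn = torch.randn(768, 3072)
    alpha_w, b_w = twn_ternarize(w_ffn)
    emb = torch.randn(30522, 768)
    alpha_e, b_e = twn_ternarize(emb, dim=1)
    w_hat = alpha_w * b_w

The loss-aware variant (LAT) keeps the same alpha * b parameterization but chooses alpha and b by approximately minimizing the training loss rather than the weight-approximation error.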
Methods
  • Fragment of a comparison table (see Table 8): the compared methods include TinyBERT-4L, ALBERT-E64/E128/E256/E768, LayerDrop-6L/3L and PQ, with columns for W-E-A (#bits), size (MB) and accuracy (%); the only values recoverable here are two quantized settings, 2/4-8-8 (53 MB) and 2/3-8-8 (46 MB)
Results
  • Results on BERT and TinyBERT

    Table 1 shows the development set results on the GLUE benchmark.
  • When the number of bits for the weights increases to 8, the performance of all quantized models improves greatly and is even comparable to the full-precision baseline, which indicates that the setting ‘8-8-8’ is not challenging for BERT.
  • TernaryBERT significantly outperforms Q-BERT and Q2BERT, and is even comparable to the full-precision baseline.
  • For this task, LAT performs slightly better than TWN.
Conclusion
  • The authors proposed to use approximation-based and loss-aware ternarization to ternarize the weights in the BERT model, with different granularities for the word embedding and the weights in the Transformer layers.
  • Distillation is used to reduce the accuracy drop caused by lower capacity due to quantization.
Tables
  • Table1: Development set results of quantized BERT and TinyBERT on the GLUE benchmark. We abbreviate the number of bits for weights of Transformer layers, word embedding and activations as “W-E-A (#bits)”
  • Table2: Test set results of the proposed quantized BERT and TinyBERT on the GLUE benchmark
  • Table3: Development set results on SQuAD
  • Table4: Development set results of TernaryBERT-TWN with different ternarization granularities on the weights in the Transformer layers and the word embedding
  • Table5: Comparison of symmetric 8-bit and min-max 8-bit activation quantization methods on SQuAD v1.1 (a minimal sketch of both schemes follows this list)
  • Table6: Effects of knowledge distillation on the Transformer layers and logits on TernaryBERT-TWN. “-Trm-logits” means we use cross-entropy loss w.r.t. the ground-truth labels as the training objective
  • Table7: Effects of data augmentation and initialization
  • Table8: Comparison between the proposed method and other compression methods on MNLI-m. Note that Quant-Noise uses Product Quantization (PQ) and does not have a specific number of bits for each value
  • Table9: Comparison between TernaryBERT and mixed-precision Q-BERT
  • Table10: Development set results of 3-bit quantized BERT and TinyBERT on the GLUE benchmark
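For Table 5, the two 8-bit activation quantization schemes being compared can be sketched as follows: symmetric quantization derives a single scale from the maximum absolute value (so zero is represented exactly), while min-max quantization maps the observed [min, max] range onto the unsigned 8-bit grid. This is a generic illustration of the two schemes under these assumptions, not the authors' implementation.

    import torch

    def symmetric_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
        """Symmetric uniform quantization: scale from max(|x|), zero kept exact."""
        qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
        scale = x.abs().max().clamp(min=1e-8) / qmax
        q = torch.round(x / scale).clamp(-qmax, qmax)
        return q * scale                                # dequantized ("fake-quantized") values

    def minmax_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
        """Min-max (asymmetric) uniform quantization over [min(x), max(x)]."""
        qmax = 2 ** bits - 1                            # 255 for 8 bits
        x_min, x_max = x.min(), x.max()
        scale = (x_max - x_min).clamp(min=1e-8) / qmax
        q = torch.round((x - x_min) / scale).clamp(0, qmax)
        return q * scale + x_min

Which scheme works better depends on how skewed the activation distribution is; Table 5 reports the empirical comparison on SQuAD v1.1.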
Related Work
  • 2.1 Knowledge Distillation

    Knowledge distillation was first proposed in (Hinton et al, 2015) to transfer knowledge in the logits from a large teacher model to a more compact student model without sacrificing too much performance. It has recently achieved remarkable performance in NLP (Kim and Rush, 2016; Jiao et al, 2019). Besides the logits (Hinton et al, 2015), knowledge from the intermediate representations (Romero et al, 2014; Jiao et al, 2019) and attentions (Jiao et al, 2019; Wang et al, 2020) is also used to guide the training of a smaller BERT.

    Instead of directly being used for compression, knowledge distillation can also be combined with other compression methods such as pruning (McCarley, 2019; Mao et al, 2020), low-rank approximation (Mao et al, 2020) and dynamic networks (Hou et al, 2020), to fully leverage the knowledge of the teacher BERT model. Although combining quantization and distillation has been explored for convolutional neural networks (CNNs) (Polino et al, 2018; Stock et al, 2020; Kim et al, 2019), using knowledge distillation to train a quantized BERT has not been studied. Compared with CNNs, which simply perform convolution in each layer, the BERT model is more complicated, with each Transformer layer containing both a Multi-Head Attention mechanism and a position-wise Feed-forward Network. Thus the knowledge that can be distilled from a BERT model is also much richer (Jiao et al, 2019; Wang et al, 2020).
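To make the kinds of distilled knowledge above concrete, the sketch below combines an MSE loss over per-layer hidden states and attention scores with a soft cross-entropy loss over the logits, which is the general recipe for training a quantized student from a full-precision teacher. It is a minimal illustration assuming the student and teacher each expose lists of hidden states and attention scores plus logits; the dictionary keys and function name are hypothetical.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student, teacher):
        """Transformer-layer distillation (hidden states + attention scores)
        plus logits distillation with a soft cross-entropy."""
        # MSE between per-layer hidden states and between per-layer attention scores.
        loss_trm = sum(F.mse_loss(h_s, h_t)
                       for h_s, h_t in zip(student["hidden_states"], teacher["hidden_states"]))
        loss_trm = loss_trm + sum(F.mse_loss(a_s, a_t)
                                  for a_s, a_t in zip(student["attentions"], teacher["attentions"]))

        # Soft cross-entropy between student logits and the teacher's soft labels.
        loss_pred = -(F.softmax(teacher["logits"], dim=-1) *
                      F.log_softmax(student["logits"], dim=-1)).sum(dim=-1).mean()

        return loss_trm + loss_pred

In this setting the teacher is the full-precision fine-tuned BERT and the student is its quantized counterpart of the same architecture, so the distillation objective replaces the ordinary task loss during quantization-aware training.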
Funding
  • Instead of directly using knowledge distillation to compress a model, we use it to improve the performance of a ternarized student model that has the same size as the teacher model
  • In Section 3.2, we introduce the distillation loss used to improve the performance of the ternarized model
  • The quantized BERT uses low bits to represent the model parameters and activations, which results in relatively low capacity and worse performance compared with the full-precision counterpart. To alleviate this problem, we incorporate knowledge distillation to improve the performance of the quantized BERT
  • As can be seen, without distillation over the Transformer layers, the performance drops by 3% or more on CoLA and RTE, and also slightly on MNLI
References
  • M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. In Advances in Neural Information Processing Systems.
  • M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser. 2019. Universal transformers. In International Conference on Learning Representations.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.
  • A. Fan, E. Grave, and A. Joulin. 2019. Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations.
  • A. Fan, P. Stock, B. Graham, E. Grave, R. Gribonval, H. Jegou, and A. Joulin. 2020. Training with quantization noise for extreme model compression. Preprint arXiv:2004.07320.
  • G. Hinton, O. Vinyals, and J. Dean. 2015. Distilling the knowledge in a neural network. Preprint arXiv:1503.02531.
  • L. Hou and J. T. Kwok. 2018. Loss-aware weight quantization of deep networks. In International Conference on Learning Representations.
  • L. Hou, Q. Yao, and J. T. Kwok. 2017. Loss-aware binarization of deep networks. In International Conference on Learning Representations.
  • L. Hou, L. Shang, X. Jiang, and Q. Liu. 2020. DynaBERT: Dynamic BERT with adaptive width and depth. Preprint arXiv:2004.04037.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. Preprint arXiv:1909.10351.
  • J. Kim, Y. Bhalgat, J. Lee, C. Patel, and N. Kwak. 2019. QKD: Quantization-aware knowledge distillation. Preprint arXiv:1911.12491.
  • Y. Kim and A. M. Rush. 2016. Sequence-level knowledge distillation. In Conference on Empirical Methods in Natural Language Processing.
  • D. P. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
  • C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin. 2018. Extremely low bit neural network: Squeeze the last bit out with ADMM. In AAAI Conference on Artificial Intelligence.
  • F. Li, B. Zhang, and B. Liu. 2016. Ternary weight networks. Preprint arXiv:1605.04711.
  • Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. 2016. Neural networks with few multiplications. In International Conference on Learning Representations.
  • W. Liu, P. Zhou, Z. Zhao, Z. Wang, H. Deng, and Q. Ju. 2020. FastBERT: A self-distilling BERT with adaptive inference time. In Annual Conference of the Association for Computational Linguistics.
  • I. Loshchilov and F. Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
  • X. Ma, P. Zhang, S. Zhang, N. Duan, Y. Hou, D. Song, and M. Zhou. 2019. A tensorized transformer for language modeling. In Advances in Neural Information Processing Systems.
  • Y. Mao, Y. Wang, C. Wu, C. Zhang, Y. Wang, Y. Yang, Q. Zhang, Y. Tong, and J. Bai. 2020. LadaBERT: Lightweight adaptation of BERT through hybrid model compression. Preprint arXiv:2004.04124.
  • J. S. McCarley. 2019. Pruning a BERT-based question answering model. Preprint arXiv:1910.06360.
  • P. Michel, O. Levy, and G. Neubig. 2019. Are sixteen heads really better than one? Preprint arXiv:1905.10650.
  • A. Polino, R. Pascanu, and D. Alistarh. 2018. Model compression via distillation and quantization. In International Conference on Learning Representations.
  • G. Prato, E. Charlaix, and M. Rezagholizadeh. 2019. Fully quantized transformer for improved translation. Preprint arXiv:1910.10485.
  • P. Rajpurkar, R. Jia, and P. Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. Preprint arXiv:1806.03822.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. Preprint arXiv:1606.05250.
  • M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542.
  • A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. 2014. FitNets: Hints for thin deep nets. Preprint arXiv:1412.6550.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Preprint arXiv:1910.01108.
  • S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI Conference on Artificial Intelligence.
  • P. Stock, A. Joulin, R. Gribonval, B. Graham, and H. Jegou. 2020. And the bit goes down: Revisiting the quantization of neural networks. In International Conference on Learning Representations.
  • S. Sun, Y. Cheng, Z. Gan, and J. Liu. 2019. Patient knowledge distillation for BERT model compression. In Conference on Empirical Methods in Natural Language Processing, pages 4314–4323.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Annual Conference of the Association for Computational Linguistics.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. Preprint arXiv:1804.07461.
  • W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Preprint arXiv:2002.10957.
  • A. H. Zadeh and A. Moshovos. 2020. GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference. Preprint arXiv:2005.03842.
  • O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat. 2019. Q8BERT: Quantized 8bit BERT. Preprint arXiv:1910.06188.