FastBERT: a Self-distilling BERT with Adaptive Inference Time

ACL, pp. 6035-6044, 2020.

Abstract:

Pre-trained language models like BERT have proven to be highly performant. However, they are often computationally expensive in many practical scenarios, since such heavy models can hardly be readily implemented with limited resources. To improve their efficiency with an assured model performance, we propose a novel speed-tunable FastBERT with adaptive inference time.
Introduction
  • The last two years have witnessed significant improvements brought by language pre-training, such as BERT (Devlin et al., 2019), GPT (Radford et al., 2018), and XLNet (Yang et al., 2019).
  • By pre-training on unlabeled corpora and fine-tuning on labeled ones, BERT-like models have achieved huge gains on many Natural Language Processing tasks.
  • Despite this gain in accuracy, these models incur greater computational cost and slower inference, which severely impairs their practicality.
  • A large number of servers must be deployed to serve BERT in industrial settings, and many spare servers must be reserved to cope with peak request periods, incurring huge costs.
Highlights
  • The last two years have witnessed significant improvements brought by language pre-training, such as BERT (Devlin et al., 2019), GPT (Radford et al., 2018), and XLNet (Yang et al., 2019).
  • Floating-point operations (FLOPs) measure the computational complexity of a model, indicating the number of floating-point operations the model performs for a single inference.
  • We evaluate the text inference capabilities of these models on the twelve datasets and report their accuracy (Acc.) and sample-averaged FLOPs under different Speed values; a toy illustration of how sample-averaged FLOPs arise under early exit follows this list.
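  • A minimal Python sketch of what "sample-averaged FLOPs" means here: each sample only pays for the Transformer layers it actually passes through before exiting, so the batch-averaged cost depends on where the Speed threshold lets each sample stop. The per-layer costs below are made-up placeholders, not the measured values from Table 1.

    # Hypothetical per-sample costs; NOT the paper's measured Table 1 numbers.
    TRANSFORMER_FLOPS = 1.8e9  # assumed cost of one Transformer layer per sample
    CLASSIFIER_FLOPS = 5.0e7   # assumed cost of one student classifier per sample

    def sample_flops(exit_layer: int) -> float:
        """FLOPs spent on one sample that exits after `exit_layer` layers (1-12)."""
        # The sample runs `exit_layer` Transformer layers plus the same number of classifiers.
        return exit_layer * (TRANSFORMER_FLOPS + CLASSIFIER_FLOPS)

    # Exit layers chosen by the adaptive mechanism for a toy batch of eight samples:
    exit_layers = [1, 1, 2, 3, 12, 4, 2, 1]
    avg_flops = sum(sample_flops(k) for k in exit_layers) / len(exit_layers)
    print(f"sample-averaged FLOPs: {avg_flops:.3e}")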
Methods
  • As shown in Figure 2, FastBERT consists of a backbone and branches.
  • The backbone is built upon a 12-layer Transformer encoder with an additional teacher classifier, while the branches are student classifiers appended to each Transformer layer's output to enable early exits.
  • The backbone consists of three parts: the embedding layer, the encoder containing a stack of Transformer blocks (Vaswani et al., 2017), and the teacher classifier.
  • The structure of the embedding layer and the encoder conforms to that of BERT (a code sketch of the backbone-plus-branches layout follows this list).
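  • A minimal PyTorch-style sketch of this layout (class names, dimensions, and layer choices are illustrative assumptions, not the authors' released code; in the paper each student classifier is a small narrowed self-attention block with fully-connected layers, simplified here to a single linear head):

    import torch
    import torch.nn as nn

    class FastBERTSketch(nn.Module):
        """Illustrative backbone (12 Transformer layers + teacher classifier)
        with a student-classifier branch attached to every layer's output."""

        def __init__(self, vocab_size=21128, hidden=768, heads=12, layers=12, num_labels=2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, hidden)        # simplified embedding layer
            self.encoder_layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
                for _ in range(layers)
            )
            self.teacher_classifier = nn.Linear(hidden, num_labels)  # on top of the final layer
            self.student_classifiers = nn.ModuleList(                # one branch per layer
                nn.Linear(hidden, num_labels) for _ in range(layers)
            )

        def forward(self, input_ids):
            hidden_states = self.embedding(input_ids)
            student_logits = []
            for layer, student in zip(self.encoder_layers, self.student_classifiers):
                hidden_states = layer(hidden_states)
                student_logits.append(student(hidden_states[:, 0]))  # classify on the [CLS] position
            teacher_logits = self.teacher_classifier(hidden_states[:, 0])
            return teacher_logits, student_logits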
Results
  • Experimental results on six Chinese and six English NLP tasks demonstrate that FastBERT achieves a huge reduction in computation with very little loss in accuracy.
  • The authors list the measured FLOPs of both structures in Table 1, from which one can infer that the computational load (FLOPs) of the classifier is much lighter than that of the Transformer.
  • This is the basis of FastBERT's speed-up: although it adds extra classifiers, it achieves acceleration because the computation saved in the Transformer layers far outweighs the cost of those classifiers (see the adaptive-inference sketch after this list).
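  • A minimal sketch of the adaptive inference loop for a single sample (batch size 1), building on the FastBERTSketch module above. The paper's uncertainty measure is a normalized entropy of the student's label distribution; the exact normalization below is an assumption. Lowering Speed forces samples deeper into the encoder, while raising it lets more samples exit early.

    import torch

    def normalized_entropy(probs: torch.Tensor) -> torch.Tensor:
        """Entropy of a label distribution scaled into [0, 1] (N = number of labels)."""
        n = probs.size(-1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
        return entropy / torch.log(torch.tensor(float(n)))

    @torch.no_grad()
    def adaptive_inference(model, input_ids, speed=0.5):
        """Run one sample layer by layer; exit as soon as a student is confident enough."""
        hidden_states = model.embedding(input_ids)
        for i, (layer, student) in enumerate(
                zip(model.encoder_layers, model.student_classifiers)):
            hidden_states = layer(hidden_states)
            probs = torch.softmax(student(hidden_states[:, 0]), dim=-1)
            if normalized_entropy(probs).item() < speed:   # low uncertainty: exit early
                return probs.argmax(dim=-1), i + 1         # prediction, layers actually used
        # No student was confident enough: fall back to the teacher classifier.
        probs = torch.softmax(model.teacher_classifier(hidden_states[:, 0]), dim=-1)
        return probs.argmax(dim=-1), len(model.encoder_layers)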
Conclusion
  • FastBERT adopts a self-distillation mechanism in the training phase and an adaptive mechanism in the inference phase, achieving the goal of gaining more efficiency with less accuracy loss (a sketch of the distillation loss follows this list).
  • Self-distillation and adaptive inference are first introduced to NLP models in this paper.
  • Empirical results show that FastBERT can be 2 to 3 times faster than BERT without performance degradation.
  • If the tolerated loss in accuracy is relaxed, the model is free to tune its speedup between 1 and 12 times.
  • FastBERT remains compatible with the parameter settings of other BERT-like models (e.g., BERT-WWM, ERNIE, and RoBERTa), which means these publicly available models can be readily loaded for FastBERT initialization.
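  • A sketch of the self-distillation step, assuming the FastBERTSketch module above: each student branch is trained on unlabeled data to match the frozen teacher classifier's soft output via a KL divergence summed over branches. The KL direction used here is a common convention and may not match the paper's equation exactly.

    import torch
    import torch.nn.functional as F

    def self_distillation_loss(teacher_logits, student_logits_list):
        """Sum of KL divergences pulling every student branch toward the
        (frozen) teacher's predicted distribution."""
        teacher_probs = torch.softmax(teacher_logits.detach(), dim=-1)  # teacher is not updated
        loss = torch.zeros(())
        for student_logits in student_logits_list:
            student_log_probs = torch.log_softmax(student_logits, dim=-1)
            # KL(teacher || student), averaged over the batch
            loss = loss + F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
        return loss

    # Usage with the sketch above (unlabeled data only, as in the self-distillation stage):
    # teacher_logits, student_logits = model(input_ids)
    # loss = self_distillation_loss(teacher_logits, student_logits)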
Tables
  • Table 1: FLOPs of each operation within FastBERT (M = million, N = number of labels)
  • Table 2: Comparison of accuracy (Acc.) and FLOPs (speedup) between FastBERT and baselines on six Chinese and six English datasets
  • Table 3: Results of ablation studies on the Book Review and Yelp.P datasets
Related work
  • BERT (Devlin et al., 2019) learns universal knowledge from massive unlabeled data and produces more performant results. Many works have followed: RoBERTa (Liu et al., 2019) uses a larger corpus and longer training; T5 (Raffel et al., 2019) scales up the model size even further; UER (Zhao et al., 2019) pre-trains BERT on different Chinese corpora; and K-BERT (Liu et al., 2020) injects a knowledge graph into the BERT model. These models achieve higher accuracy at the cost of heavier settings and even more data.
Funding
  • This work is funded by the 2019 Tencent Rhino-Bird Elite Training Program.
References
  • Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
  • Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. 2017. Spatially adaptive computation time for residual networks. In Proceedings of CVPR, pages 1790–1799.
  • Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. 2014. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115.
  • Alex Graves. 2016. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983.
  • Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in NeurIPS, pages 1135–1143.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. Computer Science, 14(7):38–39.
  • Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.
  • Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, pages 1746–1751.
  • Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of EMNLP-IJCNLP, pages 4356–4365.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
  • Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020. K-BERT: Enabling language representation with knowledge graph. In Proceedings of AAAI.
  • Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, and Buzhou Tang. 2018. LCQMC: A large-scale Chinese question matching corpus. In Proceedings of COLING, pages 1952–1962.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.
  • Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, and Lijiao Yang. 2018. Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings. In Proceedings of CCL, pages 209–221. Springer.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550.
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In NeurIPS EMC2 Workshop.
  • Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019a. Patient knowledge distillation for BERT model compression. In Proceedings of EMNLP-IJCNLP, pages 4314–4323.
  • Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019b. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.
  • Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. 2016. BranchyNet: Fast inference via early exiting from deep neural networks. In Proceedings of ICPR, pages 2464–2469.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in NeurIPS, pages 5998–6008.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of EMNLP, pages 353–355.
  • Baoxin Wang. 2018. Disconnected recurrent neural networks for text categorization. In Proceedings of ACL, pages 2311–2320.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in NeurIPS, pages 649–657.
  • Zhe Zhao, Hui Chen, Jinbin Zhang, Xin Zhao, Tao Liu, Wei Lu, Xi Chen, Haotang Deng, Qi Ju, and Xiaoyong Du. 2019. UER: An open-source toolkit for pre-training models. In Proceedings of EMNLP-IJCNLP, page 241.