Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

NeurIPS 2020 (2020)


Abstract

Recently, Transformer-based language models have demonstrated remarkable performance across many NLP domains. However, the unsupervised pre-training step of these models suffers from unbearable overall computational expenses. Current methods for accelerating the pre-training either rely on massive parallelism with advanced hardware or a…

Introduction
  • Natural language processing (NLP) tasks, such as natural language inference [1, 2] and question answering [3,4,5], have achieved great success with the development of neural networks.
  • Each Transformer layer encodes its input x_i as h_i = f_LN(x_i + f_ATTN(x_i)), where f_ATTN is a multi-head self-attention sub-layer, and then computes x_{i+1} = f_LN(h_i + f_FFN(h_i)), where f_FFN is a feed-forward network and x_{i+1} is the output of the i-th Transformer layer (see the sketch after this list).
  • Both sub-layers have an AddNorm operation that consists of a residual connection [28] and a layer normalization [29].
  • The BERT model applies the Transformer block recursively to the input to obtain the output.
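As a concrete illustration of the sub-layer composition above, the following is a minimal PyTorch-style sketch of a single PostLN Transformer layer matching those equations. The class name and the BERT-base-like hyperparameters (d_model=768, n_heads=12, d_ff=3072) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """A single PostLN Transformer layer: h = f_LN(x + f_ATTN(x)), out = f_LN(h + f_FFN(h))."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # AddNorm around self-attention: h_i = f_LN(x_i + f_ATTN(x_i))
        h = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        # AddNorm around the feed-forward network: x_{i+1} = f_LN(h_i + f_FFN(h_i))
        return self.ln2(h + self.ffn(h))
```

Stacking such layers and feeding the token embeddings through them one after another yields the BERT encoder described above.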
Highlights
  • Natural language processing (NLP) tasks, such as natural language inference [1, 2] and question answering [3,4,5], have achieved great success with the development of neural networks
  • Given this exciting prospect, pre-training Transformer networks on a large corpus of text followed by fine-tuning on specific tasks has become a new paradigm for natural language processing.
  • (i) We find that both the choice of Transformer architecture and the training dynamics have a big impact on layer dropping. (ii) We propose a new architectural unit, the Switchable-Transformer (ST) block, which allows a Transformer layer to be switched on or off for a portion of the training schedule, excluding it from both the forward and backward pass, and which stabilizes Transformer network training. (iii) We further propose a progressive schedule that adds extra stability when pre-training Transformer networks with layer dropping: the schedule smoothly increases the layer dropping rate for each mini-batch as training evolves by adapting, over time, the parameter of the Bernoulli distribution used for sampling (a sketch of this schedule follows this list)
  • Unsupervised language model pre-training is a crucial step for getting state-of-the-art performance on NLP tasks
  • We study efficient training algorithms for pre-training the BERT model for NLP tasks.
  • BERT + PreLN + lr↑ + progressive layer dropping (PLD) reaches a GLUE score of 83.2, better than PreLN while being 24% faster, indicating the strong regularization effect from stochastic depth.
  • We propose the Switchable-Transformer block and a progressive layer-wise drop schedule
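To make the switch-and-schedule idea concrete, below is a minimal sketch of a Switchable-Transformer-style gate together with a progressive drop schedule. Only the time-varying Bernoulli gate is taken from the description above; the exponential decay toward a final keep probability, the linear per-layer scaling, and all names and default values (keep_probability, final_keep=0.5, gamma=10.0) are illustrative assumptions, not the authors' exact formulation.

```python
import math
import torch
import torch.nn as nn

class SwitchableLayer(nn.Module):
    """Wraps a Transformer layer with a Bernoulli gate: when the gate samples "off",
    the layer is skipped entirely, excluding it from both the forward and backward pass."""

    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, x, keep_prob: float):
        if self.training and torch.rand(1).item() > keep_prob:
            return x              # switched off: identity map, no gradients reach this layer
        return self.layer(x)      # switched on: regular computation

def keep_probability(step, total_steps, layer_idx, num_layers,
                     final_keep=0.5, gamma=10.0):
    """Progressive schedule (illustrative): the global keep probability decays smoothly
    from 1.0 toward `final_keep` as training proceeds, and deeper layers are dropped
    more aggressively, following stochastic depth's linear rule across layers."""
    progress = step / total_steps
    theta = (1.0 - final_keep) * math.exp(-gamma * progress) + final_keep
    return 1.0 - (layer_idx + 1) / num_layers * (1.0 - theta)
```

A trainer would call keep_probability(step, total_steps, l, num_layers) for every layer l at each mini-batch and pass the result to the corresponding SwitchableLayer; at inference time self.training is False, so every layer is always executed. A fuller implementation would also rescale the kept sub-layer outputs by the keep probability, as in stochastic depth.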
Results
  • With lr_max = 1e-4, the convergence rates of the algorithm and the baseline are very close.
  • At larger learning rates, the authors' method shows a healthy convergence curve and is much faster.
  • This confirms that the architectural changes stabilize training and allow BERT training with more aggressive learning rates.
  • Fig. 9 shows that the baseline is less robust to the choice of learning rate.
  • PLD is more robust and often achieves better results with large learning rates.
Conclusion
  • Unsupervised language model pre-training is a crucial step for getting state-of-the-art performance on NLP tasks.
  • The authors study efficient training algorithms for pre-training the BERT model for NLP tasks.
  • The authors have conducted extensive analysis and found that model architecture is important when training Transformer-based models with stochastic depth.
  • Using this insight, the authors propose the Switchable-Transformer block and a progressive layer-wise drop schedule.
  • The authors' experimental results show that the training strategy achieves performance competitive with training a deep model from scratch, at a faster rate.
Summary
  • Objectives:

    The authors' goal is to measure how effective these two methods are at stabilizing BERT training.
Tables
  • Table 1: Training time comparison. Sample RD stands for sample reduction; SPD stands for speedup
  • Table 2: Results on the GLUE benchmark. The number below each task denotes the number of training examples. The metrics for these tasks can be found in the GLUE paper [6]. We compute the geometric mean of the metrics as the GLUE score
  • Table 3: Ablation studies of the fine-tuning results on the GLUE benchmark
  • Table 4: Hyperparameters for pre-training the baseline and PLD
Funding
  • Extensive experiments on BERT show that the proposed method achieves a 24% time reduction on average per sample and allows pre-training to be 2.5 times faster than the baseline while reaching similar accuracy on downstream tasks.
  • Furthermore, we evaluate the generalizability of models pre-trained with the same number of samples as the baseline, and we observe that while faster to train, our approach achieves a 1.1% higher GLUE score than the baseline, indicating a strong knowledge transferability
  • PLD reaches the same validation loss at epoch 87, with 53% fewer training samples
  • Furthermore, PLD achieves a 24% time reduction when training on the same number of samples. This is because our approach trains the model with a smaller expected depth for the same number of steps. The saving is slightly lower than the 25% GFLOPs reduction in the analysis because the output layer still takes a small amount of computation even after optimizations (see the worked example after this list).
  • When trained with the same large learning rate as PLD, PreLN's result improves to 82.6 but is still 0.6 points worse than PLD (83.2), despite using 24% more compute.
  • BERT + PreLN + lr↑ + PLD reaches a GLUE score of 83.2, better than PreLN while being 24% faster, indicating the strong regularization effect from stochastic depth.
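For intuition about where the roughly 25% GFLOPs / 24% time saving comes from, the short calculation below averages the expected number of executed layers over an assumed progressive schedule. The schedule form and the constants (NUM_LAYERS=12, FINAL_KEEP=0.5, GAMMA=10.0) are illustrative assumptions rather than the paper's exact settings; under them the expected per-sample compute drops by roughly a quarter.

```python
import math

NUM_LAYERS = 12      # BERT-base depth, used purely for illustration
FINAL_KEEP = 0.5     # assumed final global keep probability
GAMMA = 10.0         # assumed decay speed of the schedule

def expected_depth(progress):
    """Expected number of executed Transformer layers at a point of training
    (progress in [0, 1]), assuming an exponentially decaying global keep
    probability and stochastic-depth-style linear scaling across layers."""
    theta = (1.0 - FINAL_KEEP) * math.exp(-GAMMA * progress) + FINAL_KEEP
    return sum(1.0 - (l + 1) / NUM_LAYERS * (1.0 - theta) for l in range(NUM_LAYERS))

steps = 100_000
avg = sum(expected_depth(s / steps) for s in range(steps)) / steps
print(f"average expected depth: {avg:.1f} of {NUM_LAYERS} layers "
      f"({1 - avg / NUM_LAYERS:.0%} expected compute saving)")
```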
References
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 5754–5764, 2019.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019), pages 4171–4186, 2019.
  • Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4149–4158, 2019.
  • Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. End-toend open-domain question answering with bertserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Demonstrations, pages 72–77, 2019.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, 2019.
  • Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 2694–2703, 2019.
  • Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019.
  • Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Online post, 2019.
  • Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020.
  • Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. Spanbert: Improving pre-training by representing and predicting spans. CoRR, abs/1907.10529, 2019.
  • Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinform., 36(4):1234–1240, 2020.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054, 2019.
  • Turing-NLG: A 17-billion-parameter language model by Microsoft. Microsoft Research Blog, 2020.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 5998–6008, 2017.
  • [18] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
  • [19] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. NVIDIA tensor core programmability, performance & precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops 2018, Vancouver, BC, Canada, May 21-25, 2018, pages 522–531, 2018.
  • [20] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, pages 103–112, 2019.
  • [21] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake A. Hechtman. Meshtensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, pages 10435–10444, 2018.
  • [22] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
  • [23] Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT pre-training time from 3 days to 76 minutes. CoRR, abs/1904.00962, 2019.
  • [24] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108, 2019.
  • [25] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling BERT for natural language understanding. CoRR, abs/1909.10351, 2019.
  • [26] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pages 646–661, 2016.
  • [27] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
  • [28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [29] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
  • [30] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
  • [31] Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tie-Yan Liu. Efficient training of BERT by progressively stacking. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 2337–2346, 2019.
  • [32] Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
  • [33] Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. Learning deep transformer models for machine translation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 1810–1822, 2019.
  • [34] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. arXiv preprint arXiv:2002.04745, 2020.
  • [35] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In 7th International Conference on Learning Representations, 2019.
  • [36] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019.
  • [37] Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of selfattention. CoRR, abs/1910.05895, 2019.
  • [38] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pages 249–256, 2010.
  • [39] Alireza Zaeemzadeh, Nazanin Rahnavard, and Mubarak Shah. Norm-preservation: Why residual networks can become extremely deep? CoRR, abs/1805.07477, 2018.
  • [40] Klaus Greff, Rupesh Kumar Srivastava, and Jürgen Schmidhuber. Highway and residual networks learn unrolled iterative estimation. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
  • [41] Pietro Morerio, Jacopo Cavazza, Riccardo Volpi, René Vidal, and Vittorio Murino. Curriculum dropout. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 3564–3572, 2017.
  • [42] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pages 41–48, 2009.
  • [43] PyTorch Distributed Data Parallel. https://pytorch.org/docs/stable/notes/ddp.html. Accessed: 28-April-2020.
  • [44] Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. Understanding the difficulty of training transformers. CoRR, abs/2004.08249, 2020.
  • [45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision - ECCV 2016 - 14th European Conference, pages 630–645, 2016.
  • [46] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. J. Mach. Learn. Res., 18:153:1–153:43, 2017.