On the Transformer Growth for Progressive BERT Training
NAACL-HLT, pp. 5174–5180 (2021)
As the excessive cost of pre-training drives the need to improve efficiency, considerable effort has been made to train BERT progressively: start from an inferior but low-cost model and gradually increase the computational complexity. Our objective is to help advance the understanding of such Transformer growth and discover principles t…
- Thanks to the increasing computational power, pre-trained language models have been breaking the glass ceiling for natural language processing tasks (Peters et al, 2018; Devlin et al, 2019; Liu et al, 2019; Brown et al, 2020).
- Two components are needed to design such progressive training algorithms: the growth scheduler and the growth operator (Dong et al, 2020).
- The former controls when to conduct network growth, and the latter controls how to perform it.
- The authors' objectives are to better understand growth operators for Transformer models and to help design better progressive algorithms for BERT (Devlin et al, 2019) pre-training.
- We show that growing a Transformer from both dimensions leads to better performance with less training cost, which verifies our intuitions and shows the potential of compound growth operators in progressive BERT training.
- On the BERT-large model, stacking and CompoundGrow speed up pre-training by 70.7% and 111.4% respectively in FLOPs, and by 69.7% and 82.2% respectively in walltime.
- We show that the compound growth method achieves better performance than single-dimensional growth methods.
- It remains an open research direction to study the relationships between different operators and to explore effective schedulers that coordinate the different training stages of progressive training.
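The scheduler/operator split described in the bullets above can be sketched as a toy training loop. Every name below is illustrative; this is a sketch of the general idea, not the authors' code:

```python
# Toy sketch of the progressive-training loop: a growth *scheduler*
# decides when to grow, and growth *operators* decide how (depth and/or
# width). All names here are illustrative, not the paper's code.

def grow_depth(layers):
    """Depth operator: duplicate the current stack (cf. stacking)."""
    return layers + [dict(layer) for layer in layers]

def grow_width(layers, factor=2):
    """Width operator: enlarge each layer's hidden size (a real model
    would also copy/tile the trained weights)."""
    return [{**layer, "hidden": layer["hidden"] * factor} for layer in layers]

def train(total_steps, schedule):
    """`schedule` plays the scheduler: it maps step -> growth operator."""
    layers = [{"hidden": 256} for _ in range(3)]    # small, low-cost start
    history = []
    for step in range(total_steps):
        if step in schedule:                        # when to grow
            layers = schedule[step](layers)         # how to grow
        history.append((len(layers), layers[0]["hidden"]))
    return history

# Compound growth: grow depth first, then width, then keep training.
hist = train(6, {2: grow_depth, 4: grow_width})
```

A compound operator, as above, applies growth along more than one dimension over the course of training, whereas a single-dimensional baseline such as stacking would register only `grow_depth` in the schedule.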
- Previous studies have rarely focused on progressive Transformer growth for BERT training; to the best of the authors' knowledge, progressive Transformer stacking (Gong et al, 2019) is the only directly comparable method.
- The authors apply their method to the official BERT model.
- The new training schedule is much faster than the reported one and still gives better final performance than the original paper.
- This is the fastest stacking model the authors can obtain without a performance drop.
- The authors train the full model for 300K steps, just like the compared method.
- On the BERT-base model, stacking and CompoundGrow speed up pre-training by 68.7% and 107.1% respectively in FLOPs, and by 64.9% and 73.6% respectively in walltime.
- On the BERT-large model, stacking and CompoundGrow speed up pre-training by 70.7% and 111.4% respectively in FLOPs, and by 69.7% and 82.2% respectively in walltime.
- Both compared methods achieve at least the same performance as the original BERT model.
- On the base model, stacking is better in terms of average GLUE score, mainly due to its advantage on the CoLA dataset.
- Such an unusual gap on CoLA might be caused by its relatively small volume and corresponding random variance (Dodge et al, 2020).
- On the larger and more robust MNLI dataset, the compared methods achieve almost the same score.
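One way to sanity-check speedup figures above 100% like those reported here is the common convention speedup = baseline cost / reduced cost − 1; the page does not state the exact definition, so this convention is an assumption:

```python
# Hypothetical helper for reading speedup percentages above 100%:
# under the assumed convention speedup = baseline/reduced - 1, a 111.4%
# FLOPs speedup means the progressive schedule needs roughly
# 1 / 2.114, i.e. about 0.47x of the baseline forward-pass FLOPs.

def speedup_pct(baseline_cost, reduced_cost):
    """Percent speedup of `reduced_cost` relative to `baseline_cost`."""
    return (baseline_cost / reduced_cost - 1.0) * 100.0

# Cost implied by a 111.4% speedup, with the baseline normalized to 1.0:
implied_cost = 1.0 / (1.0 + 111.4 / 100.0)
```

Under this reading, halving the cost corresponds to a 100% speedup, so the reported 107.1% and 111.4% FLOPs speedups mean the schedules use somewhat less than half the baseline FLOPs.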
- In this work the authors empirically verify the compound effect for Transformer growth. Different from previous works, they propose to grow a low-cost Transformer model along more than one dimension.
- The authors show that the compound growth method achieves better performance than single-dimensional growth methods.
- The authors apply a controlled method to compare available growth operators on different dimensions, providing practical guidance for operator selection.
- The authors' final model speeds up the training of the BERT-base and BERT-large models by 73.6% and 82.2% in walltime respectively while achieving comparable performance.
- The study of compound growth leaves substantial space for future improvement, especially in the design of growth operators on different dimensions.
- From another perspective, it remains an open research direction to study the relationships between different operators and to explore effective schedulers that coordinate the different training stages of progressive training.
- Table 1: Empirical comparison among growth operators. For each operator, a low-cost model is first trained for 700K steps, then grown to the original BERT model for another 300K steps of training.
- Table 2: The pre-training speedup and finetuning performance on the dev sets of MNLI and SQuAD. M/MM stands for matched/mismatched accuracy for MNLI. EM/F1 represents the exact match score and F1 score for SQuAD. The FLOPs are estimated for forward-pass operations, while the walltime is real training time profiled by the TensorFlow profiler in a distributed multi-host setting.
- Table 3: The test performance on the GLUE benchmark with metrics described in the original paper (Wang et al, 2018); higher is better. Compound stands for the proposed method.
- Progressive training was originally proposed to improve training stability; it starts from an efficient, small model and gradually increases the model capacity (Simonyan & Zisserman, 2014). Recent studies leverage this paradigm to accelerate model training. For example, multi-level residual networks (Chang et al, 2018) explore the possibility of augmenting network depth from a dynamical-systems view, transforming each layer into two subsequent layers. AutoGrow (Wen et al, 2019) attempts to automate the discovery of the proper depth to achieve near-optimal performance on different datasets. LipGrow (Dong et al, 2020) proposes a learning algorithm with an automatic growing scheduler for convolutional networks. At the same time, many studies have been conducted on model growth operators. Network Morphism (Wei et al, 2016; 2017) manages to grow a layer into multiple layers while keeping the represented function intact. Net2net (Chen et al, 2015) is a successful application that transfers knowledge to a wider network with function-preserving initialization. Similar ideas also appear in many network architectures, including the progressive growing of GANs (Karras et al, 2017) and Adaptive Computation Time (Graves, 2016; Jernite et al, 2016).
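The function-preserving idea behind Network Morphism and Net2net can be illustrated with a tiny linear network. This is a hypothetical sketch (made-up shapes, not the original code): duplicating a hidden unit and halving its outgoing weights widens the layer without changing the function the network computes.

```python
import numpy as np

# Sketch of a Net2net-style function-preserving widening step (an
# illustration, not the original implementation): duplicate one hidden
# unit of a two-layer linear net and halve its outgoing weights, so the
# widened network computes exactly the same function.

rng = np.random.default_rng(1)
W1 = rng.standard_normal((3, 4))   # input(3) -> hidden(4)
W2 = rng.standard_normal((4, 2))   # hidden(4) -> output(2)

def widen(W1, W2, unit):
    """Copy hidden `unit`, splitting its outgoing weights in half."""
    W1_new = np.hstack([W1, W1[:, unit:unit + 1]])   # duplicate column
    W2_new = np.vstack([W2, W2[unit:unit + 1, :]])   # duplicate row
    W2_new[unit, :] /= 2.0
    W2_new[-1, :] /= 2.0
    return W1_new, W2_new

W1w, W2w = widen(W1, W2, unit=0)
x = rng.standard_normal(3)
# Linear layers compose, so the output is preserved exactly:
assert np.allclose(x @ W1 @ W2, x @ W1w @ W2w)
```

For nonlinear networks the same split works as long as the duplicated unit's activation is identical for both copies, which is why Net2net duplicates units rather than adding random ones.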
- Devlin et al (2019) design a two-stage training scheme with a reduced sequence length for the first 90% of updates.
- Gong et al (2019) stack the trained weights of a shallow model to initialize a deeper one, growing the BERT-base model along the depth dimension and achieving 25% shorter training time.
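A minimal sketch of this stacking initialization, with made-up shapes and no claim to match Gong et al.'s implementation: a trained L-layer encoder's weights initialize a 2L-layer model, with deep layer i reusing shallow layer i mod L.

```python
import numpy as np

# Illustrative sketch of stacking initialization: the deeper model's
# layer i is initialized from trained shallow layer i % L. The 4x4
# matrices stand in for a real layer's weights.

rng = np.random.default_rng(0)
shallow = [rng.standard_normal((4, 4)) for _ in range(3)]  # "trained" 3-layer model

deep = [shallow[i % len(shallow)].copy() for i in range(2 * len(shallow))]
```

Unlike the Net2net-style widening, stacking is not function-preserving; it is a warm-start heuristic that reuses trained weights so the deeper model converges faster than training from scratch.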
- T. Brown, B. Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, P. Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, G. Kruger, Tom Henighan, R. Child, Aditya Ramesh, D. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, E. Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, J. Clark, Christopher Berner, Sam McCandlish, A. Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020.
- Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks from dynamical systems view. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyJS-OgR-.
- Chen Chen, Xianzhi Du, Le Hou, Jaeyoun Kim, Pengchong Jin, Jing Li, Yeqing Li, Abdullah Rashwan, and Hongkun Yu. Tensorflow official model garden, 2020. URL https://github.com/tensorflow/models/tree/master/official.
- Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Michael Schuster, Zhi-Feng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation. In ACL, 2018.
- Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer, 2015.
- Zihang Dai, Guokun Lai, Yiming Yang, and Quoc V Le. Funnel-transformer: Filtering out sequential redundancy for efficient language processing. arXiv preprint arXiv:2006.03236, 2020.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.
- Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305, 2020.
- Chengyu Dong, Liyuan Liu, Zichao Li, and Jingbo Shang. Towards adaptive residual network training: A neural-ode perspective. In Proceedings of the Thirty-seventh International Conference on Machine Learning (ICML 2020), July 2020.
- Linyuan Gong, D. He, Zhuohan Li, T. Qin, Liwei Wang, and T. Liu. Efficient training of bert by progressively stacking. In ICML, 2019.
- Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.
- Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. arXiv preprint arXiv:1611.06188, 2016.
- Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
- Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692, 2019.
- Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. ArXiv, abs/1802.05365, 2018.
- A. Radford. Improving language understanding by generative pre-training. 2018.
- Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- M. Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. ArXiv, abs/1905.11946, 2019.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, L. Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017.
- Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2018.
- Tao Wei, Changhu Wang, Yong Rui, and Chang Wen Chen. Network morphism. In International Conference on Machine Learning, pp. 564–572, 2016.
- Tao Wei, Changhu Wang, and Chang Wen Chen. Modularized morphing of neural networks. arXiv preprint arXiv:1701.03281, 2017.
- Wei Wen, Feng Yan, and Hai Li. Autogrow: Automatic layer growing in deep convolutional networks, 2019.
- Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, and Philipp Krahenbuhl. A multigrid method for efficiently training video models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 153–162, 2020.