Understanding the Difficulty of Training Transformers

EMNLP 2020, pp. 5747–5763

Cited by: 21 | Views: 810

Abstract

Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts regarding carefully designing cutting-edge optimizers and learning rate schedulers (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand what complicates Transformer training from …

Introduction
  • Transformers (Vaswani et al., 2017) have led to a series of breakthroughs in various deep learning tasks (Devlin et al., 2019; Velickovic et al., 2018).
  • They do not contain recurrent connections and can parallelize all computations in the same layer, improving effectiveness, efficiency, and scalability.
  • The authors conduct a comprehensive theoretical and empirical analysis to answer the question: what complicates Transformer training?
Highlights
  • Transformers (Vaswani et al., 2017) have led to a series of breakthroughs in various deep learning tasks (Devlin et al., 2019; Velickovic et al., 2018)
  • In light of our analysis, we propose Admin, an adaptive initialization method for training Post-LN Transformer models, which retains the stability of Pre-LN without hurting performance
  • Our study in Section 3 suggests that the gradient vanishing problem is not the root cause of the unstable Transformer training
Methods
  • Evaluation compares BLEU and dev PPL of Post-LN (Vaswani et al., 2017), DynamicConv (Wu et al., 2019), Pre-LN, and Admin on WMT'14 En-De and IWSLT'14 De-En, using 12-layer Transformer and Transformer-small models.
  • As depicted in Figure 1 and Figure 9, the 6-layer Pre-LN converges faster than Post-LN, but its final performance is worse than Post-LN (the sketch after this list contrasts the two layer-norm placements).
  • For the IWSLT’14 dataset, the authors use the Transformer-small model for training.
  • The authors find that the attention dropout and the activation dropout have a large impact on the model performance.
  • By setting the attention dropout and ReLU dropout ratios to 0.1, the authors improve the Post-LN performance from 34.60 to 35.64 BLEU.
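To make the Post-LN/Pre-LN distinction concrete, here is a minimal NumPy sketch of the two residual-block orderings. It is an illustration only, not the authors' implementation; the helper names (layer_norm, post_ln_block, pre_ln_block) and the toy ReLU sub-layer are assumptions made for the example.

    import numpy as np

    def layer_norm(x, eps=1e-5):
        # Normalize the last dimension to zero mean and unit variance.
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + eps)

    def post_ln_block(x, sublayer):
        # Post-LN (Vaswani et al., 2017): layer norm sits on the main branch,
        # applied after the residual addition.
        return layer_norm(x + sublayer(x))

    def pre_ln_block(x, sublayer):
        # Pre-LN: layer norm is applied to the sub-layer input only, leaving the
        # residual path untouched; per the summary above, this trains more stably
        # but can end up with worse final performance than Post-LN.
        return x + sublayer(layer_norm(x))

    # Toy usage with a random ReLU "sub-layer" standing in for attention or the FFN.
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(512, 512))
    ffn = lambda h: np.maximum(h @ W, 0.0)
    x = rng.normal(size=(4, 512))
    print(post_ln_block(x, ffn).shape, pre_ln_block(x, ffn).shape)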
Conclusion
  • The authors study the difficulty of training Transformers from both theoretical and empirical perspectives.
  • The authors' study in Section 3 suggests that the gradient vanishing problem is not the root cause of the unstable Transformer training.
  • In light of the analysis, the authors propose Admin, an adaptive initialization method to stabilize Transformer training (a simplified sketch of the idea follows this list).
  • It controls each layer's dependency on its residual branch at the beginning of training and maintains the flexibility to capture those dependencies once training stabilizes.
  • Extensive experiments on real-world datasets verify these intuitions and show that Admin achieves more stable training, faster convergence, and better performance.
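The sketch below is a minimal NumPy illustration of the general idea behind Admin as described above: rescale each residual shortcut by a per-layer factor omega estimated during an initial profiling pass, so that early in training no layer relies too heavily on its residual branch. It is not the authors' released implementation; the helpers profile_omegas and admin_forward, and the use of a single scalar omega per layer, are simplifying assumptions.

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + eps)

    def profile_omegas(x, sublayers):
        # Hypothetical profiling pass: run the stack once as plain Post-LN and set
        # each layer's omega from the variance accumulated by the residual branches
        # seen so far, so deeper layers get a larger shortcut weight.
        omegas, acc_var, h = [], 0.0, x
        for f in sublayers:
            branch = f(h)
            acc_var += branch.var()
            omegas.append(float(np.sqrt(acc_var)))
            h = layer_norm(h + branch)
        return omegas

    def admin_forward(x, sublayers, omegas):
        # Admin-style block: rescale the shortcut by omega before adding the
        # residual branch, which limits each layer's reliance on its residual
        # branch early in training; omega can later be trained or absorbed.
        h = x
        for f, w in zip(sublayers, omegas):
            h = layer_norm(h * w + f(h))
        return h

    # Toy usage: six random ReLU sub-layers standing in for attention/FFN modules.
    rng = np.random.default_rng(0)
    Ws = [rng.normal(scale=0.1, size=(256, 256)) for _ in range(6)]
    subs = [(lambda h, W=W: np.maximum(h @ W, 0.0)) for W in Ws]
    x = rng.normal(size=(8, 256))
    print(admin_forward(x, subs, profile_omegas(x, subs)).shape)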
Tables
  • Table 1: Changing decoders from Post-LN to Pre-LN fixes gradient vanishing but does not successfully stabilize model training. Encoder and decoder have 18 layers each.
  • Table 2: Evaluation results on WMT'14 En-De
  • Table 3: Performance on IWSLT'14 De-En (Transformer models are 6-layer Transformer-small models)
References
  • Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. ArXiv, abs/1607.06450.
  • Thomas C. Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian J. McAuley. 2020. ReZero is all you need: Fast convergence at large depth. ArXiv, abs/2003.04887.
  • Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In ICLR.
  • David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. 2017a. The shattered gradients problem: If resnets are the answer, then what is the question? In ICML.
  • David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. 2017b. The shattered gradients problem: If resnets are the answer, then what is the question? In ICML.
  • Yoshua Bengio, Patrice Y. Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks.
  • Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Michael Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In ACL.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
  • Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
  • Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. ArXiv, abs/1706.02677.
  • Boris Hanin and David Rolnick. 2018. How to start training: The effect of initialization and architecture. In NeurIPS.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
  • Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2019. Music Transformer: Generating music with long-term structure. In ICLR.
  • Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. 2018. Visualizing the loss landscape of neural nets. In NeurIPS.
  • Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020. On the variance of the adaptive learning rate and beyond. In ICLR.
  • Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2020. Understanding and improving Transformer from a multi-particle dynamic system point of view. In ICLR Workshop DeepDiffEq.
  • Dmytro Mishkin and Juan E. Sala Matas. 2015. All you need is a good init. In ICLR.
  • Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of self-attention. In IWSLT.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
  • Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image Transformer. In ICML.
  • Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. 2017. Resurrecting the sigmoid in deep learning through dynamical isometry: Theory and practice. In NIPS.
  • Martin Popel and Ondrej Bojar. 2018. Training tips for the Transformer model. The Prague Bulletin of Mathematical Linguistics, 110:43–70.
  • Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. 2019. Stand-alone self-attention in vision models. In NeurIPS.
  • Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2013a. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ArXiv, abs/1312.6120.
  • Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2013b. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ArXiv, abs/1312.6120.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception architecture for computer vision. In CVPR.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
  • Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In ICLR.
  • Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning deep Transformer models for machine translation. In ACL.
  • Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In ICLR.
  • Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, and Jeffrey Pennington. 2018. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In ICML.
  • Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. 2019. On layer normalization in the Transformer architecture. ArXiv, abs/2002.04745.
  • Greg Yang and Samuel S. Schoenholz. 2017. Mean field residual networks: On the edge of chaos. In NIPS.
  • Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. 2019a. Fixup initialization: Residual learning without normalization. In ICLR.
  • Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, and Suvrit Sra. 2019b. Why ADAM beats SGD for attention models. ArXiv, abs/1912.03194.
Findings
  • Following previous studies (Bengio et al., 1994; Glorot and Bengio, 2010; He et al., 2015; Saxe et al., 2013a), the authors analyze the gradient distribution at the very beginning of training, assuming that the randomly initialized parameters and the partial derivatives with respect to module inputs are independent.
  • The derivation computes the variance of the feed-forward sub-layer output, Var[Δx_{2i−1} + max(x_{2i−1} W^(1), 0) W^(2)], where W^(1) has dimension D × D_f, using the result established by He et al. (2015).
  • At initialization, Δx_{2i−1} and the model parameters are assumed to be independent (He et al., 2015).
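For reference, the ReLU variance-propagation result of He et al. (2015) that such derivations rely on can be stated in its generic form as follows; this is the standard result under zero-mean, symmetric, independence assumptions, not necessarily the paper's exact appendix equation.

    % Generic He et al. (2015) variance propagation for a ReLU layer
    % (stated under the usual assumptions, not the paper's exact equation).
    % Let z = x W with x \in \mathbb{R}^{D}, W \in \mathbb{R}^{D \times D_f},
    % entries of W i.i.d., zero-mean, symmetric (e.g., Gaussian) with variance
    % \mathrm{Var}[w], independent of x, and x having i.i.d. zero-mean entries.
    % Then for each output coordinate j:
    \operatorname{Var}[z_j] = D \,\operatorname{Var}[w]\,\operatorname{Var}[x_k],
    \qquad
    \mathbb{E}\!\left[\max(z_j, 0)^2\right] = \tfrac{1}{2}\operatorname{Var}[z_j].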