Improving Transformer Models by Reordering their Sublayers

ACL, pp. 2996-3005, 2020.


Abstract:

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern achieve better performance? We generate randomly ordered transformers and train them with the language modeling objective. We observe that some of these models are able to achieve better performance than the interleaved baseline, and that the successful variants tend to have more self-attention at the bottom and more feedforward sublayers at the top. Based on this insight, we design the sandwich transformer, which improves language modeling performance at no cost in parameters, memory, or runtime, although the same reordering does not improve machine translation models.
Introduction
  • The transformer layer (Vaswani et al, 2017) is currently the primary modeling component in natural language processing, playing a lead role in recent innovations such as BERT (Devlin et al, 2019) and GPT-2 (Radford et al, 2019).
  • The authors generate random transformer models, varying the number of each type of sublayer and their ordering while keeping the number of parameters constant (a sketch of this sampling procedure appears after this list).
  • The authors train these models on the standard WikiText-103 word-level language modeling benchmark (Merity et al, 2016), and observe that some of these random models outperform the original interleaved transformer model, even when the number of self-attention and feedforward layers is not equal.
  • The authors' analysis shows that models with more self-attention toward the bottom and more feedforward sublayers toward the top tend to perform better in general
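A minimal sketch of how such parameter-matched random orderings could be sampled (our illustration, not the authors' code). It assumes that, excluding biases, a self-attention sublayer costs 4d² parameters (the four d×d projections) and a feedforward sublayer with inner dimension 4d costs 8d², so the budget can be tracked in units of d²; `random_ordering` and the cost constants are hypothetical names.

```python
import random

# Assumed per-sublayer parameter costs in units of d^2, excluding biases:
#   self-attention ('s'): 4*d^2  -- the four d x d projections (Q, K, V, output)
#   feedforward    ('f'): 8*d^2  -- the d -> 4d and 4d -> d projections
ATTN_COST, FF_COST = 4, 8

# Total budget of the interleaved 16-layer baseline (one 's' plus one 'f' per layer).
BASELINE_BUDGET = 16 * (ATTN_COST + FF_COST)


def random_ordering(budget: int = BASELINE_BUDGET) -> str:
    """Sample a random 's'/'f' sequence whose total cost matches the budget."""
    order, spent = [], 0
    while spent < budget:
        # Only consider sublayer types that still fit into the remaining budget.
        fitting = [(kind, cost) for kind, cost in (("s", ATTN_COST), ("f", FF_COST))
                   if spent + cost <= budget]
        kind, cost = random.choice(fitting)
        order.append(kind)
        spent += cost
    return "".join(order)


if __name__ == "__main__":
    print("baseline:", "sf" * 16)
    print("sampled: ", random_ordering())
```

Under this accounting, the interleaved baseline and every sampled string cost the same number of parameters, which is the constraint the random search operates under.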
Highlights
  • The transformer layer (Vaswani et al, 2017) is currently the primary modeling component in natural language processing, playing a lead role in recent innovations such as BERT (Devlin et al, 2019) and GPT-2 (Radford et al, 2019)
  • Each transformer layer consists of a self-attention sublayer (s) followed by a feedforward sublayer (f), each wrapped in a residual connection, modifying a sequence of vectors X0 as follows: X1 = s(X0) + X0 and X2 = f(X1) + X1 (a code sketch of this forward pass appears after this list)
  • We show that the sandwich ordering improves language modeling performance on a different word-level language modeling benchmark, and that the sandwich pattern can be used to achieve state-of-the-art results on character-level language modeling
  • While the sandwich ordering does not improve translation models, we show that they are robust to layer order changes, and that even extreme reorderings perform as well as the baseline
  • Sublayer reordering can improve the performance of transformer models, but an ordering that improves models on one group of tasks might not improve the performance on another task
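As a concrete illustration of the residual sublayer structure above, here is a minimal PyTorch sketch (our own, not the authors' released code) in which an arbitrary ordering string such as "sfsf" or "ssff" drives the forward pass; layer normalization, dropout, and causal masking are omitted, and the class name `SublayerStack` is illustrative.

```python
import torch
import torch.nn as nn


class SublayerStack(nn.Module):
    """Apply a string of 's' (self-attention) and 'f' (feedforward) sublayers,
    each wrapped in a residual connection: X_{i+1} = sublayer(X_i) + X_i."""

    def __init__(self, ordering: str, d: int = 1024, heads: int = 16):
        super().__init__()
        self.ordering = ordering
        self.sublayers = nn.ModuleList()
        for kind in ordering:
            if kind == "s":
                self.sublayers.append(
                    nn.MultiheadAttention(d, heads, batch_first=True))
            else:  # "f"
                self.sublayers.append(nn.Sequential(
                    nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for kind, sublayer in zip(self.ordering, self.sublayers):
            out = sublayer(x, x, x)[0] if kind == "s" else sublayer(x)
            x = out + x  # residual connection around every sublayer
        return x


# The interleaved baseline and an extreme reordering are built the same way:
baseline = SublayerStack("sfsf", d=64, heads=4)
extreme = SublayerStack("ssff", d=64, heads=4)
tokens = torch.randn(1, 10, 64)  # (batch, sequence length, d)
print(baseline(tokens).shape, extreme(tokens).shape)
```

Because every sublayer maps a d-dimensional sequence to a d-dimensional sequence, any ordering string yields a valid stack, which is what makes reorderings drop-in replacements for the interleaved baseline.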
Methods
  • The authors' baseline is the strong transformer language model of Baevski and Auli (2019), trained on WikiText-103 (Merity et al, 2016).
  • The Baevski and Auli model contains 16 transformer layers of d = 1024 dimensions, with 16 heads in each self-attention sublayer, and feedforward sublayers with an inner dimension of 4096.
  • To set an accurate baseline, the authors train the baseline model with five different random seeds, achieving 18.65 ± 0.24 perplexity on the development set (a rough per-sublayer parameter count for this configuration follows this list)
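As a rough, back-of-the-envelope check of the per-sublayer weight counts implied by this configuration (our own arithmetic; biases, layer normalization, embeddings, and the adaptive softmax are excluded):

```python
# Baseline configuration of Baevski and Auli (2019), as described above.
d, heads, inner = 1024, 16, 4096

attn_params = 4 * d * d    # query, key, value, and output projections
ff_params = 2 * d * inner  # the d -> 4096 and 4096 -> d weight matrices

print(f"self-attention sublayer: {attn_params / 1e6:.1f}M weights")             # ~4.2M
print(f"feedforward sublayer:    {ff_params / 1e6:.1f}M weights")               # ~8.4M
print(f"16 interleaved layers:   {16 * (attn_params + ff_params) / 1e6:.0f}M")  # ~201M
```

This 2:1 ratio is what lets parameter-matched variants trade one feedforward sublayer for two self-attention sublayers (or vice versa) without changing the total parameter count.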
Results
  • The authors find that using the most extreme sandwich decoder (s6f6) performs almost exactly the same as the average baseline; this result is consistent with the observation from Section 4, where the authors show that the extreme sandwich language model (s16f16) performs as well as the baseline (the construction of these sandwich orderings is sketched below)
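For concreteness, a small helper (illustrative, not from the released code) that builds these orderings, assuming the sandwich with coefficient k over n sublayer pairs places k self-attention sublayers first, k feedforward sublayers last, and keeps the middle interleaved; k = n gives the extreme orderings s16f16 and s6f6 discussed above.

```python
def sandwich(n: int, k: int) -> str:
    """Sandwich ordering over n (s, f) pairs with sandwich coefficient k:
    k leading self-attention sublayers, n - k interleaved pairs, and
    k trailing feedforward sublayers. k = 0 is the interleaved baseline."""
    assert 0 <= k <= n
    return "s" * k + "sf" * (n - k) + "f" * k


print(sandwich(16, 0))   # "sfsf...sf": the interleaved baseline (32 sublayers)
print(sandwich(16, 16))  # 16 's' followed by 16 'f': the extreme s16f16 model
print(sandwich(6, 6))    # "ssssssffffff": the extreme s6f6 decoder
```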
Conclusion
  • This experiment indicates that a reordering pattern that benefits one particular task might not carry the same performance gains to another
  • It demonstrates the general robustness of transformer architectures to sublayer reordering, as the authors did not observe any major performance degradation.
  • On average, better models contain more self-attention sublayers at the bottom and more feedforward sublayers at the top
  • This leads them to design a new transformer stack, the sandwich transformer, which significantly improves performance over the baseline at no cost in parameters, memory, or runtime.
  • By showing that sublayer ordering can improve models at no extra cost, the authors hope that future research continues this line of work by looking into optimal sublayer ordering for other tasks, such as translation, question answering, and classification
Tables
  • Table1: Randomly generated models with 16 self-attention (s) sublayers and 16 feedforward (f) sublayers, and their perplexity on the WikiText-103 development set. The baselines (the standard transformer trained with different random seeds) are in bold
  • Table2: Randomly generated models with the same number of parameters as the baseline, and their perplexity on the WikiText-103 development set. The baselines (the standard transformer trained with different random seeds) are in bold
  • Table3: Performance on the WikiText-103 test set. We compare the best sandwich transformer to the unmodified, interleaved transformer baseline (Baevski and Auli, 2019) trained over 5 random seeds and to other previously reported results
  • Table4: Performance on the Toronto Books Corpus language modeling test set. The baseline model (Baevski and Auli, 2019) is trained over 5 random seeds. The sandwich coefficient is tuned on the validation set and we run our model on the test set only once
  • Table5: Performance on character-level language modeling, evaluated on the enwik8 and text8 test sets. The baseline model (Sukhbaatar et al, 2019) is trained over 5 random seeds. The sandwich coefficient is tuned on each benchmark’s validation set, and we run our model on the test set only once
  • Table6: BLEU on newstest2014 En-De. Our encoder (decoder) sandwich model keeps the decoder (encoder) unmodified. We train the baseline model (Transformer-large with the hyperparameters of Ott et al, 2018) 5 times with different random seeds
  • Table7: The average attention distance, on the WikiText-103 validation dataset, of each model pair. Since there are two baselines and two sandwich transformers (initialized with different random seeds), the distance between the baseline and sandwich models is averaged over all four baseline-sandwich combinations
Related work
  • 7.1 Neural Architecture Search

    In this paper, we manually search through a constrained transformer architecture space, after analyzing the results of two small-scale random searches. This human-in-the-loop method for architecture search has advantages over previous methods (Jozefowicz et al, 2015; Zoph and Le, 2016; Tan and Le, 2019) since it requires that only a few dozen models be trained, unlike typical architecture search methods that require training thousands of instances, consuming massive computational resources.

    While we do find a better-performing transformer, our goal is not only to do so, but to better understand how sublayer ordering affects transformer models. Future work could apply methods from the architecture search literature to the sublayer ordering problem.

    7.2 Transformer Modifications

    Much recent work has been devoted to improving transformers by modifying their sublayers. This includes sparsifying their attention patterns, either in an input-based manner (as in Correia et al, 2019), or in a static manner (as in Guo et al, 2019). So et al (2019) proposed modifying the transformer by adding convolutions and changing the activation function, while others have demonstrated that different initialization schemes (Zhang et al, 2019) and repositioning the layer normalization (Nguyen and Salazar, 2019) can also have a positive effect on performance.
References
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv:1607.06450.
  • Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In ICLR.
  • Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. arXiv:1909.00015.
  • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-Transformer. In NAACL.
  • Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR.
  • Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In ICML.
  • Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In ICLR.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning.
  • Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In ICLR.
  • Victor Sanh. 2019.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
  • David So, Quoc Le, and Chen Liang. 2019. The evolved transformer. In ICML.
  • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.
  • Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In ACL.
  • Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
  • Biao Zhang, Ivan Titov, and Rico Sennrich. 2019. Improving deep transformer with depth-scaled initialization and merged attention. arXiv:1908.11365.
  • Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv:1506.06724.
  • Barret Zoph and Quoc V. Le. 2016. Neural architecture search with reinforcement learning. arXiv:1611.01578.