Deep and light-weight transformer that matches or improves the performance of baseline Transformers with 2 to 3 times fewer parameters on standard machine translation and language modeling tasks
DeLighT: Deep and Light-weight Transformer
We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. DeLighT more efficiently allocates parameters both (1) within each Transformer block using the DeLighT transformation, a deep and light-weight transformation and...
- Attention-based transformer networks (Vaswani et al, 2017) are widely used for sequence modeling tasks, including language modeling and machine translation.
- Since group linear transformations (GLTs) are local by nature, the DeLighT transformation uses feature shuffling, which is analogous to channel shuffling in convolutional networks (Zhang et al, 2018), to share information between different groups.
- Such wide and deep representations facilitate replacing the multi-head attention and feed-forward layers in transformers with single headed attention and light-weight feed-forward layers, reducing total network parameters and operations.
- Unlike transformers, the DeLighT transformation decouples the depth and width from the input size, allowing the authors to allocate parameters more efficiently across blocks by using shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output.
- We demonstrate that DeLighT models achieve similar or better performance than transformer models with significantly fewer parameters and operations on two common sequence modeling tasks: (i) machine translation and (ii) language modeling.
- Our results show that (1) shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output deliver the best performance, and (2) models with block-wise scaling coupled with model scaling achieve better performance than models with model scaling alone.
- We evaluate the performance of DeLighT on two standard sequence modeling tasks: (1) machine translation (Section 4.1) and (2) language modeling (Section 4.2).
- This paper introduces a deep and light-weight transformer architecture, DeLighT, that efficiently allocates parameters both within the DeLighT block and across DeLighT blocks.
- Table 4(b), comparison with existing methods. Rows: LSTM (Grave et al, 2017b), LSTM + Neural Cache (Grave et al, 2017b), QRNN (Merity et al, 2018b), Transformer-XL (Dai et al, 2019), Transformer-XL (the authors' implementation)†, and DeLighT (Ours). Columns: network depth, context length, and # Params (the surviving entries read –, –, 151 M).
- The authors evaluate the performance of DeLighT on two standard sequence modeling tasks: (1) machine translation (Section 4.1) and (2) language modeling (Section 4.2).
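The group linear transformations (GLTs) and feature shuffling mentioned in the summary above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `group_linear` and `feature_shuffle`, the array shapes, and the use of plain matrix products are assumptions made for illustration. The shuffle follows the channel-shuffle idea of Zhang et al (2018), interleaving features so that each group in a following layer receives features from every group of the previous one.

```python
import numpy as np

def group_linear(x, weights):
    """Apply one weight matrix per feature group, then concatenate.

    x:       array of shape (n, d_in)
    weights: list of g arrays, each of shape (d_in // g, d_out // g)
    Each group only sees its own slice of the input, which is why a GLT is local.
    """
    groups = len(weights)
    chunks = np.split(x, groups, axis=-1)  # g slices of shape (n, d_in // g)
    return np.concatenate([c @ w for c, w in zip(chunks, weights)], axis=-1)

def feature_shuffle(x, groups):
    """Interleave features across groups (channel-shuffle style)."""
    n, d = x.shape
    return x.reshape(n, groups, d // groups).transpose(0, 2, 1).reshape(n, d)

# Hypothetical example: 6 features in 2 groups, [0 1 2 | 3 4 5].
x = np.arange(6).reshape(1, 6)
shuffled = feature_shuffle(x, groups=2)  # -> [[0 3 1 4 2 5]]
```

After the shuffle, each contiguous slice of the output mixes features from both original groups, so a second GLT applied to it is no longer restricted to the information of a single group.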
4.1 MACHINE TRANSLATION
Datasets and evaluation: The authors benchmark DeLighT models on four datasets: (1) IWSLT’14 German-English (De-En), (2) WMT’16 English-Romanian (En-Ro), (3) WMT’14 English-German (WMT’14 En-De), and (4) WMT’14 English-French (WMT’14 En-Fr).
- For the WMT’14 English-French (En-Fr) dataset, the authors replicate the setup of Gehring et al (2017), which uses 36M/27K/3K sentence pairs for training, validation, and testing respectively with a joint BPE vocabulary size of 44K.
- Table 4a plots the variation of perplexity with the number of parameters for DeLighT and Transformer-XL (Dai et al, 2019), which outperforms other transformer-based implementations (e.g., Baevski and Auli, 2019).
- Both tables show that DeLighT delivers better performance than state-of-the-art methods and it does this using a smaller context length and significantly fewer parameters, suggesting that the DeLighT transformation helps learn strong contextual relationships.
- The differing settings either hurt performance or increase the parameter count with no further performance gains
- Compared to state-of-the-art transformer models, DeLighT models are (1) deep and light-weight and (2) deliver similar or better performance.
- The authors plan to apply DeLighT to other tasks, including language model pre-training, question answering, and language generation
- Table 1: Comparison with baseline transformers on machine translation corpora. DeLighT models require significantly fewer parameters to achieve similar performance. Here, † and ‡ indicate the best reported transformer baselines from Wu et al (2019) and Ghazvininejad et al (2019), respectively.
- Table 2: DeLighT networks are deep, light-weight, and efficient compared to transformers. BLEU score is reported on the WMT’14 En-Fr dataset. To compute network depth, we count the number of sequential layers in the network (Section 3.3). We used 20 source and 20 target tokens for computing multiplication-addition operations (MACs). See Appendix C for details.
- Table 3: Comparison with state-of-the-art methods on machine translation corpora. DeLighT delivers similar or better performance than state-of-the-art models with fewer parameters. Here, † indicates that the network uses neural architecture search (NAS) and ‡ indicates that full network parameters are not reported.
- Table 4: Results on the WikiText-103 dataset. Compared to Transformer-XL, DeLighT delivers similar or better performance (lower perplexity) with fewer parameters. †For Transformer-XL, we reproduce results using the official source code. For evaluating Transformer-XL with a context length of 480, we set the mem_len hyper-parameter to 480 in the official evaluation scripts.
- Table 5: Comparison with baseline transformers in terms of training speed and memory consumption. In R4, we implemented CUDA kernels for grouping and ungrouping functions only (see Appendix E). We expect DeLighT to be more efficient with a single, dedicated CUDA kernel for grouping, transformation, feature shuffling, and ungrouping. Memory consumption is measured on a single NVIDIA GP100 GPU (16 GB memory) with a maximum of 4096 tokens per batch and without any gradient accumulation.
- Table 6: DeLighT requires less regularization than baseline transformers.
- Table 7: Ablations on different aspects of the DeLighT block, including uniform vs. block-wise scaling, depth scaling, and width scaling. Rows partially highlighted in color have the same configuration (repeated for illustrating results). Our experimental setup is similar to Section 4, except that we train our models for 50K iterations. Multiplication and addition operations (MACs) are computed for 20 time steps.
- Table 8: Effect of the position of the DeLighT transformation. A lower perplexity value means better performance.
- Improving transformers: Several methods have been introduced to improve the transformer architecture. The first line of research addresses the challenge of computing self attention on long input sequences (Child et al, 2019; Kitaev et al, 2020; Beltagy et al, 2020). These methods can be combined with our architecture. The second line of research focuses on explaining multi-head attention (Raganato and Tiedemann, 2018; Brunner et al, 2020). They show that increasing the number of transformer heads can lead to redundant representations (Voita et al, 2019a; Michel et al, 2019) and using fixed attention heads with predefined patterns (Raganato et al, 2020) or synthetic attention matrices (Tay et al, 2020) improves performance. The third line of research focuses on improving transformers by learning better representations (Wu et al, 2019; 2020; So et al, 2019). These works aim to improve the expressiveness of transformers using different transformations – for example, using convolutions (Wu et al, 2019; Gehring et al, 2017), gated linear units (Dauphin et al, 2017), or multi-branch feature extractors (So et al, 2019; Wu et al, 2020). Our work falls into this category. Unlike previous work, we show that it is possible to efficiently allocate parameters both at the block-level using the DeLighT transformation and across blocks using block-wise scaling.
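Block-wise scaling, as summarized here, assigns shallower and narrower DeLighT blocks near the input and deeper and wider blocks near the output. The sketch below shows one way such a schedule could be produced. It assumes a simple linear interpolation between a minimum and a maximum depth; the interpolation rule, the function name `blockwise_depths`, and the example values are illustrative assumptions and are not taken from the text above.

```python
def blockwise_depths(num_blocks, min_depth, max_depth):
    """Return one depth per block, growing from the input side to the output side.

    Assumption: depths are linearly interpolated between min_depth and
    max_depth and rounded to the nearest integer.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if num_blocks == 1:
        return [min_depth]
    step = (max_depth - min_depth) / (num_blocks - 1)
    return [round(min_depth + step * b) for b in range(num_blocks)]

# Hypothetical configuration: 4 blocks, depth growing from 4 to 8.
depths = blockwise_depths(num_blocks=4, min_depth=4, max_depth=8)  # -> [4, 5, 7, 8]
```

The same kind of schedule could be applied to a width multiplier, so that blocks closer to the output receive a larger share of the parameter budget than blocks closer to the input.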
4.1 MACHINE TRANSLATION. Datasets and evaluation: We benchmark DeLighT models on four datasets: (1) IWSLT’14 German-English (De-En), (2) WMT’16 English-Romanian (En-Ro), (3) WMT’14 English-German (WMT’14 En-De), and (4) WMT’14 English-French (WMT’14 En-Fr). For the IWSLT’14 De-En dataset, we replicate the setup of Wu et al (2019) and Edunov et al (2018), which uses 160K/7K/7K sentence pairs for training, validation, and testing, respectively, with a joint BPE vocabulary of about 10K tokens.
[Plot (a): DeLighT vs. Transformer-XL. x-axis: parameters (in million).]
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
- Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
- Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
- Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International conference on machine learning, pages 1058–1066, 2013.
- Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations, 2018a. URL https://openreview.net/forum?id=SyyGPP0TZ.
- Sachin Mehta, Rik Koncel-Kedziorski, Mohammad Rastegari, and Hannaneh Hajishirzi. Pyramidal recurrent unit for language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
- Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6848–6856, 2018.
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Association for Computational Linguistics, 2019.
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020.
- Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020.
- Alessandro Raganato and Jörg Tiedemann. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, November 2018.
- Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. On identifiability in transformers. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJg1f6EFDB.
- Elena Voita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019a.
- Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems, pages 14014–14024, 2019.
- Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. Fixed encoder self-attention patterns in transformer-based machine translation. arXiv preprint arXiv:2002.10260, 2020.
- Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention in transformer models. arXiv preprint arXiv:2005.00743, 2020.
- Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations, 2019.
- Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. Lite transformer with long-short range attention. In International Conference on Learning Representations, 2020.
- David So, Quoc Le, and Chen Liang. The evolved transformer. In Proceedings of the 36th International Conference on Machine Learning, pages 5877–5886, 2019.
- Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1243–1252. JMLR.org, 2017.
- Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 933–941. JMLR.org, 2017.
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.
- Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, 2019.
- Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), August 2016.
- Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In International Conference on Learning Representations, 2019.
- Édouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. Efficient softmax approximation for GPUs. In International Conference on Machine Learning, 2017a.
- Sachin Mehta, Rik Koncel-Kedziorski, Mohammad Rastegari, and Hannaneh Hajishirzi. DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling. In International Conference on Learning Representations, 2020.
- Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. Groupreduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, 2018.
- Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. Mobilebert: a compact task-agnostic bert for resource-limited devices. In Association for Computational Linguistics (ACL), 2020.
- Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference for Representation Learning, 2016.
- Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019b.
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
- Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. In 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing NeurIPS, 2019.
- DeLighT architectures for language modeling and machine translation are shown in Figure 6. For language modeling, we follow the architecture in Baevski and Auli (2019) while for machine translation, we follow the architecture in Vaswani et al. (2017).
- Language modeling: Figure 6a shows the architecture for language modeling. The architecture stacks B DeLighT blocks, and the configuration of each block is determined using block-wise scaling. Each block has three sub-layers. The first layer is a DeLighT transformation that learns representations in high-dimensional space. The second layer is a single-head attention that encodes contextual relationships. The third layer is a position-wise light-weight feed-forward network. Similar to Vaswani et al. (2017), we employ residual connections (He et al., 2016). Similar to previous works (Baevski and Auli, 2019; Dai et al., 2019), we use tied adaptive input (Baevski and Auli, 2019) and adaptive softmax (Grave et al., 2017a) to map tokens to vectors and vectors to tokens, respectively (Inan et al., 2016; Press and Wolf, 2016).
- Machine translation: Figure 6b shows the architecture for machine translation. The encoder stacks B DeLighT blocks, and the configuration of each block is determined using block-wise scaling. Similar to language modeling, each encoder block has three sub-layers. The first layer is a DeLighT transformation that learns representations in high-dimensional space. The second layer is a single-head attention that encodes contextual relationships. The third layer is a position-wise light-weight feed-forward network. Similar to Vaswani et al. (2017), we employ residual connections (He et al., 2016). We use a learnable look-up table to map tokens