# Depth-Adaptive Transformer

ICLR, 2020.


Abstract:

State of the art sequence-to-sequence models for large scale tasks perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network and we investigate different ways to predict how much computation is required for a particular sequence.

Introduction

- The size of modern neural sequence models (Gehring et al, 2017; Vaswani et al, 2017; Devlin et al, 2019) can amount to billions of parameters (Radford et al, 2019).
- Current models apply the same amount of computation regardless of whether the input is easy or hard.
- The authors extend Graves (2016), who introduced dynamic computation (ACT) to recurrent neural networks, in several ways: they apply different layers at each stage, investigate a range of designs and training targets for the halting module, and explicitly supervise through simple oracles to achieve good performance on large-scale tasks

Highlights

- We present a variety of mechanisms to predict the decoder block at which the model will stop and output the next token, or when it should exit to achieve a good speed-accuracy trade-off
- All parameters are tuned on the validation set and we report results on the test set for a range of average exits
- We extended anytime prediction to the structured prediction setting and introduced simple but effective methods to equip sequence models to make predictions at different points in the network
- Results show that the number of decoder layers can be reduced by more than three quarters at no loss in accuracy compared to a well tuned Transformer baseline
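The confidence-thresholding variant of early exit described above can be sketched as follows (the paper's best results use a learned, correctness-supervised halting classifier instead); `blocks`, `heads`, and the threshold `tau` are illustrative stand-ins, not the authors' implementation:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def decode_token_adaptive(h, blocks, heads, tau=0.9):
    """Run decoder blocks one at a time; after block n, score the next
    token with that block's output head and stop as soon as the model's
    confidence reaches tau (otherwise fall through to the last block)."""
    for n, (block, head) in enumerate(zip(blocks, heads), start=1):
        h = block(h)                       # hidden state after block n
        probs = softmax(head(h))           # next-token distribution
        token = int(probs.argmax())
        if probs[token] >= tau:            # confident enough: exit here
            return token, n
    return token, n                        # used all N blocks

# Toy usage: 3 "blocks" (random linear maps) over a 4-dim state,
# heads projecting to a 5-word vocabulary.
rng = np.random.default_rng(0)
blocks = [lambda x, W=rng.standard_normal((4, 4)): W @ x for _ in range(3)]
heads = [lambda x, W=rng.standard_normal((5, 4)): W @ x for _ in range(3)]
token, exit_block = decode_token_adaptive(rng.standard_normal(4), blocks, heads, tau=0.5)
```

Lowering `tau` trades accuracy for speed: more tokens exit at shallow blocks, so fewer decoder layers run on average.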

Methods

- The authors evaluate on several benchmarks and measure tokenized BLEU (Papineni et al, 2002): IWSLT’14 German to English (De-En).
- The authors use the setup of Edunov et al (2018) and train on 160K sentence pairs.
- The authors use N = 6 blocks with a feed-forward network of intermediate dimension.
- The compared training regimes are: Baseline, Aligned, Mixed with M ∈ {1, 3, 6}, and Uniform (cf. Table 1).
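The aligned-training objective compared in these experiments (a loss applied at every exit, combined with weights ω_n, cf. Table 2) can be sketched roughly as follows; the function name and the normalization choice are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def aligned_loss(exit_logprobs, target, omega):
    """Sketch of aligned training: a cross-entropy term for the target
    token at every exit n = 1..N, combined with normalized weights
    omega_n (the paper tunes these weights, cf. Table 2)."""
    ce = [-lp[target] for lp in exit_logprobs]   # one loss per exit
    w = np.asarray(omega, dtype=float)
    w = w / w.sum()                              # normalize the weights
    return float(np.dot(w, ce))

# Two exits over a 2-word vocabulary, equal weights: the combined loss
# is the mean of the per-exit cross-entropies.
logprobs = [np.log([0.5, 0.5]), np.log([0.9, 0.1])]
loss = aligned_loss(logprobs, target=0, omega=[1, 1])
```

Shifting the weights toward deeper exits (e.g. `omega=[0, 1]`) trains later blocks harder, which is one of the trade-offs Table 2 explores.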

Results

- The exit distribution for a given sample can give insights into what a Depth-Adaptive Transformer decoder considers to be a difficult task.
- For each hypothesis y, the authors look at the sequence of selected exits (n1, ..., n|y|) and the probability scores (p1, ..., p|y|), where pt is the confidence of the model in the sampled token at the selected exit
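The per-hypothesis analysis above, reading off the exit sequence and confidence scores, can be summarized with a small helper; this is an illustrative sketch of the analysis, not code from the paper:

```python
def exit_statistics(exits, confidences):
    """Summarize one hypothesis: average exit depth (a proxy for how
    hard the decoder found the sequence) and the position t whose
    token had the lowest confidence p_t at its selected exit."""
    assert len(exits) == len(confidences)
    avg_exit = sum(exits) / len(exits)
    hardest_t = min(range(len(confidences)), key=confidences.__getitem__)
    return avg_exit, hardest_t

# Four tokens: the third needed the full depth and was least confident.
avg, t = exit_statistics([1, 2, 6, 1], [0.95, 0.80, 0.30, 0.99])
```

Positions with deep exits and low confidence tend to coincide, which is what makes the exit distribution a useful window into what the model finds difficult.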

Conclusion

- The authors extended anytime prediction to the structured prediction setting and introduced simple but effective methods to equip sequence models to make predictions at different points in the network.
- The authors compared a number of mechanisms to predict the required network depth and found that a simple correctness-based geometric-like classifier obtains the best trade-off between speed and accuracy.
- Results show that the number of decoder layers can be reduced by more than three quarters at no loss in accuracy compared to a well tuned Transformer baseline

Tables

- Table1: Aligned vs. mixed training on IWSLT De-En. We report valid BLEU for a uniformly sampled exit n ∼ U([1..6]) at each token, a fixed exit n ∈ [1..6] for all tokens, as well as the average BLEU over the fixed exits. As baseline we show six standard Transformer models with 1-6 blocks
- Table2: Aligned training with different weights (ωn) on IWSLT De-En. For each model we report BLEU on the dev set evaluated with a uniformly sampled exit n ∼ U([1..6]) for each token and a fixed exit n ∈ [1..6] throughout the sequence. The average corresponds to the average BLEU over the fixed exits
- Table3: Aligned training with different gradient scaling ratios γ : 1 on IWSLT’14 De-En. For each model we report the BLEU4 score evaluated with a uniformly sampled exit n ∼ U([1..6]) for each token and a fixed exit n ∈ [1..6]. The average corresponds to the average BLEU4 of all fixed exits
- Table4: FLOPS of basic operations, key parameters and variables for the FLOPS estimation
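In the spirit of the FLOPS accounting behind Table 4, decoding cost can be modeled roughly as scaling with the average exit; the split below into per-block and exit-independent work is a simplifying assumption, not the paper's exact estimate:

```python
def decoder_flops(avg_exit, flops_per_block, flops_fixed=0.0):
    """Rough cost model: work that scales with the number of decoder
    blocks actually run (avg_exit), plus exit-independent work such as
    embeddings and the output projection."""
    return avg_exit * flops_per_block + flops_fixed

# A 6-block model exiting after 1.5 blocks on average performs 25% of
# the baseline's block computation, consistent with the "more than
# three quarters" reduction reported in the conclusion.
ratio = decoder_flops(1.5, flops_per_block=1.0) / decoder_flops(6.0, flops_per_block=1.0)
```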

Contributions

- Trains Transformer models which can make output predictions at different stages of the network and investigates different ways to predict how much computation is required for a particular sequence
- Proposes Transformers which adapt the number of layers to each input in order to achieve a good speed-accuracy trade off at inference time
- Extends Graves, who introduced dynamic computation to recurrent neural networks, in several ways: applies different layers at each stage, investigates a range of designs and training targets for the halting module, and supervises through simple oracles to achieve good performance on large-scale tasks
- Considers a variety of mechanisms to estimate the network depth and applies a different layer at each step
- Extends the resource efficient object classification work of Huang et al. and Bolukbasi et al. to structured prediction where dynamic computation decisions impact future computation
- Can match the performance of well tuned baseline models at up to 76% less computation

References

- Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. Adaptive neural networks for efficient inference. In Proc. of ICML, 2017.
- M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico. Report on the 11th IWSLT evaluation campaign. In IWSLT, 2014.
- Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In Proc. of ICLR, 2018.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL, 2019.
- Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. Classical structured prediction losses for sequence to sequence learning. In Proc. of NAACL, 2018.
- Michael Figurnov, Artem Sobolev, and Dmitry P. Vetrov. Probabilistic adaptive computation time. In ArXiv preprint, 2017.
- Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In Proc. of ICML, 2017.
- Alex Graves. Adaptive computation time for recurrent neural networks. In ArXiv preprint, 2016.
- Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q Weinberger. Multi-scale dense networks for resource efficient image classification. In Proc. of ICLR, 2017.
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. of ICLR, 2015.
- Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. Facebook FAIR's WMT19 news translation task submission. In Proc. of WMT, 2019.
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. Fairseq: A fast, extensible toolkit for sequence modeling. In Proc. of NAACL, 2019.
- K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL, 2002.
- Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
- R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In Proc. of ACL, 2016.
- Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. BranchyNet: Fast inference via early exiting from deep neural networks. In ICPR, 2016.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Proc. of NeurIPS, 2017.
- Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. In Proc. of ECCV, 2018.
