Depth-Adaptive Transformer

ICLR, 2020.

Other Links: dblp.uni-trier.de|arxiv.org
TL;DR: We extended anytime prediction to the structured prediction setting and introduced simple but effective methods to equip sequence models to make predictions at different points in the network.

Abstract:

State of the art sequence-to-sequence models for large scale tasks perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network and we investigate different ways to predict how much computation is required for a particular sequence.

Introduction
  • The size of modern neural sequence models (Gehring et al., 2017; Vaswani et al., 2017; Devlin et al., 2019) can amount to billions of parameters (Radford et al., 2019).
  • Current models apply the same amount of computation regardless of whether the input is easy or hard.
  • The authors extend Graves (2016), who introduced dynamic computation (ACT) for recurrent neural networks, in several ways: they apply different layers at each stage, investigate a range of designs and training targets for the halting module, and explicitly supervise it through simple oracles to achieve good performance on large-scale tasks (a decoding sketch follows this list).
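To make the adaptive-depth idea concrete, the following is a minimal inference-time sketch, assuming one output classifier per decoder block and a simple confidence-threshold exit rule; the class and method names (`EarlyExitDecoder`, `predict_next_token`) are illustrative and not the authors' code.

```python
# Illustrative sketch (not the authors' released code): token-level early exit at
# inference time. Each decoder block has its own output classifier, and decoding
# stops at the first block whose prediction confidence clears a threshold.
import torch
import torch.nn as nn

class EarlyExitDecoder(nn.Module):
    def __init__(self, num_blocks=6, d_model=512, nhead=4, vocab_size=10000):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=1024,
                                       batch_first=True)
            for _ in range(num_blocks)
        ])
        # One output classifier per block, so every depth can emit a token ("anytime").
        self.classifiers = nn.ModuleList([
            nn.Linear(d_model, vocab_size) for _ in range(num_blocks)
        ])

    @torch.no_grad()
    def predict_next_token(self, tgt, memory, threshold=0.9):
        """Return (token_id, exit_block, confidence) for the last target position."""
        h = tgt
        for n, (block, clf) in enumerate(zip(self.blocks, self.classifiers), start=1):
            h = block(h, memory)                          # run one more decoder block
            probs = clf(h[:, -1]).softmax(dim=-1)         # prediction at this depth
            conf, token = probs.max(dim=-1)
            if conf.item() >= threshold or n == len(self.blocks):
                return token.item(), n, conf.item()       # exit early (or at the top)

# Toy usage with random tensors standing in for encoder states and an embedded prefix.
decoder = EarlyExitDecoder()
memory = torch.randn(1, 20, 512)   # encoder output for one source sentence
prefix = torch.randn(1, 3, 512)    # embedded target prefix
print(decoder.predict_next_token(prefix, memory))
```

Confidence thresholding is only one possible exit criterion; the paper also studies halting modules that are explicitly supervised with simple oracles, sketched after the Highlights list below.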
Highlights
  • The size of modern neural sequence models (Gehring et al., 2017; Vaswani et al., 2017; Devlin et al., 2019) can amount to billions of parameters (Radford et al., 2019)
  • We extend Graves (2016), who introduced dynamic computation (ACT) for recurrent neural networks, in several ways: we apply different layers at each stage, investigate a range of designs and training targets for the halting module, and explicitly supervise it through simple oracles to achieve good performance on large-scale tasks
  • We present a variety of mechanisms to predict the decoder block at which the model should stop and output the next token, i.e. when it should exit, in order to achieve a good speed-accuracy trade-off (a minimal halting-module sketch follows this list)
  • All parameters are tuned on the validation set and we report results on the test set for a range of average exits
  • We extended anytime prediction to the structured prediction setting and introduced simple but effective methods to equip sequence models to make predictions at different points in the network
  • Results show that the number of decoder layers can be reduced by more than three quarters at no loss in accuracy compared to a well-tuned Transformer baseline
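As a companion to the decoding sketch above, here is a minimal sketch of a learned halting module together with a correctness-based oracle to supervise it; the single-linear-layer parameterization, the threshold rule and the helper names are assumptions for illustration, not the paper's exact design.

```python
# Illustrative sketch of a per-token halting module in the spirit of the
# correctness-based, "geometric-like" exit classifier: every block predicts a
# probability of stopping there, and decoding exits at the first block that
# clears a threshold.
import torch
import torch.nn as nn

class HaltingClassifier(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, h):                                  # h: (d_model,) for one token
        return torch.sigmoid(self.proj(h)).squeeze(-1)     # halt probability in [0, 1]

def choose_exit(hidden_per_block, halt_modules, tau=0.5):
    """Exit at the first block whose halt probability exceeds tau (else the last)."""
    for n, (h, halt) in enumerate(zip(hidden_per_block, halt_modules), start=1):
        if halt(h).item() >= tau:
            return n
    return len(hidden_per_block)

def correctness_oracle(predictions_per_block, gold_token):
    """Training target: the first block whose argmax prediction is already correct."""
    for n, pred in enumerate(predictions_per_block, start=1):
        if pred == gold_token:
            return n
    return len(predictions_per_block)

# Toy usage: hidden states of one token after each of the 6 decoder blocks.
halts = nn.ModuleList([HaltingClassifier() for _ in range(6)])
states = [torch.randn(512) for _ in range(6)]
print(choose_exit(states, halts), correctness_oracle([7, 7, 3, 3, 3, 3], gold_token=3))
```

At training time the oracle exit supplies the target for the halting classifiers; at inference, choose_exit replaces a fixed decoder depth.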
Methods
  • The authors evaluate on several benchmarks and measure tokenized BLEU (Papineni et al., 2002): IWSLT’14 German to English (De-En).
  • The authors use the setup of Edunov et al (2018) and train on 160K sentence pairs.
  • The authors use N = 6 blocks, each with a feed-forward sub-network; the configurations compared in Table 1 are the standard Transformer baselines (1–6 blocks), aligned training, and mixed training with M = 1, 3 and 6, evaluated with both uniformly sampled and fixed exits (see also the loss sketch after this list).
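To make the aligned vs. mixed comparison concrete, here is a deliberately simplified sketch of the two anytime training objectives; the exit weights ωn, the uniform sampling of M exits and the helper names are assumptions, and the sketch only illustrates which exits receive supervision.

```python
# Illustrative sketch of the two anytime training objectives compared in Table 1.
# `logits_per_block[n]` holds the output of the classifier attached to block n+1,
# with shape (batch, seq_len, vocab); `targets` has shape (batch, seq_len).
import random
import torch
import torch.nn.functional as F

def aligned_loss(logits_per_block, targets, weights=None):
    """Supervise every exit: weighted sum of cross-entropies over all N blocks."""
    n_blocks = len(logits_per_block)
    weights = weights or [1.0 / n_blocks] * n_blocks      # omega_n, uniform by default
    loss = 0.0
    for w, logits in zip(weights, logits_per_block):
        loss = loss + w * F.cross_entropy(logits.transpose(1, 2), targets)
    return loss

def mixed_loss(logits_per_block, targets, M=3):
    """Supervise only M exits per batch (sampled uniformly here)."""
    chosen = random.sample(range(len(logits_per_block)), k=M)
    losses = [F.cross_entropy(logits_per_block[n].transpose(1, 2), targets)
              for n in chosen]
    return sum(losses) / M

# Toy usage with random logits for N = 6 exits.
B, T, V, N = 2, 5, 100, 6
logits = [torch.randn(B, T, V) for _ in range(N)]
targets = torch.randint(0, V, (B, T))
print(aligned_loss(logits, targets).item(), mixed_loss(logits, targets, M=3).item())
```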
Results
  • The exit distribution for a given sample can give insights into what a Depth-Adaptive Transformer decoder considers to be a difficult task.
  • For each hypothesis y, the authors look at the sequence of selected exits (n1, ..., n|y|) and the probability scores (p1, ..., p|y|), where pt is the confidence of the model in the sampled token at the selected exit (a small analysis helper is sketched below).
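A small, hypothetical analysis helper for the statistics described above: given the recorded exits (n1, ..., n|y|) and confidences (p1, ..., p|y|) of one hypothesis, it reports the average exit, the exit histogram, and the positions of the deepest and least confident tokens.

```python
# Hypothetical analysis helper: summarize the exits and confidences recorded for
# one decoded hypothesis y, as described above.
from collections import Counter

def summarize_exits(exits, confidences):
    """exits: [n_1, ..., n_|y|]; confidences: [p_1, ..., p_|y|] for one hypothesis."""
    avg_exit = sum(exits) / len(exits)
    histogram = Counter(exits)                                  # usage count per block
    deepest = max(range(len(exits)), key=lambda t: exits[t])    # position exiting deepest
    least_confident = min(range(len(confidences)), key=lambda t: confidences[t])
    return {"avg_exit": avg_exit,
            "exit_histogram": dict(histogram),
            "deepest_token_position": deepest,
            "least_confident_position": least_confident}

print(summarize_exits(exits=[1, 1, 4, 6, 2],
                      confidences=[0.97, 0.95, 0.42, 0.31, 0.88]))
```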
Conclusion
  • The authors extended anytime prediction to the structured prediction setting and introduced simple but effective methods to equip sequence models to make predictions at different points in the network.
  • The authors compared a number of different mechanisms to predict the required network depth and found that a simple correctness-based geometric-like classifier obtains the best trade-off between speed and accuracy.
  • Results show that the number of decoder layers can be reduced by more than three quarters at no loss in accuracy compared to a well-tuned Transformer baseline.
Tables
  • Table 1: Aligned vs. mixed training on IWSLT De-En. We report valid BLEU for a uniformly sampled exit n ∼ U([1..6]) at each token, a fixed exit n ∈ [1..6] for all tokens, as well as the average BLEU over the fixed exits. As baseline we show six standard Transformer models with 1-6 blocks
  • Table 2: Aligned training with different weights (ωn) on IWSLT De-En. For each model we report BLEU on the dev set evaluated with a uniformly sampled exit n ∼ U([1..6]) for each token and a fixed exit n ∈ [1..6] throughout the sequence. The average corresponds to the average BLEU over the fixed exits
  • Table 3: Aligned training with different gradient scaling ratios γ : 1 on IWSLT’14 De-En. For each model we report the BLEU4 score evaluated with a uniformly sampled exit n ∼ U([1..6]) for each token and a fixed exit n ∈ [1..6]. The average corresponds to the average BLEU4 of all fixed exits
  • Table 4: FLOPS of basic operations, key parameters and variables used for the FLOPS estimation (a rough estimation sketch follows this list)
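Table 4 supports estimating decoding FLOPS as a function of how many decoder blocks are actually executed; the sketch below is a rough, assumed per-token accounting with standard attention and feed-forward multiply-accumulate counts and placeholder sizes, not a reproduction of the paper's Table 4.

```python
# Rough, assumed FLOP accounting per decoded token (not the paper's Table 4 numbers):
# a decoder block costs roughly self-attention + cross-attention + FFN, so the
# decoding cost scales with the number of blocks actually executed.

def decoder_flops_per_token(avg_exit, d_model=512, d_ffn=1024, src_len=25, tgt_len=25):
    # Placeholder sizes; all counts are multiply-accumulates for one token.
    attn_proj = 4 * d_model * d_model                 # Q, K, V and output projections
    self_attn = attn_proj + 2 * tgt_len * d_model     # attention over the target prefix
    cross_attn = attn_proj + 2 * src_len * d_model    # attention over the source
    ffn = 2 * d_model * d_ffn                         # two linear layers of the FFN
    per_block = self_attn + cross_attn + ffn
    return avg_exit * per_block                       # cost grows with the average exit

# E.g. exiting after 1.5 blocks on average vs. running all 6 blocks:
print(decoder_flops_per_token(1.5) / decoder_flops_per_token(6.0))   # 0.25
```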
Contributions
  • Trains Transformer models which can make output predictions at different stages of the network and investigates different ways to predict how much computation is required for a particular sequence
  • Proposes Transformers which adapt the number of layers to each input in order to achieve a good speed-accuracy trade-off at inference time
  • Extends Graves, who introduced dynamic computation for recurrent neural networks, in several ways: applies different layers at each stage, investigates a range of designs and training targets for the halting module, and supervises it through simple oracles to achieve good performance on large-scale tasks
  • Considers a variety of mechanisms to estimate the network depth and applies a different layer at each step
  • Extends the resource efficient object classification work of Huang et al. and Bolukbasi et al. to structured prediction where dynamic computation decisions impact future computation
  • Can match the performance of well-tuned baseline models with up to 76% less computation
References
  • Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. Adaptive neural networks for efficient inference. In Proc. of ICML, 2017.
  • M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico. Report on the 11th IWSLT evaluation campaign. In IWSLT, 2014.
  • Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal Transformers. In Proc. of ICLR, 2018.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL, 2019.
  • Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. Classical structured prediction losses for sequence to sequence learning. In Proc. of NAACL, 2018.
  • Michael Figurnov, Artem Sobolev, and Dmitry P. Vetrov. Probabilistic adaptive computation time. ArXiv preprint, 2017.
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In Proc. of ICML, 2017.
  • Alex Graves. Adaptive computation time for recurrent neural networks. ArXiv preprint, 2016.
  • Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q. Weinberger. Multi-scale dense networks for resource efficient image classification. In Proc. of ICLR, 2017.
  • D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. of ICLR, 2015.
  • Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. Facebook FAIR’s WMT19 news translation task submission. In Proc. of WMT, 2019.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proc. of NAACL, 2019.
  • K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL, 2002.
  • Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
  • R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In Proc. of ACL, 2016.
  • Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. BranchyNet: Fast inference via early exiting from deep neural networks. In ICPR, 2016.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Proc. of NeurIPS, 2017.
  • Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. In Proc. of ECCV, 2018.