Insertion Transformer: Flexible Sequence Generation via Insertion Operations

International Conference on Machine Learning, pp. 5976-5985, 2019.

Abstract:

We present the Insertion Transformer, an iterative, partially autoregressive model for sequence generation based on insertion operations. Unlike typical autoregressive models which rely on a fixed, often left-to-right ordering of the output, our approach accommodates arbitrary orderings by allowing for tokens to be inserted anywhere in the sequence during decoding.
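The core mechanism described in the abstract, building the output by inserting tokens anywhere into a partial hypothesis, can be pictured with a minimal sketch. The `propose_insertion` function below is a hypothetical stand-in for the trained model's interface, not the paper's actual implementation.

```python
# Minimal sketch of sequential insertion-based decoding.
# `propose_insertion(canvas)` is an assumed model interface: given the current
# partial output (the "canvas"), it returns a (slot, token) pair to insert,
# or None to terminate generation.

def insertion_decode(propose_insertion, max_steps=128):
    canvas = []  # generation starts from an empty output
    for _ in range(max_steps):
        proposal = propose_insertion(canvas)
        if proposal is None:        # the model chooses to stop
            break
        slot, token = proposal      # slot i = position between canvas[i-1] and canvas[i]
        canvas.insert(slot, token)
    return canvas
```

A left-to-right model corresponds to always proposing the rightmost slot; arbitrary generation orders correspond to other slot choices.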

Introduction
Highlights
  • Neural sequence models (Sutskever et al, 2014; Cho et al, 2014) have been successfully applied to many applications, including machine translation (Bahdanau et al, 2015; Luong et al, 2015), speech recognition (Bahdanau et al, 2016; Chan et al, 2016), speech synthesis (Oord et al, 2016a; Wang et al, 2017), image captioning (Vinyals et al, 2015b; Xu et al, 2015) and image generation (Oord et al, 2016b;c)
  • We present a flexible sequence generation framework based on insertion operations
  • While this paper focuses on sequence generation, we note that our framework can be generalized to higher-dimensional outputs
  • One explanation is that the gradients of the binary tree and uniform losses are much more informative, in that they capture information about all of the missing tokens, whereas the left-to-right loss only provides information about the single next token
  • We presented the Insertion Transformer, a partially autoregressive model for sequence generation based on insertion operations
  • When using the binary tree loss, we find empirically that we can generate sequences of length n using close to the asymptotic limit of ⌊log₂ n⌋ + 1 steps without any quality degradation (see the simulation sketch after this list)
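The ⌊log₂ n⌋ + 1 figure can be checked with a toy simulation that is independent of the trained model: if every slot in the canvas receives one insertion per step, which is the idealized balanced binary tree order, the output roughly doubles in length at each step.

```python
import math

def steps_to_generate(n):
    """Count parallel insertion steps needed to grow an empty canvas to n tokens,
    assuming every slot inserts one token per step (ideal balanced binary tree order)."""
    length, steps = 0, 0
    while length < n:
        slots = length + 1                 # one slot before, after, and between every token
        length += min(slots, n - length)   # each slot inserts at most one token
        steps += 1
    return steps

for n in [1, 10, 100, 1000]:
    # simulated step count matches floor(log2(n)) + 1
    print(n, steps_to_generate(n), math.floor(math.log2(n)) + 1)
```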
Methods
  • The authors explore the efficacy of the approach on a machine translation task, analyzing its performance under different training conditions, architectural choices, and decoding procedures.
  • All the experiments are implemented in TensorFlow (Abadi et al, 2015) using the Tensor2Tensor framework (Vaswani et al, 2018).
  • The authors use the default transformer base hyperparameter set reported by Vaswani et al (2018) for all hyperparameters not specific to the model.
  • All the models are trained for 1,000,000 steps on eight P100 GPUs
Results
  • The authors first train the baseline version of the model with different choices of loss functions and termination strategies.
  • The authors observe that the binary tree loss performs the best when standard greedy decoding is used, attaining a development BLEU score of 21.02.
  • The authors find that the left-to-right models do poorly compared to other orderings.
  • Output (greedy decode): Aber auf der anderen Seite des Staates ist das nicht der Eindruck, den viele von ihrem ehemaligen Gouverneur haben. (English: "But on the other side of the state, that is not the impression many have of their former governor.")
  • Parallel decode: the same sentence is produced, with several tokens inserted in parallel at each step (a sketch of this decoding loop follows this list)
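Parallel decoding fills every unfinished slot of the canvas simultaneously at each step. The loop below is a hedged sketch of that procedure; `predict_token_per_slot` is an assumed interface returning one token (or None for a finalized slot) for each of the len(canvas) + 1 slots, and is not the paper's implementation.

```python
# Sketch of parallel decoding with slot finalization (hypothetical interface).
# `predict_token_per_slot(canvas)` returns one prediction per slot; a prediction
# of None marks that slot as finalized (nothing more to insert there).

def parallel_decode(predict_token_per_slot, max_steps=64):
    canvas = []
    for _ in range(max_steps):
        predictions = predict_token_per_slot(canvas)   # len(canvas) + 1 entries
        if all(p is None for p in predictions):        # every slot finalized: stop
            break
        new_canvas = []
        for i, token in enumerate(predictions):
            if token is not None:
                new_canvas.append(token)               # insert into slot i
            if i < len(canvas):
                new_canvas.append(canvas[i])           # keep the existing token
        canvas = new_canvas
    return canvas
```

Decoding terminates once every slot has been finalized, which corresponds to the slot finalization termination condition referenced in Table 1.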
Conclusion
  • The authors presented the Insertion Transformer, a partially autoregressive model for sequence generation based on insertion operations.
  • When using the binary tree loss, the authors find empirically that they can generate sequences of length n using close to the asymptotic limit of ⌊log₂ n⌋ + 1 steps without any quality degradation.
  • This allows them to match the performance of the standard Transformer on the WMT 2014 English-German translation task while using substantially fewer iterations during decoding
Summary
  • Introduction:

    Neural sequence models (Sutskever et al, 2014; Cho et al, 2014) have been successfully applied to many applications, including machine translation (Bahdanau et al, 2015; Luong et al, 2015), speech recognition (Bahdanau et al, 2016; Chan et al, 2016), speech synthesis (Oord et al, 2016a; Wang et al, 2017), image captioning (Vinyals et al, 2015b; Xu et al, 2015) and image generation (Oord et al, 2016b;c)
  • These models have a common theme: they rely on the chain-rule factorization and have an autoregressive left-to-right structure.
  • The autoregressive framework does not accommodate parallel token generation or more elaborate generation orderings (see the factorization sketch below)
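For concreteness, the chain-rule factorization referred to above is written out below; the second expression is a schematic of the per-step quantity an insertion-based model scores instead, a (content, location) pair given a partial canvas, and is not the paper's exact notation.

```latex
% Standard left-to-right chain-rule factorization of a conditional sequence model:
\[ p(y \mid x) = \prod_{t=1}^{n} p\left(y_t \mid y_{<t},\, x\right) \]
% Schematic of the insertion model's per-step distribution: a content c and a
% location (slot) l, conditioned on the source x and a partial canvas \hat{y}:
\[ p\left(c,\, l \mid x,\, \hat{y}\right) \]
```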
  • Methods:

    The authors explore the efficacy of the approach on a machine translation task, analyzing its performance under different training conditions, architectural choices, and decoding procedures.
  • All the experiments are implemented in TensorFlow (Abadi et al, 2015) using the Tensor2Tensor framework (Vaswani et al, 2018).
  • The authors use the default transformer base hyperparameter set reported by Vaswani et al (2018) for all hyperparameters not specific to the model.
  • All the models are trained for 1,000,000 steps on eight P100 GPUs
  • Results:

    The authors first train the baseline version of the model with different choices of loss functions and termination strategies.
  • The authors observe that the binary tree loss performs the best when standard greedy decoding is used, attaining a development BLEU score of 21.02.
  • The authors find that the left-to-right models do poorly compared to other orderings.
  • Output (greedy decode): Aber auf der anderen Seite des Staates ist das nicht der Eindruck, den viele von ihrem ehemaligen Gouverneur haben.
  • Parallel decode: the same sentence is produced, with several tokens inserted in parallel at each step
  • Conclusion:

    The authors presented the Insertion Transformer, a partially autoregressive model for sequence generation based on insertion operations.
  • When using the binary tree loss, the authors find empirically that they can generate sequences of length n using close to the asymptotic limit of ⌊log₂ n⌋ + 1 steps without any quality degradation.
  • This allows them to match the performance of the standard Transformer on the WMT 2014 English-German translation task while using substantially fewer iterations during decoding
Tables
  • Table1: Development BLEU scores obtained via greedy decoding for our basic models trained with various loss functions and termination strategies. The +EOS numbers are the BLEU scores obtained when an EOS penalty is applied during decoding to discourage premature stopping (a sketch of such a penalty follows this list of tables). The +Distillation numbers are for models trained with distilled data. The +Parallel numbers are obtained with parallel decoding, which is applicable to models trained with the slot finalization termination condition
  • Table2: Development BLEU scores obtained via greedy decoding when training models with the architectural variants discussed in Section 3.1. All models are trained with a uniform loss and slot finalization on distilled data
  • Table3: Parallel decoding results on the development set for some of our stronger models. All numbers are comparable to or even slightly better than those obtained via greedy decoding, demonstrating that our model can perform insertions in parallel with little to no cost for end performance
  • Table4: BLEU scores on the newstest2014 test set for the WMT 2014 English-German translation task. Our parallel decoding strategy attains the same level of accuracy reached by linear-complexity models while using only a logarithmic number of decoding steps
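Table 1's EOS penalty can be pictured as subtracting a constant from the end-of-sequence score so that the model is less eager to stop. The snippet below is a hedged illustration with assumed variable names, not the authors' code.

```python
import numpy as np

def pick_token_with_eos_penalty(log_probs, eos_id, penalty=1.0):
    """Greedy token choice for one slot, with a penalty subtracted from the
    end-of-sequence log-probability to discourage premature stopping
    (illustrative only; the penalty value is an assumption)."""
    adjusted = np.array(log_probs, dtype=float)
    adjusted[eos_id] -= penalty   # make EOS less attractive
    return int(np.argmax(adjusted))
```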
Related work
  • There has been prior work on non-left-to-right autoregressive generation. Vinyals et al (2015a) explores the modeling of sets, where generation order does not matter. Ford et al (2018) explores language modeling where select words (i.e., functional words) are generated first, and the rest are filled in using a two-pass process. There has also been prior work in hierarchical autoregressive image generation (Reed et al., 2017), where log n steps are required to generate n tokens. This bears some similarity to our balanced binary tree order.

    Shah et al (2018) also recently proposed generating language with a dynamic canvas. Their work can be seen as a continuous relaxation version of our model, wherein their canvas is an embedding space, while our canvas contains discrete tokens. They applied their approach to language modeling tasks, whereas we apply ours to conditional language generation in machine translation.
Reference
  • Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015.
  • Bahdanau, D., Cho, K., and Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.
  • Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., and Bengio, Y. End-to-End Attention-based Large Vocabulary Speech Recognition. In ICASSP, 2016.
  • Chan, W., Jaitly, N., Le, Q., and Vinyals, O. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In ICASSP, 2016.
  • Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP, 2014.
  • Ford, N., Duckworth, D., Norouzi, M., and Dahl, G. E. The Importance of Generation Order in Language Modeling. In EMNLP, 2018.
  • Graves, A. Sequence Transduction with Recurrent Neural Networks. In ICML Representation Learning Workshop, 2012.
  • Gu, J., Bradbury, J., Xiong, C., Li, V. O., and Socher, R. Non-Autoregressive Neural Machine Translation. In ICLR, 2018.
  • Hinton, G., Vinyals, O., and Dean, J. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
  • Kim, Y. and Rush, A. M. Sequence-Level Knowledge Distillation. In EMNLP, 2016.
  • Lee, J., Mansimov, E., and Cho, K. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. In EMNLP, 2018.
  • Luong, M.-T., Pham, H., and Manning, C. D. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP, 2015.
  • Norouzi, M., Bengio, S., Chen, Z., Jaitly, N., Schuster, M., Wu, Y., and Schuurmans, D. Reward Augmented Maximum Likelihood for Neural Structured Prediction. In NIPS, 2016.
  • Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. In arXiv, 2016a.
  • Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel Recurrent Neural Networks. In ICML, 2016b.
  • Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. Conditional Image Generation with PixelCNN Decoders. In NIPS, 2016c.
  • Reed, S., van den Oord, A., Kalchbrenner, N., Colmenarejo, S. G., Wang, Z., Belov, D., and de Freitas, N. Parallel Multiscale Autoregressive Density Estimation. In ICML, 2017.
  • Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., and Hadsell, R. Policy Distillation. In ICLR, 2016.
  • Shah, H., Zheng, B., and Barber, D. Generating Sentences Using a Dynamic Canvas. In AAAI, 2018.
  • Stern, M., Shazeer, N., and Uszkoreit, J. Blockwise Parallel Decoding for Deep Autoregressive Models. In NeurIPS, 2018.
  • Sutskever, I., Vinyals, O., and Le, Q. Sequence to Sequence Learning with Neural Networks. In NIPS, 2014.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention Is All You Need. In NIPS, 2017.
  • Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez, A. N., Gouws, S., Jones, L., Kaiser, L., Kalchbrenner, N., Parmar, N., Sepassi, R., Shazeer, N., and Uszkoreit, J. Tensor2Tensor for Neural Machine Translation. In AMTA, 2018.
  • Vinyals, O., Bengio, S., and Kudlur, M. Order Matters: Sequence to sequence for sets. In ICLR, 2015a.
  • Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. Show and Tell: A Neural Image Caption Generator. In CVPR, 2015b.
  • Wang, C., Zhang, J., and Chen, H. Semi-Autoregressive Neural Machine Translation. In EMNLP, 2018.
  • Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., and Saurous, R. A. Tacotron: Towards End-to-End Speech Synthesis. In INTERSPEECH, 2017.
  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML, 2015.
  • Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. In ICLR, 2018.