# FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow

EMNLP/IJCNLP (1), pp.4281-4291, (2019)

Abstract

Most sequence-to-sequence (seq2seq) models are autoregressive; they generate each token by conditioning on previously generated tokens. In contrast, non-autoregressive seq2seq models generate all tokens in one pass, which leads to increased efficiency through parallel processing on hardware such as GPUs. However, directly modeling the j…

Code: https://github.com/XuezheMax/flowseq

Introduction

- Each factor P_θ(y_t | y_{<t}, x) can be implemented by function approximators such as RNNs (Bahdanau et al., 2015) and Transformers (Vaswani et al., 2017).
- This factorization takes the complicated problem of joint estimation over an exponentially large space of outputs y and turns it into a sequence of tractable multi-class classification problems, predicting y_t given the previous words, allowing for simple maximum log-likelihood training.
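The factorization above reduces training to summing per-token log-probabilities. A minimal sketch in plain Python (the model producing the scores is assumed; only the likelihood computation is shown):

```python
import math

def autoregressive_log_likelihood(logits, targets):
    """log P(y|x) = sum_t log P(y_t | y_<t, x) under the factorized model.

    logits:  per-position score vectors (length T, each of vocab size V),
             where position t was conditioned only on x and y_1..y_{t-1}.
    targets: gold token ids y_1..y_T.
    """
    total = 0.0
    for scores, y_t in zip(logits, targets):
        log_z = math.log(sum(math.exp(s) for s in scores))  # log partition
        total += scores[y_t] - log_z                        # log-softmax at y_t
    return total
```

Maximizing this quantity over the training data is exactly the "simple maximum log-likelihood training" the factorization enables.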

Highlights

- Neural sequence-to-sequence models (Bahdanau et al., 2015; Rush et al., 2015; Vinyals et al., 2015; Vaswani et al., 2017) generate an output sequence y = {y_1, …, y_T} given an input sequence x = {x_1, …, x_{T'}} using conditional probabilities P_θ(y|x) predicted by neural networks.

- Most seq2seq models are autoregressive, meaning that they factorize the joint probability of the output sequence given the input sequence, P_θ(y|x), into the product of probabilities over the next token in the sequence given the input sequence and previously generated tokens:

  P_θ(y|x) = ∏_{t=1}^{T} P_θ(y_t | y_{<t}, x). (1)

  (* Equal contribution, in alphabetical order. ¹https://github.com/XuezheMax/flowseq)

- Each factor, P_θ(y_t | y_{<t}, x), can be implemented by function approximators such as RNNs (Bahdanau et al., 2015) and Transformers (Vaswani et al., 2017).
- We propose a simple, effective, and efficient model, FlowSeq, which models an expressive prior distribution p_θ(z|x) using a powerful mathematical framework called generative flow (Rezende and Mohamed, 2015).
- As noted above, incorporating expressive latent variables z is essential to decouple the dependencies between tokens in the target sequence in non-autoregressive models.
- A set of latent variables υ ∈ Υ is introduced with a simple prior distribution p_Υ(υ).
- While we perform standard random initialization for most layers of the network, we initialize the last linear transforms that generate the μ and log σ² values with zeros. This ensures that the posterior distribution starts as a simple normal distribution, which we found helps train very deep generative flows more stably.
- We propose FlowSeq, an efficient and effective model for non-autoregressive sequence generation by using generative flows
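To illustrate the change-of-variables rule that generative flows rely on, together with the zero-initialization trick described above, here is a minimal single-variable sketch of one invertible affine step (illustrative only; FlowSeq's actual flow layers are multivariate coupling layers):

```python
import math

def standard_normal_logpdf(x):
    """Log density of a standard normal at x."""
    return -0.5 * (x * x + math.log(2.0 * math.pi))

class AffineFlowStep:
    """One invertible affine transform z = exp(log_scale) * eps + shift.

    The parameters start at zero, so the step is initially the identity:
    the modeled density begins as a simple standard normal, mirroring the
    zero-initialization of the last linear transforms in the text above.
    """
    def __init__(self):
        self.log_scale = 0.0  # zero-init => identity transform at start
        self.shift = 0.0

    def forward(self, eps):
        return math.exp(self.log_scale) * eps + self.shift

    def inverse(self, z):
        return (z - self.shift) * math.exp(-self.log_scale)

    def log_prob(self, z):
        # Change of variables: log p(z) = log p_eps(f^{-1}(z)) - log|dz/deps|,
        # and here log|dz/deps| is simply log_scale.
        eps = self.inverse(z)
        return standard_normal_logpdf(eps) - self.log_scale
```

At initialization `log_prob` coincides with the standard normal density; as training updates `log_scale` and `shift`, the flow reshapes the prior while the log-determinant term keeps likelihoods exact.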

Methods

- Baseline models compared include NAT-IR, NAT w/ FT (NPD n = 10), NAT-REG (NPD n = 9), LV NAR, CMLM-small, and CMLM-base.
- The authors describe training choices they found essential to accelerate training and achieve stable performance.
- Knowledge distillation: previous work on non-autoregressive generation (Gu et al., 2018; Ghazvininejad et al., 2019) has used translations produced by a pre-trained autoregressive NMT model as the training data, noting that this can significantly improve performance.
- The authors analyze the impact of distillation in § 4.2.
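Sequence-level knowledge distillation, as described above, amounts to a simple data-preprocessing step. In this sketch, `teacher_translate` is a hypothetical interface standing in for any pre-trained autoregressive NMT model:

```python
def build_distillation_corpus(teacher_translate, source_sentences):
    """Replace gold targets with a pre-trained autoregressive teacher's
    translations; the non-autoregressive student then trains on these
    (source, teacher-output) pairs instead of the original references.
    """
    return [(src, teacher_translate(src)) for src in source_sentences]
```

Training on the teacher's single-mode outputs is commonly credited with reducing the multi-modality of the target distribution that non-autoregressive students must fit.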

Results

- The authors first conduct experiments comparing FlowSeq with strong baseline models, including NAT w/ Fertility (Gu et al., 2018), NAT-IR (Lee et al., 2018), NAT-REG (Wang et al., 2019), LV NAR (Shu et al., 2019), CTC Loss (Libovický and Helcl, 2018), and CMLM (Ghazvininejad et al., 2019).

- Table 1 provides the BLEU scores of FlowSeq with argmax decoding, together with baselines that use purely non-autoregressive decoding methods generating the output sequence in one parallel pass.
- The FlowSeq base model achieves significant improvements over CMLM-base and LV NAR.
- This demonstrates the effectiveness of FlowSeq in modeling the complex interdependencies among tokens in the target language.
- Regarding the effect of knowledge distillation, two main observations emerge: i) similar to findings in previous work, knowledge distillation still benefits the translation quality of FlowSeq; ii) compared to previous models, the benefit of knowledge distillation for FlowSeq is less significant, yielding less than 3 BLEU points of improvement on the WMT2014 DE-EN corpus and no improvement on the WMT2016 RO-EN corpus.
- The reason might be that FlowSeq does not rely as much on knowledge distillation to alleviate the multi-modality problem.

Conclusion

- Different from the architecture proposed in Ziegler and Rush (2019), the architecture of FlowSeq does not use any autoregressive flow (Kingma et al., 2016; Papamakarios et al., 2017), yielding a truly non-autoregressive model with efficient generation.
- Note that FlowSeq remains non-autoregressive even when an RNN is used in the architecture, because the RNN only encodes a complete sequence of codes, and all input tokens can be fed into the RNN in parallel.
- This makes it possible to use highly optimized implementations of RNNs, such as those provided by cuDNN.
- One potential direction for future work is to leverage iterative refinement techniques, such as masked language models, to further improve translation quality.
- Another exciting direction is to investigate, theoretically and empirically, the latent space in FlowSeq, providing deeper insight into the model and even enhancing controllable text generation.
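The one-pass generation that makes such models efficient can be sketched as follows: given per-position scores produced in a single parallel forward pass, every target token is read off independently (a simplified sketch; FlowSeq additionally samples latent variables and supports rescoring):

```python
def parallel_argmax_decode(position_logits):
    """Non-autoregressive decoding: pick the best token at every target
    position simultaneously, with no dependence on previously generated
    tokens, so all positions can be computed in parallel on a GPU.

    position_logits: per-position score vectors (length T, vocab size V).
    Returns the list of argmax token ids, one per position.
    """
    return [max(range(len(scores)), key=scores.__getitem__)
            for scores in position_logits]
```

Contrast this with autoregressive decoding, where position t cannot be scored until tokens 1..t-1 have been generated, forcing T sequential steps.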


- Table 1: BLEU scores on three MT benchmark datasets for FlowSeq with argmax decoding and for baselines with purely non-autoregressive decoding methods. The first and second blocks show results of models trained with and without knowledge distillation, respectively.
- Table 2: BLEU scores on two WMT datasets for models using advanced decoding methods. The first block is Transformer-base (Vaswani et al., 2017). The second and third blocks show results of models trained with and without knowledge distillation, respectively. n = l × r is the total number of candidates for rescoring.

Funding

- This work was supported in part by DARPA grant FA8750-18-2-0018 funded under the AIDA program and grant HR0011-15-C-0114 funded under the LORELEI program.

Reference

- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).
- Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
- Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. 2015. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. 2016. Density estimation using real nvp. arXiv preprint arXiv:1605.08803.
- Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Constant-time machine translation with conditional masked language models. arXiv preprint arXiv:1904.09324.
- Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In International Conference on Learning Representations (ICLR).
- Jiatao Gu, Qi Liu, and Kyunghyun Cho. 2019.
- Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456.
- Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improving variational inference with inverse autoregressive flow. The 29th Conference on Neural Information Processing Systems.
- Jindřich Libovický and Jindřich Helcl. 2018. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3016–3021.
- Xuezhe Ma and Eduard Hovy. 2019. Macow: Masked convolutional generative flow. arXiv preprint arXiv:1902.04208.
- Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard Hovy. 2018. Stackpointer networks for dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1403–1414.
- Xuezhe Ma, Chunting Zhou, and Eduard Hovy. 2019. MAE: Mutual posterior-divergence regularization for variational autoencoders. In Proceedings of the 7th International Conference on Learning Representations (ICLR).
- Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61, Berlin, Germany. Association for Computational Linguistics.
- Myle Ott, Michael Auli, David Grangier, et al. 2018. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning, pages 3953–3962.
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
- George Papamakarios, Theo Pavlakou, and Iain Murray. 2017. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347.
- Ryan Prenger, Rafael Valle, and Bryan Catanzaro. 2019. Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE.
- Durk P Kingma and Prafulla Dhariwal. 2018. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224.
- Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173– 1182.
- Danilo Jimenez Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, pages 1530–1538. JMLR.org.
- Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
- Tianxiao Shen, Myle Ott, Michael Auli, et al. 2019. Mixture models for diverse machine translation: Tricks of the trade. In International Conference on Machine Learning, pages 5719–5728.
- Raphael Shu, Jason Lee, Hideki Nakayama, and Kyunghyun Cho. 2019. Latent-variable nonautoregressive neural machine translation with deterministic inference using a delta posterior. arXiv preprint arXiv:1908.07181.
- Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. 2019. Insertion transformer: Flexible sequence generation via insertion operations. arXiv preprint arXiv:1902.03249.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164.
- Martin J Wainwright, Michael I Jordan, et al. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.
- Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019. Non-autoregressive machine translation with auxiliary regularization. arXiv preprint arXiv:1902.10245.
- Zachary Ziegler and Alexander Rush. 2019. Latent normalizing flows for discrete sequences. In International Conference on Machine Learning, pages 7673–7682.
