Cascaded Text Generation with Markov Transformers

NeurIPS 2020.

Abstract:

The two dominant approaches to neural text generation are fully autoregressive models, using serial beam search decoding, and non-autoregressive models, using parallel decoding with no output dependencies. This work proposes an autoregressive model with sub-linear parallel time generation. Noting that conditional random fields with bounded context can be decoded in parallel, we propose an efficient cascaded decoding approach for generating high-quality output. To parameterize this cascade, we introduce a Markov transformer, a variant of the popular fully autoregressive model that allows us to simultaneously decode with specific autoregressive context cutoffs. This approach requires only a small modification from standard autoregressive training, while showing competitive accuracy/speed tradeoff compared to existing methods on five machine translation datasets.
Introduction
  • Probabilistic text generation is a ubiquitous tool in natural language processing.
  • State-of-the-art text generation approaches rely on fully autoregressive models such as RNNs and transformers [51], in which the probability of an output word depends on all previous words.
  • Researchers have proposed alternative parallel generation models.
  • One class of non-autoregressive probabilistic models assumes that each word’s output probability is independent of other words [13, 65, 28].
  • While it is impressive that these models perform well, this independence assumption is very strong and often results in noticeable artifacts such as repetitions [13, 49]; the two factorizations are contrasted schematically below.
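Schematically (a sketch in our own notation, not formulas quoted from the paper), the contrast between the two factorizations, and the bounded-context middle ground this work targets, is:

```latex
% Fully autoregressive: each token conditions on all previous tokens (serial decoding).
p(y \mid x) = \prod_{t=1}^{L} p(y_t \mid y_{1:t-1}, x)

% Non-autoregressive: tokens are conditionally independent given the source x
% (fully parallel, but prone to artifacts such as repetition).
p(y \mid x) = \prod_{t=1}^{L} p(y_t \mid x)

% Bounded-context (order-m Markov) middle ground, written as a CRF over n-gram
% potentials f_\theta (illustrative notation) rather than locally normalized factors.
p(y \mid x) \propto \exp\Big(\sum_{t} f_\theta(y_{t-m:t}, x)\Big)
```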
Highlights
  • Probabilistic text generation is a ubiquitous tool in natural language processing.
  • State-of-the-art text generation approaches rely on fully autoregressive models such as RNNs and transformers [51], in which the probability of an output word depends on all previous words.
  • We use Byte Pair Encoding (BPE) [45, 23] learned on the training set with a shared vocabulary between source and target.
  • We demonstrate that probabilistic autoregressive models can achieve sub-linear decoding time while retaining high fidelity translations by replacing beam search with a cascaded inference approach.
  • Experiments on five commonly used machine translation benchmark datasets validate that our approach is competitive in terms of accuracy/speed tradeoff with other state-of-the-art parallel decoding methods, and practically useful with distillation.
  • Our work proposes an alternative approach to beam search that enables more efficient text generation.
Methods
  • Datasets: The authors evaluate the approach on five commonly used machine translation benchmark datasets: IWSLT14 De-En [6] (∼160k parallel sentences), WMT14 En-De/De-En [29] (∼4M parallel sentences), and WMT16 En-Ro/Ro-En [3] (∼610k parallel sentences).
  • For WMT16 the authors use the processed data provided by [25].
  • The base settings are from FAIRSEQ [34]: for IWSLT14 De-En, 6 layers, 4 attention heads, model dimension 512, hidden dimension 1024; for WMT14 En-De/De-En and WMT16 En-Ro/Ro-En, 6 layers, 8 attention heads, model dimension 512, hidden dimension 2048 (restated as a configuration sketch below).
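As a quick restatement of the architecture settings above, here is an illustrative Python mapping; the dictionary names and keys are our own, not the authors' released FAIRSEQ configuration files:

```python
# Illustrative only: mirrors the hyperparameters quoted above, not the paper's scripts.
ARCH_SETTINGS = {
    "iwslt14_de_en": {"layers": 6, "attention_heads": 4, "model_dim": 512, "hidden_dim": 1024},
    "wmt14_en_de":   {"layers": 6, "attention_heads": 8, "model_dim": 512, "hidden_dim": 2048},
    "wmt14_de_en":   {"layers": 6, "attention_heads": 8, "model_dim": 512, "hidden_dim": 2048},
    "wmt16_en_ro":   {"layers": 6, "attention_heads": 8, "model_dim": 512, "hidden_dim": 2048},
    "wmt16_ro_en":   {"layers": 6, "attention_heads": 8, "model_dim": 512, "hidden_dim": 2048},
}
```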
Results
  • The authors' results are competitive with previous work, even methods that use a reranker.
  • On WMT14 En-De, the approach reaches 26.52 BLEU at a 4.68× speedup, compared to NART-DCRF, which reaches 26.80 BLEU at a 4.39× speedup while reranking 19 candidate sentences.
  • On IWSLT14, the BLEU scores are much better than in previous work: the approach comes within 0.54 BLEU of the autoregressive transformer at a 5.88× speedup (K = 16, 2 iterations), 6 BLEU points better than FlowSeq.
Conclusion
  • The authors demonstrate that probabilistic autoregressive models can achieve sub-linear decoding time while retaining high fidelity translations by replacing beam search with a cascaded inference approach.
  • The authors' approach, based on [56], iteratively prunes the search space using increasingly higher-order models.
  • To support this inference procedure, the authors utilize Markov transformers, a variant of the transformer that can be used to parameterize cascades of CRFs. Experiments on five commonly used machine translation benchmark datasets validate that the approach is competitive in terms of accuracy/speed tradeoff with other state-of-the-art parallel decoding methods, and practically useful with distillation; one pruning round of the cascade is sketched below.
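To make the cascade concrete, here is a minimal sketch of one pruning round over a candidate lattice, assuming K candidate states per position and transition scores produced by some bounded-context scorer (in the paper, the Markov transformer run with a context cutoff). The tensor layout, the margin-based pruning rule, and the serial forward/backward max are simplifications for illustration, not the authors' implementation, which uses parallel scans (e.g. via torch-struct):

```python
import torch

def max_marginals(trans):
    """Edge max-marginals on a chain with K candidate states per position.
    trans: (T-1, K, K) scores for transitions between adjacent candidate sets.
    A serial forward/backward max is used here for clarity; the paper's decoder
    replaces it with a parallel prefix (log-depth) scan."""
    T1, K, _ = trans.shape
    alpha = trans.new_zeros(T1 + 1, K)   # best score of any prefix ending in state i
    beta = trans.new_zeros(T1 + 1, K)    # best score of any suffix starting in state i
    for t in range(T1):
        alpha[t + 1] = (alpha[t].unsqueeze(1) + trans[t]).max(dim=0).values
    for t in reversed(range(T1)):
        beta[t] = (trans[t] + beta[t + 1].unsqueeze(0)).max(dim=1).values
    # Best score of any full sequence that uses edge (t, i, j).
    return alpha[:-1, :, None] + trans + beta[1:, None, :]

def prune_round(trans, margin=5.0):
    """Keep only edges whose best completion is within `margin` of the global best.
    `margin` is an illustrative hyperparameter, not the paper's pruning criterion."""
    mm = max_marginals(trans)
    return mm >= mm.max() - margin       # boolean mask over (T-1, K, K) edges
```

Each subsequent round would rescore only the surviving candidates with a higher-order context cutoff, so the search space shrinks while the model becomes more expressive.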
Tables
  • Table1: Main results. †: latency numbers not directly comparable due to platform differences
  • Table2: Markov transformer with different search strategies on IWSLT14 De-En val w/o distillation. Column ∆L shows the length constraint (L − ∆L to L + ∆L), where None denotes no constraint
  • Table3: Cascaded Decoding Example. When m = 4, Viterbi in X4 returns “an amazing woman . eos”. The source is “eine erstaunliche frau . eos” and the target is “an amazing woman . eos”
  • Table4: Cascaded Decoding Example. When m = 4, Viterbi in X4 returns “what has happened ? eos”. The source is “was ist passiert ? eos” and the target is “what happened ? eos”
  • Table5: Cascaded Decoding Example. When m = 4, Viterbi in X4 returns “you ’re happy . eos”. The source is “du bist glücklich . eos” and the target is “you ’re happy . eos”
  • Table6: Cascaded Decoding Example. When m = 4, Viterbi in X4 returns “let ’s move . eos”. The source is “bewe@@ g dich . eos” and the target is “move it . eos”
  • Table7: Cascaded Decoding Example. When m = 4, Viterbi in X4 returns “very , very hard . eos”. The source is “sehr sehr schwer . eos” and the target is “very very hard . eos”
  • Table8: Cascaded Decoding Example. When m = 4, Viterbi in X4 returns “the opposite thing happened . eos”. The source is “das gegenteil passierte . eos” and the target is “the opposite happened . eos”
  • Table9: Results on WMT14 De-En
  • Table10: Results on WMT16 En-Ro
  • Table11: Results on WMT16 Ro-En
  • Table12: Results on IWSLT14 De-En
  • Table13: Optimization settings. We use the same settings for knowledge distillation experiments
Related work
  • There has been extensive interest in non-autoregressive/parallel generation approaches, aiming at producing a sequence in parallel sub-linear time w.r.t. sequence length [13, 52, 26, 65, 53, 14, 11, 12, 48, 15, 28, 16, 49, 55, 30, 41, 64, 62]. Existing approaches can be broadly classified as latent variable based [13, 26, 65, 28, 41], refinement-based [25, 48, 14, 15, 11, 30, 12, 62] or a combination of both [41].

    Latent-variable approaches factor out the dependencies among output words, so that each word can be generated independently of the others conditioned on the latent variables. Training usually employs variational autoencoders, since the log marginal is intractable [21, 37, 31]. The latent variables enable generation in a single forward pass, achieving O(1) time complexity regardless of sequence length, but many such models suffer from generation artifacts such as repetitions [13]. While not using latent variables, our approach could be extended to incorporate them. A notable difference is that the parallel time complexity of this work is not O(1) but O(log L) w.r.t. sequence length. In practice, though, the O(log L) part takes a negligible fraction of total time [49], and our approach reaches speedups comparable to existing O(1) approaches; why the chain reduction needs only O(log L) parallel rounds is sketched below.
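The O(log L) figure comes from the associativity of (max, +) matrix composition: a chain of L − 1 transition matrices can be collapsed pairwise in roughly ⌈log2 L⌉ parallel rounds. A minimal sketch under the assumption of edge scores only (unary scores folded into the transitions); this illustrates the idea rather than the torch-struct implementation the authors use:

```python
import torch

def maxplus(a, b):
    """(max, +) composition of batches of K x K score matrices:
    out[n, i, k] = max_j a[n, i, j] + b[n, j, k]."""
    return (a.unsqueeze(-1) + b.unsqueeze(-3)).max(dim=-2).values

def best_chain_score(trans):
    """Best total score over a chain with transition scores trans: (T-1, K, K).
    Adjacent matrices are combined in pairs, so the chain collapses after
    O(log T) rounds; each round is itself a parallel batched operation."""
    while trans.shape[0] > 1:
        carry = None
        if trans.shape[0] % 2 == 1:          # set aside the odd matrix for this round
            trans, carry = trans[:-1], trans[-1:]
        trans = maxplus(trans[0::2], trans[1::2])
        if carry is not None:
            trans = torch.cat([trans, carry], dim=0)
    return trans[0].max()
```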
Funding
  • This project was supported by NSF SHF 1704834, CAREER IIS-1845664, and Intel
  • YD is supported by a Baidu AI fellowship
Study subjects and analysis
Machine translation datasets: 5
To parameterize this cascade, we introduce a Markov transformer, a variant of the popular fully autoregressive model that allows us to simultaneously decode with specific autoregressive context cutoffs (a toy illustration of such a context cutoff is sketched below). This approach requires only a small modification from standard autoregressive training, while showing competitive accuracy/speed tradeoff compared to existing methods on five machine translation datasets. Experiments on these five datasets compare the approach to beam search and non-autoregressive baselines; the inference approach is comparably fast to non-autoregressive methods while allowing for local dependencies in a principled, probabilistic way.

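As a toy illustration of what an "autoregressive context cutoff" means (our own example, not the Markov transformer's actual masking code), a bounded-context causal attention mask lets each position attend only to a fixed number of preceding positions:

```python
import torch

def bounded_context_mask(length, cutoff):
    """Causal self-attention mask restricted to the previous `cutoff` positions,
    so position t can see positions t-cutoff .. t (illustrative only)."""
    i = torch.arange(length).unsqueeze(1)   # query positions
    j = torch.arange(length).unsqueeze(0)   # key positions
    return (j <= i) & (j >= i - cutoff)     # (length, length) boolean mask

# Example: with cutoff=2, position 4 attends to positions 2, 3 and 4 only.
print(bounded_context_mask(5, 2).int())
```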
Reference
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc’Aurelio Ranzato, and Arthur Szlam. Real or fake? learning to discriminate machine from human generated text. arXiv preprint arXiv:1906.03351, 2019.
  • Ondrej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojevic. Results of the wmt16 metrics shared task. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 199–231, 2016.
  • Alessandro Bondielli and Francesco Marcelloni. A survey on fake news and rumour detection techniques. Information Sciences, 497:38–55, 2019.
  • Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
  • Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th iwslt evaluation campaign, iwslt 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, volume 57, 2014.
  • Eugene Charniak and Mark Johnson. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of the 43rd annual meeting on association for computational linguistics, pages 173–180. Association for Computational Linguistics, 2005.
  • Eugene Charniak, Mark Johnson, Micha Elsner, Joseph Austerweil, David Ellis, Isaac Haxton, Catherine Hill, R Shrivaths, Jeremy Moore, Michael Pozar, et al. Multilevel coarse-to-fine pcfg parsing. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 168–175. Association for Computational Linguistics, 2006.
  • Robert Faris, Hal Roberts, Bruce Etling, Nikki Bourassa, Ethan Zuckerman, and Yochai Benkler. Partisanship, propaganda, and disinformation: Online media and the 2016 us presidential election. Berkman Klein Center Research Publication, 6, 2017.
  • Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. Gltr: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043, 2019.
  • Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Constant-time machine translation with conditional masked language models. arXiv preprint arXiv:1904.09324, 2019.
  • Marjan Ghazvininejad, Omer Levy, and Luke Zettlemoyer. Semi-autoregressive training improves maskpredict decoding. arXiv preprint arXiv:2001.08785, 2020.
  • Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.
  • Jiatao Gu, Qi Liu, and Kyunghyun Cho. Insertion-based decoding with automatically inferred generation order. Transactions of the Association for Computational Linguistics, 7:661–676, 2019.
  • Jiatao Gu, Changhan Wang, and Junbo Zhao. Levenshtein transformer. In Advances in Neural Information Processing Systems, pages 11179–11189, 2019.
  • Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3723–3730, 2019.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In Proceedings of ICLR, 2014.
  • Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the factual consistency of abstractive text summarization. arXiv preprint, 2019.
  • Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
  • John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, 2001.
  • Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901, 2018.
  • Jindřich Libovický and Jindřich Helcl. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. arXiv preprint arXiv:1811.04719, 2018.
  • Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
  • Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. Flowseq: Non-autoregressive conditional sequence generation with generative flow. arXiv preprint arXiv:1909.02480, 2019.
  • Matouš Machácek and Ondrej Bojar. Results of the wmt14 metrics shared task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 293–301, 2014.
  • Elman Mansimov, Alex Wang, and Kyunghyun Cho. A generalized framework of sequence generation with application to undirected sequence models. arXiv preprint arXiv:1905.12790, 2019.
  • Andriy Mnih and Karol Gregor. Neural Variational Inference and Learning in Belief Networks. In Proceedings of ICML, 2014.
  • Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In Advances in Neural Information Processing Systems, pages 4696–4705, 2019.
  • Ephraim Nissan. Digital technologies and artificial intelligence’s present and foreseeable impact on lawyering, judging, policing and law enforcement. Ai & Society, 32(3):441–464, 2017.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019.
  • Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
  • Georg Rehm. Cracking the language barrier for a multilingual Europe.
  • Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of ICML, 2014.
  • Alexander M Rush. Torch-struct: Deep structured prediction library. arXiv preprint arXiv:2002.00876, 2020.
  • Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
  • Alexander M Rush and Slav Petrov. Vine pruning for efficient multi-pass dependency parsing. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 498–507. Association for Computational Linguistics, 2012.
  • Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi. Non-autoregressive machine translation with latent alignments. arXiv preprint arXiv:2004.07437, 2020.
  • Simo Särkkä and Ángel F García-Fernández. Temporal parallelization of bayesian filters and smoothers. arXiv preprint arXiv:1905.13002, 2019.
  • Tal Schuster, Roei Schuster, Darsh J Shah, and Regina Barzilay. Are we safe yet? the limitations of distributional features for fake news detection. arXiv preprint arXiv:1908.09805, 2019.
  • Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: Summarization with pointergenerator networks. arXiv preprint arXiv:1704.04368, 2017.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  • Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053, 2019.
  • Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019.
  • Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. Insertion transformer: Flexible sequence generation via insertion operations. arXiv preprint arXiv:1902.03249, 2019.
  • Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, and Zhihong Deng. Fast structured decoding for sequence models. In Advances in Neural Information Processing Systems, pages 3011–3020, 2019.
  • Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. Lstm neural networks for language modeling. In Thirteenth annual conference of the international speech communication association, 2012.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Chunqi Wang, Ji Zhang, and Haiqing Chen. Semi-autoregressive neural machine translation. arXiv preprint arXiv:1808.08583, 2018.
  • Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Non-autoregressive machine translation with auxiliary regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5377–5384, 2019.
  • Claire Wardle and Hossein Derakhshan. Information disorder: Toward an interdisciplinary framework for research and policy making. Council of Europe report, 27, 2017.
  • Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, and Xu Sun. Imitation learning for nonautoregressive neural machine translation. arXiv preprint arXiv:1906.02041, 2019.
  • David Weiss and Benjamin Taskar. Structured prediction cascades. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 916–923, 2010.
  • Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019.
  • Sam Wiseman, Stuart M Shieber, and Alexander M Rush. Challenges in data-to-document generation. arXiv preprint arXiv:1707.08052, 2017.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
  • Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9051–9062, 2019.
  • Wen Zhang, Liang Huang, Yang Feng, Lei Shen, and Qun Liu. Speeding up neural machine translation decoding by cube pruning. arXiv preprint arXiv:1809.02992, 2018.
  • Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, and Bill Dolan. Pointer: Constrained text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558, 2020.
  • Chunting Zhou, Graham Neubig, and Jiatao Gu. Understanding knowledge distillation in nonautoregressive machine translation. arXiv preprint arXiv:1911.02727, 2019.
  • Jiawei Zhou and Phillip Keung. Improving non-autoregressive neural machine translation with monolingual data. arXiv preprint arXiv:2005.00932, 2020.
  • Zachary M Ziegler and Alexander M Rush. Latent normalizing flows for discrete sequences. arXiv preprint arXiv:1901.10548, 2019.