Understanding Knowledge Distillation in Non-autoregressive Machine Translation

ICLR, 2020.


Abstract:

Non-autoregressive machine translation (NAT) systems predict a sequence of output tokens in parallel, achieving substantial improvements in generation speed compared to autoregressive models. Existing NAT models usually rely on the technique of knowledge distillation, which creates the training data from a pretrained autoregressive model.
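The speed difference comes down to how many decoder passes are needed: an autoregressive (AT) model runs one pass per output token, while a NAT model fills every target position in a single pass. A toy sketch of the two decoding loops, where `at_step` and `nat_predict_all` are hypothetical stand-ins for real model calls, not code from the paper:

```python
# Toy contrast between autoregressive (AT) and non-autoregressive (NAT) decoding.
# `at_step` and `nat_predict_all` are hypothetical stand-ins for real model calls.
from typing import Callable, List

def at_decode(source: List[str], at_step: Callable, max_len: int = 50, eos: str = "</s>") -> List[str]:
    """AT: each token is predicted by conditioning on the previously generated ones."""
    target: List[str] = []
    for _ in range(max_len):              # one forward pass per output token
        token = at_step(source, target)
        if token == eos:
            break
        target.append(token)
    return target

def nat_decode(source: List[str], nat_predict_all: Callable, target_length: int) -> List[str]:
    """NAT: all target positions are predicted in one parallel pass,
    so no position can condition on the other generated tokens."""
    return nat_predict_all(source, target_length)   # single forward pass
```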

Introduction
  • Traditional neural machine translation (NMT) systems (Bahdanau et al, 2015; Gehring et al, 2017; Vaswani et al, 2017) generate sequences in an autoregressive fashion; each target token is predicted step-by-step by conditioning on the previous generated tokens in a monotonic order.
  • Sequence-level knowledge distillation (Kim & Rush, 2016) – a special variant of the original approach – is applied during NAT model training by replacing the target side of each training sample with the output of an AT model pre-trained on the same corpus with a roughly equal number of parameters (a sketch of this step follows the list below).
  • It is usually assumed (Gu et al, 2018) that knowledge distillation’s reduction of the “modes” in the training data is the key reason why distillation benefits NAT training.
  • To achieve competitive performance with the autoregressive model, almost all existing NAT models rely on training with data distilled from a pre-trained AT model instead of the real parallel training set.
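Mechanically, sequence-level distillation re-decodes the source side of the bitext with the AT teacher and keeps those outputs as the new targets. A minimal sketch, assuming a hypothetical `teacher.translate(source, beam=...)` wrapper around a trained AT model:

```python
# Minimal sketch of sequence-level knowledge distillation (Kim & Rush, 2016):
# the real target of every training pair is replaced by the AT teacher's output.
# `teacher.translate` is a hypothetical wrapper around a trained AT model.
from typing import Iterable, List, Tuple

def build_distilled_corpus(parallel_data: Iterable[Tuple[str, str]],
                           teacher, beam_size: int = 5) -> List[Tuple[str, str]]:
    distilled = []
    for source, _real_target in parallel_data:
        # The teacher's beam-search output becomes the new training target.
        distilled.append((source, teacher.translate(source, beam=beam_size)))
    return distilled
```

The NAT model is then trained on the distilled pairs exactly as it would be on the original bitext.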
Highlights
  • Traditional neural machine translation (NMT) systems (Bahdanau et al, 2015; Gehring et al, 2017; Vaswani et al, 2017) generate sequences in an autoregressive fashion; each target token is predicted step-by-step by conditioning on the previous generated tokens in a monotonic order
  • Because multiple translations are possible for a single input sentence (the so-called multi-modality problem; Gu et al (2018)), vanilla non-autoregressive translation models can fail to capture the dependencies between output tokens
  • To better study why distillation is crucial for non-autoregressive translation models, we propose quantitative measures for analyzing the complexity and faithfulness of parallel data, two properties that we hypothesize are important for non-autoregressive translation training
  • C(d) reflects the level of multi-modality of a parallel corpus, and we have shown that a simpler data set is favorable to a non-autoregressive translation model (a toy estimator of this kind of measure is sketched after this list).
  • We first systematically examine why knowledge distillation improves the performance of non-autoregressive translation models
  • We propose several techniques that can adjust the complexity of a data set to match the capacity of a non-autoregressive translation model for better performance.
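The complexity measure can be thought of as a conditional entropy over word-aligned pairs: the more distinct target words a given source word aligns to across the corpus, the more "modes" the data exhibits. The estimator below is a hedged toy version; the paper's actual C(d) is computed from word-alignment statistics (fast_align appears in the references) and may differ in detail:

```python
# Hedged toy estimator of a conditional-entropy-style complexity measure C(d):
# lower H(target word | aligned source word) suggests a simpler, less multi-modal
# corpus. The paper's actual estimator (built on word alignments) may differ.
import math
from collections import Counter
from typing import Iterable, Tuple

def conditional_entropy(aligned_pairs: Iterable[Tuple[str, str]]) -> float:
    pairs = list(aligned_pairs)                      # (source_word, target_word) alignment links
    joint = Counter(pairs)
    source_counts = Counter(s for s, _ in pairs)
    total = len(pairs)

    h = 0.0
    for (s, _t), c in joint.items():
        p_joint = c / total                          # p(s, t)
        p_t_given_s = c / source_counts[s]           # p(t | s)
        h -= p_joint * math.log2(p_t_given_s)
    return h                                         # in bits; smaller = simpler data

# A corpus where "Haus" aligns to both "house" and "home" scores higher than one
# where every source word has a single consistent translation.
```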
Results
  • We show that we can achieve the state-of-the-art performance for NAT models and largely match the performance of the AT model.
  • By changing the distilled data set upon which the models are trained, we are able to significantly improve the state-of-the-art results for models in a particular class.
  • We first systematically examine why knowledge distillation improves the performance of NAT models
Conclusion
  • We first systematically examine why knowledge distillation improves the performance of NAT models.
  • We conducted extensive experiments with autoregressive teacher models of different capacity and a wide range of NAT models.
  • We defined metrics that can quantitatively measure the complexity of a parallel data set.
  • A higher-capacity NAT model requires more complex distilled data to achieve better performance.
  • We propose several techniques that can adjust the complexity of a data set to match the capacity of an NAT model for better performance (one of them, sequence-level interpolation, is sketched below).
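One such adjustment, sequence-level interpolation (cf. Table 4), keeps, for each source sentence, the AT teacher's beam candidate with the highest sentence-level BLEU against the real reference, pulling the distilled targets back toward the original data. A hedged sketch, where `teacher.translate_beam` and `sentence_bleu` are hypothetical helpers (a metric library such as sacrebleu could supply the latter):

```python
# Hedged sketch of sequence-level interpolation: for each source sentence, keep
# the teacher beam candidate closest (by sentence-level BLEU) to the real reference.
# `teacher.translate_beam` and `sentence_bleu` are hypothetical helpers.
from typing import Callable, Iterable, List, Tuple

def interpolate_corpus(parallel_data: Iterable[Tuple[str, str]],
                       teacher,
                       sentence_bleu: Callable[[str, str], float],
                       beam_size: int = 5) -> List[Tuple[str, str]]:
    interpolated = []
    for source, reference in parallel_data:
        candidates = teacher.translate_beam(source, beam=beam_size)
        best = max(candidates, key=lambda cand: sentence_bleu(cand, reference))
        interpolated.append((source, best))
    return interpolated
```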
Tables
  • Table1: Complexity C(d) (↑ more complex) of the Europarl data set under the different settings in §3.1
  • Table2: AT and NAT models. Number of parameters
  • Table3: Comparisons of decoding methods on the WMT14 En-De newstest 2014 test set
  • Table4: Results w/ and w/o sequence-level interpolation with LevT
  • Table5: Basic hyper-parameters of architecture for AT models
  • Table6: Dataset statistics for WMT14 En-De
Funding
  • Finds that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data
  • Proposes several approaches that can alter the complexity of data sets to improve the performance of NAT models
  • Proposes metrics for measuring complexity and faithfulness for a given training set
  • Proposes approaches to further adjust the complexity of the distilled data in order to match the model’s capacity
Reference
  • Nader Akoury, Kalpesh Krishna, and Mohit Iyyer. Syntactically supervised transformers for faster neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1269–1281, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1122. URL https://www.aclweb.org/anthology/P19-1122.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.
  • Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72, 2005.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.
  • Chris Dyer, Victor Chahuneau, and Noah Smith. A simple, fast, and effective reparameterization of IBM Model 2. In NAACL, 2013.
  • Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born-again neural networks. In International Conference on Machine Learning, pp. 1602–1611, 2018.
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1243–1252. JMLR. org, 2017.
  • Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Constant-time machine translation with conditional masked language models. arXiv preprint arXiv:1904.09324, 2019.
  • Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, Canada, April 30-May 3, 2018, Conference Track Proceedings, 2018.
  • Jiatao Gu, Changhan Wang, and Jake Zhao. Levenshtein transformer. In Advances in Neural Information Processing Systems 33. 2019.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 944–952. Association for Computational Linguistics, 2010.
  • Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.
  • Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam Shazeer. Fast decoding in sequence models using discrete latent variables. In International Conference on Machine Learning, pp. 2395–2404, 2018.
  • Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1317–1327, 2016.
  • Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.
  • Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1173–1182, 2018.
  • Percy Liang, Hal Daume III, and Dan Klein. Structure compilation: trading structure for features. In ICML, pp. 592–599, 2008.
  • Xuezhe Ma, Pengcheng Yin, Jingzhou Liu, Graham Neubig, and Eduard Hovy. Softmax qdistribution estimation for structured prediction: A theoretical interpretation for raml. arXiv preprint arXiv:1705.07136, 2017.
  • Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. Flowseq: Nonautoregressive conditional sequence generation with generative flow. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, November 2019.
  • Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et al. Parallel wavenet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, pp. 3915–3923, 2018.
  • Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. Analyzing uncertainty in neural machine translation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmassan, Stockholm, Sweden, July 10-15, 2018, pp. 3953–3962, 2018. URL http://proceedings.mlr.press/v80/ott18a.html.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
  • Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. IEEE, 2016.
  • Maja Popovic. chrf: character n-gram f-score for automatic mt evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 392–395, 2015.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL https://www.aclweb.org/anthology/P16-1162.
  • Chenze Shao, Yang Feng, Jinchao Zhang, Fandong Meng, Xilin Chen, and Jie Zhou. Retrieving sequential information for non-autoregressive neural machine translation. arXiv preprint arXiv:1906.09444, 2019.
  • Tianxiao Shen, Myle Ott, Michael Auli, et al. Mixture models for diverse machine translation: Tricks of the trade. In International Conference on Machine Learning, pp. 5719–5728, 2019.
  • Raphael Shu, Jason Lee, Hideki Nakayama, and Kyunghyun Cho. Latent-variable nonautoregressive neural machine translation with deterministic inference using a delta posterior. arXiv preprint arXiv:1908.07181, 2019.
  • Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In In Proceedings of Association for Machine Translation in the Americas, pp. 223–231, 2006.
  • Milos Stanojevic and Khalil Simaan. Beer: Better evaluation as ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 414–419, 2014.
  • Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, pp. 10107–10116, 2018.
  • Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. Insertion transformer: Flexible sequence generation via insertion operations. arXiv preprint arXiv:1902.03249, 2019.
  • David Talbot, Hideto Kazawa, Hiroshi Ichikawa, Jason Katz-Brown, Masakazu Seno, and Franz J Och. A lightweight evaluation framework for machine translation reordering. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 12–21. Association for Computational Linguistics, 2011.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
  • Chunqi Wang, Ji Zhang, and Haiqing Chen. Semi-autoregressive neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 479–488, 2018.
  • Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Non-autoregressive machine translation with auxiliary regularization. arXiv preprint arXiv:1902.10245, 2019.
  • Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, and Xu Sun. Imitation learning for nonautoregressive neural machine translation. arXiv preprint arXiv:1906.02041, 2019.
  • Model: All the AT models are implemented based on the Transformer using fairseq (Ott et al., 2019), and we basically follow the fairseq examples to train the Transformers. Following the notation from Vaswani et al. (2017), we list the basic hyper-parameters (d_model, d_hidden, n_layers, n_heads, p_dropout) of all the AT models we used in Table 5.