TL;DR: We propose aligned cross entropy as an alternative loss function for training non-autoregressive models.

Aligned Cross Entropy for Non-Autoregressive Machine Translation

ICML, pp. 3515-3523, 2020

Abstract

Non-autoregressive machine translation models significantly speed up decoding by allowing for parallel prediction of the entire target sequence. However, modeling word order is more challenging due to the lack of autoregressive factors in the model. This difficulty is compounded during training with cross entropy loss, which can highly penalize small shifts in word order. To address this, the paper proposes aligned cross entropy (AXE) as an alternative loss function for training non-autoregressive models.
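To make the shift penalty above concrete, here is a toy, illustrative calculation (not taken from the paper): a hypothesis whose words are all correct but shifted right by one position is scored with position-wise cross entropy. The vocabulary size, token ids, and 0.9 model confidence are arbitrary assumptions for the example.

```python
import numpy as np

def positionwise_ce(pred_token_ids, target_ids, vocab_size, p_correct=0.9):
    """Average cross entropy when the model puts p_correct on its own top token
    at every slot and spreads the remaining mass uniformly over the vocabulary."""
    p_other = (1.0 - p_correct) / (vocab_size - 1)
    losses = []
    for pred, gold in zip(pred_token_ids, target_ids):
        p_gold = p_correct if pred == gold else p_other
        losses.append(-np.log(p_gold))
    return float(np.mean(losses))

vocab_size = 10_000
target  = [11, 12, 13, 14, 15]   # hypothetical token ids for a 5-word sentence
exact   = [11, 12, 13, 14, 15]   # correct words in the correct positions
shifted = [99, 11, 12, 13, 14]   # the same words shifted right by one slot

print(positionwise_ce(exact, target, vocab_size))    # ~0.11: near-perfect score
print(positionwise_ce(shifted, target, vocab_size))  # ~11.5: every slot is penalized
```

Every slot of the shifted hypothesis counts as an error, so it is scored roughly as badly as random output even though its relative word order is intact; this is the over-penalization that AXE is designed to soften.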

Introduction
  • Non-autoregressive machine translation models can significantly improve decoding speed by predicting every word in parallel (Gu et al., 2018; Libovicky & Helcl, 2018).
  • The authors present a new training loss for non-autoregressive machine translation that softens the penalty for word order errors, and significantly improves performance with no modification to the model or to the decoding algorithm.
Highlights
  • Non-autoregressive machine translation models can significantly improve decoding speed by predicting every word in parallel (Gu et al., 2018; Libovicky & Helcl, 2018)
  • We evaluate conditional masked language models trained with AXE on 6 standard machine translation benchmarks, and demonstrate that AXE significantly improves performance over cross entropy trained conditional masked language models as well as over recently proposed non-autoregressive models
  • AXE vs. cross entropy: We first compare the performance of AXE-trained conditional masked language models to that of conditional masked language models trained with the original cross entropy loss
  • State of the art: We compare the performance of conditional masked language models with AXE against nine strong baseline models: the fertility-based sequence-to-sequence model (Gu et al., 2018), transformers trained with CTC loss (Libovicky & Helcl, 2018), the iterative refinement approach (Lee et al., 2018), transformers trained with auxiliary regularization (Wang et al., 2019), conditional masked language models trained with cross entropy loss (Ghazvininejad et al., 2019), Flowseq, a latent variable model based on generative flow (Ma et al., 2019), hint-based training (Li et al., 2019), bag-of-ngrams training (Shao et al., 2019), and the CRF-based semi-autoregressive model (Sun et al., 2019)
  • We introduced Aligned Cross Entropy (AXE) as an alternative loss function for training non-autoregressive models
  • In the context of machine translation, a conditional masked language model (CMLM) trained with AXE significantly outperforms cross entropy trained models, setting a new state-of-the-art for non-autoregressive models
Methods
  • The authors evaluate CMLMs trained with AXE on 6 standard machine translation benchmarks, and demonstrate that AXE significantly improves performance over cross entropy trained CMLMs as well as over recently proposed non-autoregressive models.
  • Translation benchmarks: The authors evaluate the method on both directions of three standard machine translation datasets with various training data sizes: WMT’14 English-German (4.5M sentence pairs), WMT’16 English-Romanian (610k pairs), and WMT’17 English-Chinese (20M pairs).
  • The authors train all models with mixed precision floating point arithmetic on 16 Nvidia V100 GPUs. For autoregressive decoding, the authors use a beam size of b = 5 (Vaswani et al., 2017) and tune decoding hyperparameters on the validation set.
Results
  • AXE vs. cross entropy: The authors first compare the performance of AXE-trained CMLMs to that of CMLMs trained with the original cross entropy loss.
  • State of the art: The authors compare the performance of CMLMs with AXE against nine strong baseline models: the fertility-based sequence-to-sequence model (Gu et al., 2018), transformers trained with CTC loss (Libovicky & Helcl, 2018), the iterative refinement approach (Lee et al., 2018), transformers trained with auxiliary regularization (Wang et al., 2019), CMLMs trained with cross entropy loss (Ghazvininejad et al., 2019), Flowseq, a latent variable model based on generative flow (Ma et al., 2019), hint-based training (Li et al., 2019), bag-of-ngrams training (Shao et al., 2019), and the CRF-based semi-autoregressive model (Sun et al., 2019).
Conclusion
  • The authors introduced Aligned Cross Entropy (AXE) as an alternative loss function for training non-autoregressive models.
  • AXE focuses on relative order and lexical matching instead of relying on absolute positions (an illustrative sketch of such an aligned loss follows this list).
  • In the context of machine translation, a conditional masked language model (CMLM) trained with AXE significantly outperforms cross entropy trained models, setting a new state-of-the-art for non-autoregressive models.
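Table 1 below refers to three local update operators in AXE’s dynamic program, and Table 6 to a skip-target penalty coefficient δ. The following is a minimal NumPy sketch of a monotonic-alignment loss built from three such operators (align a target with a prediction slot, skip a prediction slot via a special blank token, skip a target at a δ-scaled cost); the exact operator costs, the blank-token mechanism, and the function name are assumptions for illustration and may differ from the authors' implementation.

```python
import numpy as np

def axe_style_loss(log_probs, target, blank_id, delta=1.0):
    """Illustrative monotonic-alignment loss in the spirit of AXE.

    log_probs: (T_pred, V) array of per-slot log-probabilities over the vocabulary.
    target:    sequence of gold token ids, length T_tgt.
    blank_id:  id of a special token emitted by prediction slots that are skipped.
    delta:     skip-target penalty coefficient (cf. Table 6).
    """
    T_pred = log_probs.shape[0]
    T_tgt = len(target)

    # A[i, j]: minimal cost of aligning the first i targets with the first j slots.
    A = np.full((T_tgt + 1, T_pred + 1), np.inf)
    A[0, 0] = 0.0
    for j in range(1, T_pred + 1):
        # Before any target is consumed, only skip-prediction moves are possible.
        A[0, j] = A[0, j - 1] - log_probs[j - 1, blank_id]

    for i in range(1, T_tgt + 1):
        for j in range(1, T_pred + 1):
            c = -log_probs[j - 1, target[i - 1]]          # cost of matching target i to slot j
            A[i, j] = min(
                A[i - 1, j - 1] + c,                      # align target i with slot j
                A[i, j - 1] - log_probs[j - 1, blank_id], # skip slot j (it should predict blank)
                A[i - 1, j] + delta * c,                  # skip target i at a delta-scaled cost
            )
    return A[T_tgt, T_pred]
```

For training, the same recurrence would presumably be computed over batched model log-probabilities in an autodiff framework, since each min over smooth per-operator costs remains piecewise differentiable.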
Tables
  • Table1: The three local update operators in AXE’s dynamic program
  • Table2: The performance (test set BLEU) of AXE CMLM compared to cross entropy CMLM on all of our benchmarks. Both models are purely non-autoregressive, using a single forward pass during argmax decoding
  • Table3: The performance (test set BLEU) of CMLMs trained with AXE, compared to other non-autoregressive methods. The standard (autoregressive) transformer results are also reported for reference
  • Table4: The performance (test set BLEU) of AXE CMLM, compared to other non-autoregressive methods on raw data. The result of AXE CMLM trained with distillation is also reported as a reference
  • Table5: The effect of different training objectives on performance, measured on WMT14 DE-EN and EN-DE (validation set BLEU)
  • Table6: The effect of changing the skip target penalty coefficient δ on performance (BLEU) and the percentage of target words that were skipped, using the validation sets of WMT14 DE-EN and EN-DE
  • Table7: The effect of tuning the length multiplier λ on performance (BLEU), using the validation set
  • Table8: The performance (test set BLEU) of cross entropy CMLM and AXE CMLM on WMT’14 EN-DE and DE-EN, bucketed by target sequence length (N)
  • Table9: The percentage of repeated tokens on the test sets of WMT’14 EN-DE and DE-EN
Related work
  • Advances in neural machine translation in recent years have brought increasing interest in breaking the autoregressive generation bottleneck in translation models.

    Semi-autoregressive models introduce partial parallelism into the decoding process. Some of these techniques include iterative refinement of translations based on previous predictions (Lee et al., 2018; Ghazvininejad et al., 2019; 2020; Gu et al., 2019; Kasai et al., 2020) and combining a lighter autoregressive decoder with a non-autoregressive one (Sun et al., 2019).

    Building a fully non-autoregressive machine translation model is a much more challenging task. One branch of prior work approaches this problem by modeling with latent variables. Gu et al. (2018) introduce word fertility as a latent variable to model the number of generated tokens per source word. Ma et al. (2019) use generative flow to model a complex distribution over latent variables for parallel decoding of the target. Shu et al. (2019) propose a latent-variable non-autoregressive model with continuous latent variables and a deterministic inference procedure.
Reference
  • Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41– 48, 2009.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Ghazvininejad, M., Levy, O., Liu, Y., and Zettlemoyer, L. Mask-predict: Parallel decoding of conditional masked language models. In Proc. of EMNLPIJCNLP, 2019. URL https://www.aclweb.org/anthology/D19-1633.
  • Ghazvininejad, M., Levy, O., and Zettlemoyer, L. Semiautoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785, 2020.
  • Gu, J., Bradbury, J., Xiong, C., Li, V. O., and Socher, R. Non-autoregressive neural machine translation. In Proc. of ICLR, 2018.
  • Gu, J., Wang, C., and Zhao, J. Levenshtein transformer. In Proc. of NeurIPS, 2019. URL https://arxiv.org/abs/1905.11006.
  • Kasai, J., Cross, J., Ghazvininejad, M., and Gu, J. Parallel machine translation with disentangled context transformer. arXiv preprint arXiv:2001.05136, 2020.
  • Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. In Proc. of EMNLP, 2016. URL https://arxiv.org/abs/1606.07947.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference for Learning Representations, 2015.
  • Lee, J. D., Mansimov, E., and Cho, K. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proc. of EMNLP, 2018. URL https://arxiv.org/abs/1802.06901.
  • Li, Z., Lin, Z., He, D., Tian, F., Qin, T., Wang, L., and Liu, T.-Y. Hint-based training for non-autoregressive machine translation. arXiv preprint arXiv:1909.06708, 2019.
  • Libovicky, J. and Helcl, J. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3016–3021, Brussels, Belgium, 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D18-1336.
  • Ma, X., Zhou, C., Li, X., Neubig, G., and Hovy, E. Flowseq: Non-autoregressive conditional sequence generation with generative flow. arXiv preprint arXiv:1909.02480, 2019.
  • Sakoe, H. and Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE transactions on acoustics, speech, and signal processing, 26(1): 43–49, 1978.
  • Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), August 2016. URL https://www.aclweb.org/anthology/P16-1162.
  • Shao, C., Zhang, J., Feng, Y., Meng, F., and Zhou, J. Minimizing the bag-of-ngrams difference for nonautoregressive neural machine translation. arXiv preprint arXiv:1911.09320, 2019.
  • Shu, R., Lee, J., Nakayama, H., and Cho, K. Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. arXiv preprint arXiv:1908.07181, 2019.
  • Stern, M., Chan, W., Kiros, J. R., and Uszkoreit, J. Insertion transformer: Flexible sequence generation via insertion operations. In Proc. of ICML, 2019. URL https://arxiv.org/abs/1902.03249.
  • Sun, Z., Li, Z., Wang, H., He, D., Lin, Z., and Deng, Z. Fast structured decoding for sequence models. In Advances in Neural Information Processing Systems, pp. 3011–3020, 2019.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
  • Neubig, G., Dou, Z.-Y., Hu, J., Michel, P., Pruthi, D., and Wang, X. compare-mt: A tool for holistic comparison of language generation systems. In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL) Demo Track, Minneapolis, USA, June 2019. URL http://arxiv.org/abs/1903.07926.
  • Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, July 2002. URL https://www.aclweb.org/anthology/P02-1040.
  • Post, M. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers. Association for Computational Linguistics, October 2018. URL https://www.aclweb.org/anthology/W18-6319.
  • Wang, Y., Tian, F., He, D., Qin, T., Zhai, C., and Liu, T.-Y. Non-autoregressive machine translation with auxiliary regularization. In Proc. of AAAI, 2019. URL https://arxiv.org/abs/1902.10245.
  • Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., and Auli, M. Pay less attention with lightweight and dynamic convolutions. International Conference on Learning Representations, 2019.
  • Yang, B., Liu, F., and Zou, Y. Non-autoregressive video captioning with iterative refinement. arXiv preprint arXiv:1911.12018, 2019.
  • Zhou, C., Neubig, G., and Gu, J. Understanding knowledge distillation in non-autoregressive machine translation. arXiv preprint arXiv:1911.02727, 2019.