Task-Level Curriculum Learning for Non-Autoregressive Neural Machine Translation

IJCAI 2020, pp. 3861–3867


Abstract

Non-autoregressive translation (NAT) achieves faster inference speed but at the cost of worse accuracy compared with autoregressive translation (AT). Since AT and NAT can share model structure and AT is an easier task than NAT due to the explicit dependency on previous target-side tokens, a natural idea is to gradually shift the model ...

Introduction
Highlights
  • Neural Machine Translation (NMT) has witnessed rapid progress in recent years [Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017]
  • We introduce semi-autoregressive translation (SAT) [Wang et al., 2018], which only generates a part of the tokens in parallel at each decoding step, as intermediate tasks to bridge the shift from autoregressive translation to non-autoregressive translation
  • We propose task-level curriculum learning for non-autoregressive translation (TCL-NAT), which trains the model with sequentially increased k
  • As for inference efficiency, we achieve a 16.0 times speedup (NPD 9), which is comparable with state-of-the-art methods (FCL-NAT and ENAT)
  • We treat autoregressive, semi-autoregressive and non-autoregressive translation as individual tasks with different k, and propose a task-level curriculum mechanism to shift the training process from k = 1 to N, where N is the length of the target sentence (a minimal decoding sketch follows this list)
  • Experiments on several benchmark translation datasets demonstrate the effectiveness of our method for non-autoregressive translation
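To make the role of k concrete, here is a minimal Python sketch, not the authors' code: `predict_group` and `decode` are hypothetical stand-ins (a real SAT/NAT decoder would condition on the source sentence and the generated prefix), and the target length N = 8 is an assumed toy value. The sketch only illustrates how the group size k trades decoding steps for parallelism, with k = 1 recovering autoregressive decoding and k = N recovering fully non-autoregressive decoding.

```python
from typing import List


def predict_group(prefix: List[str], start: int, k: int, target_len: int) -> List[str]:
    # Hypothetical stand-in for a trained SAT/NAT decoder: in reality this
    # would condition on the source sentence and the prefix generated so far.
    return [f"tok{i}" for i in range(start, min(start + k, target_len))]


def decode(target_len: int, k: int) -> List[str]:
    """Generate target_len tokens, emitting k tokens in parallel per step.

    k = 1          -> fully autoregressive decoding (target_len steps)
    1 < k < N      -> semi-autoregressive decoding (ceil(target_len / k) steps)
    k = target_len -> fully non-autoregressive decoding (1 step)
    """
    output: List[str] = []
    steps = 0
    while len(output) < target_len:
        output.extend(predict_group(output, len(output), k, target_len))
        steps += 1
    print(f"k = {k:>2}: {steps} decoding step(s)")
    return output


if __name__ == "__main__":
    N = 8  # assumed target sentence length, for illustration only
    for k in (1, 2, 4, N):
        decode(N, k)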
Results
  • The authors compare TCL-NAT with non-autoregressive baselines including NAT-FT [Gu et al., 2018], NAT-IR [Lee et al., 2018], ENAT [Guo et al., 2019a], NAT-Reg [Wang et al., 2019], FlowSeq [Ma et al., 2019] and FCL-NAT [Guo et al., 2019b].
  • For ENAT, NAT-Reg and FCL-NAT, the authors report their best results with B = 0 and B = 4, respectively.
  • As for inference efficiency, the authors achieve a 16.0 times speedup (NPD 9), which is comparable with state-of-the-art methods (FCL-NAT and ENAT); a sketch of noisy parallel decoding follows this list
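NPD here is noisy parallel decoding [Gu et al., 2018]: the NAT model decodes several candidates in parallel and an autoregressive teacher rescores them, so with B = 4 there are 2B + 1 = 9 candidates ("NPD 9", as defined under Table 4). The Python sketch below is only an illustration under assumptions: `nat_decode` and `at_score` are hypothetical stand-ins for a trained NAT decoder and an AT scorer, and the candidate set is assumed to be target lengths within ±B of the predicted length.

```python
from typing import List, Tuple


def nat_decode(source: List[str], length: int) -> List[str]:
    # Hypothetical stand-in: a real NAT model decodes all `length` target
    # tokens in parallel, conditioned on the source sentence.
    return [f"tok{i}" for i in range(length)]


def at_score(source: List[str], candidate: List[str]) -> float:
    # Hypothetical stand-in: real NPD rescores each candidate with the
    # autoregressive teacher's log-likelihood.
    return -abs(len(candidate) - len(source))


def noisy_parallel_decode(source: List[str], predicted_len: int, B: int = 4) -> List[str]:
    """Decode 2*B + 1 length candidates (9 when B = 4) and keep the best-scored one."""
    candidates: List[Tuple[float, List[str]]] = []
    for length in range(max(1, predicted_len - B), predicted_len + B + 1):
        hyp = nat_decode(source, length)  # all candidates can run in parallel
        candidates.append((at_score(source, hyp), hyp))
    return max(candidates, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    src = "wir haben das problem geloest".split()
    print(noisy_parallel_decode(src, predicted_len=len(src), B=4))
```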
Conclusion
  • The authors proposed a novel task-level curriculum learning method to improve the accuracy of non-autoregressive translation.
  • The authors treat autoregressive, semi-autoregressive and non-autoregressive translation as individual tasks with different k, and propose a task-level curriculum mechanism to shift the training process from k = 1 to N, where N is the length of the target sentence.
  • The authors expect that task-level curriculum learning could become a general training paradigm for a broader range of tasks
Tables
  • Table1: The BLEU scores on the test set of the IWSLT14 De-En task. The model is trained with one value of k for 80k steps but tested with another k. The italic numbers show the accuracy of models that train and test with the same k. Row 1 shows that models trained with task k = 4, 8, 16 can achieve reasonable accuracy on NAT. The bold numbers show that, when testing on a task k, models trained with a smaller but nearby task achieve better scores than models trained with a task much smaller than k
  • Table2: The training steps of TCL-NAT for different datasets for each phase
  • Table3: The proposed different curriculum pacing functions and their definitions. S_SAT denotes the total number of steps in the SAT training phase. We choose constants empirically to meet the actual training situation (a schematic sketch of such a schedule follows this list)
  • Table4: The BLEU scores of our proposed TCL-NAT and the baseline methods on the IWSLT14 De-En, IWSLT16 En-De, WMT14 De-En and WMT14 En-De tasks. NPD 9 indicates results of noisy parallel decoding with 9 candidates, i.e., B = 4, otherwise B = 0
  • Table5: The comparison of BLEU scores on the test set of IWSLT14 De-En task among different pacing functions
  • Table6: The comparison of BLEU scores on the test set of IWSLT14 De-En task among different task windows
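Tables 2 and 3 specify when training switches between tasks and how fast k grows during the SAT phase. The sketch below is only a schematic under assumptions, not the paper's schedule: the phase lengths, the k ladder (2, 4, 8) and the linear placeholder pacing are illustrative values standing in for the actual numbers and pacing functions given in Tables 2 and 3, and `make_schedule` is a hypothetical helper.

```python
from typing import Callable, Sequence


def make_schedule(s_at: int, s_sat: int, n_max: int,
                  ladder: Sequence[int] = (2, 4, 8)) -> Callable[[int], int]:
    """Return a function mapping a global training step to the group size k."""
    def k_at(step: int) -> int:
        if step < s_at:
            return 1                                   # AT phase: k = 1
        if step < s_at + s_sat:
            frac = (step - s_at) / s_sat               # progress through the SAT phase
            return ladder[min(int(frac * len(ladder)), len(ladder) - 1)]
        return n_max                                   # NAT phase: k = N
    return k_at


if __name__ == "__main__":
    # Placeholder phase lengths; the paper's per-dataset values are in Table 2.
    k_at = make_schedule(s_at=30_000, s_sat=60_000, n_max=50)
    for step in (0, 29_999, 30_000, 59_999, 60_000, 89_999, 90_000):
        print(f"step {step:>6} -> k = {k_at(step)}")
```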
Related work
  • In this section, we first introduce related work on neural machine translation, including autoregressive translation (AT), non-autoregressive translation (NAT) and semi-autoregressive translation (SAT), and then describe three learning paradigms related to our method: transfer learning, multitask learning and curriculum learning.

    2.1 Neural Machine Translation (AT/NAT/SAT)

    An autoregressive translation (AT) model takes a source sentence s as input and then generates the tokens of the target sentence y one by one during inference [Bahdanau et al., 2015; Sutskever et al., 2014; Vaswani et al., 2017], which incurs high inference latency. To improve the inference speed of AT models, a series of works develop non-autoregressive translation (NAT) models based on Transformer [Gu et al., 2018; Lee et al., 2018; Li et al., 2019; Wang et al., 2019; Guo et al., 2019a], which generate all the target tokens in parallel. Several works introduce auxiliary components or losses to improve the accuracy of NAT models: Wang et al. [2019] and Li et al. [2019] propose auxiliary loss functions to address the tendency of NAT models to miss or repeat tokens in translation; Guo et al. [2019a] enhance the decoder input with target-side information by leveraging auxiliary information; Ma et al. [2019] introduce generative flow to directly model the joint distribution of all target tokens simultaneously. While NAT models achieve faster inference speed, their translation accuracy is still worse than that of AT models. Some works aim to balance translation accuracy and inference latency between AT and NAT by introducing semi-autoregressive translation (SAT) [Wang et al., 2018], which generates multiple adjacent tokens in parallel during autoregressive generation.

    Different from the above works, we leverage AT, SAT and NAT together and schedule the training in a curriculum way to achieve better translation accuracy for NAT.
Funding
  • This work was supported in part by the National Key R&D Program of China (Grant No.2018AAA0100603), Zhejiang Natural Science Foundation (LR19F020006), National Natural Science Foundation of China (Grant No.61836002), National Natural Science Foundation of China (Grant No.U1611461), National Natural Science Foundation of China (Grant No.61751209), and Microsoft Research Asia
Reference
  • [Anastasopoulos and Chiang, 2018] Antonios Anastasopoulos and David Chiang. Tied multitask learning for neural speech translation. In NAACL, pages 82–91, June 2018.
  • [Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • [Bengio et al., 2009] Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pages 41–48. ACM, 2009.
  • [Caruana, 1997] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
  • [Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
  • [Dong et al., 2015] Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In ACL-IJCNLP, pages 1723–1732, 2015.
  • [Garg et al., 2019] Sarthak Garg, Stephan Peitz, Udhyakumar Nallasamy, and Matthias Paulik. Jointly learning to align and translate with transformer models. In EMNLP-IJCNLP, pages 4452–4461, November 2019.
  • [Gehring et al., 2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In ICML, pages 1243–1252, 2017.
  • [Ghazvininejad et al., 2019] Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In EMNLP-IJCNLP, pages 6114–6123, 2019.
  • [Gu et al., 2018] Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. In ICLR, 2018.
  • [Gu et al., 2019] Jiatao Gu, Changhan Wang, and Junbo Zhao. Levenshtein transformer. In NeurIPS, pages 11179–11189, 2019.
  • [Guo et al., 2019a] Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. Non-autoregressive neural machine translation with enhanced decoder input. In AAAI, volume 33, pages 3723–3730, 2019.
  • [Guo et al., 2019b] Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and Tie-Yan Liu. Fine-tuning by curriculum learning for non-autoregressive neural machine translation. arXiv preprint arXiv:1911.08717, 2019.
  • [Kim and Rush, 2016] Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In EMNLP, pages 1317–1327, 2016.
  • [Lee and Grauman, 2011] Yong Jae Lee and Kristen Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR, pages 1721–1728. IEEE, 2011.
  • [Lee et al., 2018] Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In EMNLP, pages 1173–1182, 2018.
  • [Li et al., 2019] Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu. Hint-based training for non-autoregressive machine translation. In EMNLP-IJCNLP, pages 5712–5717, 2019.
  • [Ma et al., 2019] Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In EMNLP-IJCNLP, pages 4273–4283, 2019.
  • [Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318, 2002.
  • [Ren et al., 2019] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263, 2019.
  • [Sachan and Xing, 2016] Mrinmaya Sachan and Eric Xing. Easy questions first? A case study on curriculum learning for question answering. In ACL, volume 1, pages 453–463, 2016.
  • [Sarafianos et al., 2017] Nikolaos Sarafianos, Theodore Giannakopoulos, Christophoros Nikou, and Ioannis A. Kakadiaris. Curriculum learning for multi-task classification of visual attributes. In ICCV, pages 2608–2615, 2017.
  • [Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, pages 1715–1725, 2016.
  • [Song et al., 2019] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. In ICML, pages 5926–5936, 2019.
  • [Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
  • [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 6000–6010, 2017.
  • [Vaswani et al., 2018] Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. Tensor2Tensor for neural machine translation. In AMTA, pages 193–199, 2018.
  • [Wang et al., 2018] Chunqi Wang, Ji Zhang, and Haiqing Chen. Semi-autoregressive neural machine translation. In EMNLP, pages 479–488, 2018.
  • [Wang et al., 2019] Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Non-autoregressive machine translation with auxiliary regularization. In AAAI, 2019.