Task-Level Curriculum Learning for Non-Autoregressive Neural Machine Translation
IJCAI 2020, pp. 3861–3867
Non-autoregressive translation (NAT) achieves faster inference speed but at the cost of worse accuracy compared with autoregressive translation (AT). Since AT and NAT can share model structure, and AT is an easier task than NAT due to the explicit dependency on previous target-side tokens, a natural idea is to gradually shift the model […]
- Neural Machine Translation (NMT) has witnessed rapid progress in recent years [Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017].
- A variety of works have tried to improve the accuracy of NAT, including enhanced decoder input with embedding mapping [Guo et al., 2019a], generative flow [Ma et al., 2019], and iterative refinement [Ghazvininejad et al., 2019; Lee et al., 2018].
- As AT models are more accurate and easier to train than NAT models due to the explicit dependency on previous target tokens, a natural idea is to first train the model on the easier AT task and then continue training it on the harder NAT task.
- We introduce semi-autoregressive translation (SAT) [Wang et al., 2018], which generates only a part of the target tokens in parallel at each decoding step, as intermediate tasks to bridge the shift from autoregressive to non-autoregressive translation.
- We propose task-level curriculum learning for non-autoregressive translation (TCL-NAT), which trains the model with a sequentially increased k.
- As for inference efficiency, we achieve a 16.0× speedup (NPD 9), which is comparable with state-of-the-art methods (FCL-NAT and ENAT).
- We treat semi-autoregressive and non-autoregressive translation as individual tasks with different k and propose a task-level curriculum mechanism to shift the training process from k = 1 to k = N, where N is the length of the target sentence.
- Experiments on several benchmark translation datasets demonstrate the effectiveness of our method for non-autoregressive translation.
- The authors compare TCL-NAT with non-autoregressive baselines including NAT-FT [Gu et al., 2018], NAT-IR [Lee et al., 2018], ENAT [Guo et al., 2019a], NAT-Reg [Wang et al., 2019], FlowSeq [Ma et al., 2019] and FCL-NAT [Guo et al., 2019b].
- For ENAT, NAT-Reg and FCL-NAT, the authors report their best results with B = 0 and B = 4, respectively.
- As for inference efficiency, the authors achieve a 16.0× speedup (NPD 9), which is comparable with state-of-the-art methods (FCL-NAT and ENAT).
- The authors proposed a novel task-level curriculum learning method to improve the accuracy of non-autoregressive translation; the task-window comparison is reported in Table 6.
- The authors expect task-level curriculum learning could become a general training paradigm for a broader range of tasks
- Table 1: The BLEU scores on the test set of the IWSLT14 De-En task. The model is trained with one task k′ for 80k steps but tested with another k. The italic numbers show the accuracy of models that train and test with the same k. Row 1 shows that models trained with task k = 4, 8, 16 can achieve reasonable accuracy on NAT. The bold numbers show that models trained with a task k′ < k can achieve better scores than those trained with a smaller k′ when tested on task k.
- Table 2: The training steps of TCL-NAT in each phase for different datasets
- Table 3: The proposed curriculum pacing functions and their definitions. S_SAT denotes the total number of steps in the SAT training phase. Constants are chosen empirically to match the actual training setup
- Table 4: The BLEU scores of our proposed TCL-NAT and the baseline methods on the IWSLT14 De-En, IWSLT16 En-De, WMT14 De-En and WMT14 En-De tasks. NPD 9 indicates results of noisy parallel decoding with 9 candidates, i.e., B = 4; otherwise B = 0
- Table 5: The comparison of BLEU scores on the test set of the IWSLT14 De-En task among different pacing functions
- Table 6: The comparison of BLEU scores on the test set of the IWSLT14 De-En task among different task windows
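The task-level curriculum above maps the training step to the current task k via a pacing function (Table 3). The sketch below is a hypothetical linear pacing illustration, assuming a power-of-two task set {1, 2, 4, 8, 16} and a made-up phase length; the paper's exact pacing functions and constants differ.

```python
import math

def curriculum_k(step, s_sat, n_max=16):
    """Map a training step to the current task k (k tokens decoded in parallel).

    Illustrative linear pacing only: k walks through the powers of two
    1, 2, ..., n_max during the SAT phase of s_sat steps, after which the
    model trains purely non-autoregressively (k = N, the target length).
    """
    if step >= s_sat:                        # SAT phase finished: pure NAT training
        return "N"                           # stands for k = N (full target length)
    n_levels = int(math.log2(n_max)) + 1     # e.g. 1, 2, 4, 8, 16 -> 5 levels
    frac = step / s_sat                      # progress through the SAT phase
    level = min(int(frac * n_levels), n_levels - 1)
    return 2 ** level

# Example schedule over a hypothetical 100k-step SAT phase
print([curriculum_k(s, 100_000) for s in (0, 25_000, 50_000, 99_999, 100_000)])
# → [1, 2, 4, 16, 'N']
```

Other pacing functions from Table 3 (e.g. logarithmic) would change only how `frac` is warped before picking the level.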
In this section, we first introduce related work on neural machine translation, including autoregressive translation (AT), non-autoregressive translation (NAT) and semi-autoregressive translation (SAT), and then describe three learning paradigms related to our method: transfer learning, multitask learning and curriculum learning.
2.1 Neural Machine Translation (AT/NAT/SAT)
An autoregressive translation (AT) model takes a source sentence s as input and generates the tokens of the target sentence y one by one during inference [Bahdanau et al., 2015; Sutskever et al., 2014; Vaswani et al., 2017], which incurs high inference latency. To improve the inference speed of AT models, a series of works develop non-autoregressive translation (NAT) models based on Transformer [Gu et al., 2018; Lee et al., 2018; Li et al., 2019; Wang et al., 2019; Guo et al., 2019a], which generate all the target tokens in parallel. Several works introduce auxiliary components or losses to improve the accuracy of NAT models: Wang et al. [2019] and Li et al. [2019] propose auxiliary loss functions to address the tendency of NAT models to miss or duplicate tokens; Guo et al. [2019a] enhance the decoder input with target-side information by leveraging auxiliary information; Ma et al. [2019] introduce generative flow to directly model the joint distribution of all target tokens simultaneously. While NAT models achieve faster inference, their translation accuracy is still worse than that of AT models. Some works aim to balance translation accuracy and inference latency between AT and NAT by introducing semi-autoregressive translation (SAT) [Wang et al., 2018], which generates multiple adjacent tokens in parallel during autoregressive generation.
Different from the above works, we leverage AT, SAT and NAT together and schedule the training in a curriculum way to achieve better translation accuracy for NAT.
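The AT/SAT/NAT latency trade-off described above can be made concrete with a toy count of sequential decoder invocations when k adjacent tokens are emitted in parallel per step. This is a sketch of the latency arithmetic only, not the authors' measured speedups, which also depend on hardware and batching.

```python
import math

def decoding_steps(target_len, k):
    """Sequential decoder invocations for a (semi-)autoregressive model
    that emits k adjacent tokens in parallel per step.
    k = 1 is standard AT; k = target_len is fully non-autoregressive (NAT)."""
    return math.ceil(target_len / k)

N = 16  # toy target sentence length
for k in (1, 2, 4, 8, N):
    print(f"k={k:>2}: {decoding_steps(N, k)} sequential steps")
```

For N = 16 this yields 16, 8, 4, 2 and 1 sequential steps, which is why shifting the curriculum from k = 1 toward k = N trades latency for task difficulty.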
- This work was supported in part by the National Key R&D Program of China (Grant No. 2018AAA0100603), Zhejiang Natural Science Foundation (LR19F020006), National Natural Science Foundation of China (Grant Nos. 61836002, U1611461 and 61751209), and Microsoft Research Asia.
- [Anastasopoulos and Chiang, 2018] Antonios Anastasopoulos and David Chiang. Tied multitask learning for neural speech translation. In NAACL, pages 82–91, June 2018.
- [Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.
- [Bengio et al., 2009] Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pages 41–48. ACM, 2009.
- [Caruana, 1997] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
- [Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
- [Dong et al., 2015] Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In ACL-IJCNLP, pages 1723–1732, 2015.
- [Garg et al., 2019] Sarthak Garg, Stephan Peitz, Udhyakumar Nallasamy, and Matthias Paulik. Jointly learning to align and translate with transformer models. In EMNLP-IJCNLP, pages 4452–4461, November 2019.
- [Gehring et al., 2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In ICML, pages 1243–1252, 2017.
- [Ghazvininejad et al., 2019] Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In EMNLP-IJCNLP, pages 6114–6123, 2019.
- [Gu et al., 2018] Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. In ICLR, 2018.
- [Gu et al., 2019] Jiatao Gu, Changhan Wang, and Junbo Zhao. Levenshtein transformer. In Advances in Neural Information Processing Systems, pages 11179–11189, 2019.
- [Guo et al., 2019a] Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. Non-autoregressive neural machine translation with enhanced decoder input. In AAAI, volume 33, pages 3723–3730, 2019.
- [Guo et al., 2019b] Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and Tie-Yan Liu. Fine-tuning by curriculum learning for non-autoregressive neural machine translation. arXiv preprint arXiv:1911.08717, 2019.
- [Kim and Rush, 2016] Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In EMNLP, pages 1317–1327, 2016.
- [Lee and Grauman, 2011] Yong Jae Lee and Kristen Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR 2011, pages 1721–1728. IEEE, 2011.
- [Lee et al., 2018] Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In EMNLP, pages 1173–1182, 2018.
- [Li et al., 2019] Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu. Hint-based training for non-autoregressive machine translation. In EMNLP-IJCNLP, pages 5712–5717, 2019.
- [Ma et al., 2019] Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. Flowseq: Non-autoregressive conditional sequence generation with generative flow. In EMNLP-IJCNLP, pages 4273–4283, 2019.
- [Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318, 2002.
- [Ren et al., 2019] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263, 2019.
- [Sachan and Xing, 2016] Mrinmaya Sachan and Eric Xing. Easy questions first? a case study on curriculum learning for question answering. In ACL, volume 1, pages 453–463, 2016.
- [Sarafianos et al., 2017] Nikolaos Sarafianos, Theodore Giannakopoulos, Christophoros Nikou, and Ioannis A Kakadiaris. Curriculum learning for multi-task classification of visual attributes. In ICCV, pages 2608–2615, 2017.
- [Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, pages 1715–1725, 2016.
- [Song et al., 2019] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mass: Masked sequence to sequence pre-training for language generation. In ICML, pages 5926–5936, 2019.
- [Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 6000–6010, 2017.
- [Vaswani et al., 2018] Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. Tensor2tensor for neural machine translation. In AMTA, pages 193–199, 2018.
- [Wang et al., 2018] Chunqi Wang, Ji Zhang, and Haiqing Chen. Semi-autoregressive neural machine translation. In EMNLP, pages 479–488, 2018.
- [Wang et al., 2019] Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Non-autoregressive machine translation with auxiliary regularization. In AAAI, 2019.