A Study of Non-autoregressive Model for Sequence Generation
ACL 2020, pp. 149-159
Non-autoregressive (NAR) models generate all the tokens of a sequence in parallel, resulting in faster generation speed compared to their autoregressive (AR) counterparts but at the cost of lower accuracy. Different techniques including knowledge distillation and source-target alignment have been proposed to bridge the gap between AR and NAR models.
- Non-autoregressive (NAR) models (Oord et al, 2017; Gu et al, 2017; Chen et al, 2019; Ren et al, 2019), which generate all the tokens in a target sequence in parallel and can speed up inference, are widely explored in natural language and speech processing tasks such as neural machine translation (NMT) (Gu et al, 2017; Lee et al, 2018; Guo et al, 2019a; Wang et al, 2019; Li et al, 2019b; Guo et al, 2019b), automatic speech recognition (ASR) (Chen et al, 2019) and text to speech (TTS) synthesis (Oord et al, 2017; Ren et al, 2019).
- To better understand NAR sequence generation and answer the above questions, the authors need to characterize and quantify the target-token dependency, which turns out to be non-trivial since the sequences can be of different modalities (text or speech).
- For this purpose, the authors design a novel model called COnditional Masked prediction model with MixAttention (CoMMA), inspired by the mix-attention in He et al (2018) and the masked language modeling in Devlin et al (2018): in CoMMA, (1) the prediction of one target token can attend to all the source and target tokens with mix-attention, and (2) target tokens are randomly masked with varying probabilities.
- CoMMA can help them measure target-token dependency using the ratio of the attention weights on the target context over those on the full context when predicting a target token: the bigger the ratio, the larger the dependency among target tokens.
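The random masking step above can be illustrated with a minimal sketch; the `mask_targets` helper and the `<MASK>` symbol are hypothetical simplifications for illustration, not the authors' implementation:

```python
import random

def mask_targets(target_tokens, p, mask_token="<MASK>"):
    """Randomly mask each target token with probability p (CoMMA-style).

    The paper varies p to probe how much target context the model needs;
    token and mask representations are simplified to plain strings here.
    Returns the masked sequence and the positions whose original tokens
    the model must predict.
    """
    masked, positions = [], []
    for i, tok in enumerate(target_tokens):
        if random.random() < p:
            masked.append(mask_token)
            positions.append(i)
        else:
            masked.append(tok)
    return masked, positions
```

Sweeping p from 0 to 1 moves the model from full target context (AR-like) to none (NAR-like), which is what makes the dependency measurable at different context levels.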
- Considering that R(p) measures how much context information from target side is needed to generate a target token, we can see that automatic speech recognition has more dependency on the target context and less on the source context, while text to speech is the opposite, which is consistent with the accuracy gap between AR and NAR models as we described in Section 3.1
- We find that R(p) in neural machine translation decreases more quickly than in the other two tasks, which indicates that neural machine translation is good at learning from the source context when less context information can be leveraged from the target side, while R(p) in automatic speech recognition decreases little.
- It can be seen that knowledge distillation can boost the accuracy of NAR in neural machine translation and text to speech, which is consistent with the previous works
- We design a novel conditional masked prediction model with mix-attention and a metric called attention density ratio to measure the dependency on target context when predicting a target token, which enables these questions to be analyzed in a unified way.
- Through a series of empirical studies, we demonstrate that the difficulty of NAR generation correlates with the target-token dependency, and that both knowledge distillation and alignment constraint reduce the dependency on target tokens and encourage the model to rely more on the source context for target-token prediction, which improves the accuracy of NAR models.
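The attention density ratio described above can be sketched numerically. This assumes a single, already-normalized mix-attention weight vector over the concatenated [source; target] context; the paper's R(p) is this per-token ratio averaged over many positions and sentences at masking probability p, so this is a simplified building block rather than the full metric:

```python
import numpy as np

def attention_density_ratio(attn_weights, num_source):
    """Per-token attention density ratio.

    attn_weights: 1-D array of mix-attention weights over the
        concatenated [source; target] context (source positions first).
    num_source: number of source positions at the front of the context.
    Returns the attention mass on the target context divided by the
    mass on the full context; larger values mean stronger dependency
    on target tokens.
    """
    attn = np.asarray(attn_weights, dtype=float)
    target_mass = attn[num_source:].sum()
    return float(target_mass / attn.sum())
```

For example, with weights [0.2, 0.3] on two source tokens and [0.1, 0.4] on two target tokens, the ratio is 0.5, i.e. half of the attention mass sits on the target side.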
- Results of Accuracy Gap
The accuracies of the AR and NAR models in each task are shown in Table 2.
3.2 The Token Dependency
In the last subsection, the authors analyze the difficulty of NAR models from the perspective of the accuracy gap.
- It can be seen that attention constraint can improve the performance of NMT and TTS as previous works (Li et al, 2019b; Ren et al, 2019) demonstrated, and help the NAR-ASR model achieve better scores.
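As a rough illustration of an alignment constraint, the sketch below penalizes encoder-decoder attention mass that falls far from the diagonal, in the spirit of the guided-attention losses used in prior TTS and NMT work cited above. The function name, the Gaussian diagonal prior, and the width `g` are assumptions for illustration; the exact constraint in the paper may differ:

```python
import numpy as np

def diagonal_attention_penalty(attn, g=0.2):
    """Guided-attention-style alignment penalty.

    attn: [T_dec, T_enc] attention matrix (rows: decoder steps,
        columns: encoder positions).
    g: width of the diagonal band; smaller values penalize
        off-diagonal attention more sharply.
    Weights each attention entry by how far it lies from the
    (normalized) diagonal and returns the mean weighted mass,
    so perfectly diagonal attention incurs zero penalty.
    """
    T_dec, T_enc = attn.shape
    n = np.arange(T_dec)[:, None] / T_dec  # normalized decoder position
    m = np.arange(T_enc)[None, :] / T_enc  # normalized encoder position
    w = 1.0 - np.exp(-((n - m) ** 2) / (2.0 * g * g))
    return float((attn * w).mean())
```

Adding such a term to the training loss discourages scattered source-target attention, which is one way an alignment constraint can push the model to use the source context more directly.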
- The authors conducted a comprehensive study on NAR models in NMT, ASR and TTS tasks to analyze several research questions, including the difficulty of NAR generation and why knowledge distillation and alignment constraint can help NAR models.
- The authors believe these analyses can shed light on the understanding and further improvement of NAR models.
- Table1: The AR and NAR models we consider in each task. "AC" means the attention constraint we mentioned in Section 5
- Table2: The accuracy gap between NAR and AR models. It can be seen that the NAR model can match the accuracy of the AR model in TTS, while the gap still exists in ASR and NMT. We calculate both the WER and BLEU metrics in ASR and NMT for better comparison. It can be seen that ASR has a larger gap than NMT. A larger accuracy gap may indicate that NAR generation is more difficult in that task. Next, we try to understand what factors influence the difficulty across different tasks
- Table3: The comparison between NAR models with and without knowledge distillation
- Table4: The comparison between NAR models with and without alignment constraint
- Table5: Hyperparameters of transformer-based AR and NAR models
- Table6: Hyperparameters of CoMMA
- Several works try to analyze and understand NAR models on different tasks. We discuss these analyses from two aspects: knowledge distillation and source-target alignment constraint.
Knowledge Distillation Knowledge distillation has long been used to compress the model size (Hinton et al, 2015; Furlanello et al, 2018; Yang et al, 2018; Anil et al, 2018; Li et al, 2017) or transfer the knowledge of teacher model to student model (Tan et al, 2019; Liu et al, 2019a,b), and soon been applied to NAR models (Gu et al, 2017; Oord et al, 2017; Guo et al, 2019a; Wang et al, 2019; Li et al, 2019b; Guo et al, 2019b; Ren et al, 2019) to boost the accuracy. Some works focus on studying why knowledge distillation works: Phuong and Lampert (2019) provide some insights into the mechanisms of knowledge distillation by studying the special case of linear and deep linear classifiers and find that data geometry, optimization bias and strong monotonicity determine the success of distillation; Yuan et al (2019) argue that the success of KD is also due to the regularization of soft targets, which might be as important as the similarity information between categories.
However, few works have studied the cause of why knowledge distillation benefits NAR training. Recently, Zhou et al (2019) investigate why knowledge distillation is important for the training of NAR model in NMT task and find that knowledge distillation can reduce the complexity of data sets and help NAR model to learn the variations in the output data.
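The sequence-level knowledge distillation recipe referenced above is commonly implemented by re-labeling the training set with an AR teacher's outputs and training the NAR student on those; the sketch below assumes hypothetical `ar_teacher.translate` and dataset objects and is not the authors' code:

```python
def distill_dataset(ar_teacher, source_sentences):
    """Sequence-level knowledge distillation (data re-labeling).

    Replaces each reference target with the AR teacher's decoded
    output, yielding a simpler, less multi-modal training set that
    the NAR student then trains on as usual.
    """
    return [(src, ar_teacher.translate(src)) for src in source_sentences]
```

In line with Zhou et al (2019), the distilled targets are more deterministic functions of the source, which is one explanation for why NAR students trained on them close part of the accuracy gap.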
- This work was supported in part by the National Key R&D Program of China (Grant No. 2018AAA0100603), Zhejiang Natural Science Foundation (LR19F020006), and National Natural Science Foundation of China (Grant Nos. 61836002, U1611461, and 61751209).
- This work was also partially funded by Microsoft Research Asia
- Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. 2018. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235.
- Nanxin Chen, Shinji Watanabe, Jesus Villalba, and Najim Dehak. 2019. Non-autoregressive transformer automatic speech recognition. arXiv preprint arXiv:1911.04908.
- Wenhu Chen, Evgeny Matusov, Shahram Khadivi, and Jan-Thorsten Peter. 2016. Guided alignment training for topic-aware neural machine translation. CoRR, abs/1607.01628.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Tommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born again neural networks. arXiv preprint arXiv:1805.04770.
- Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.
- Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019a. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3723–3730.
- Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2019b. Fine-tuning by curriculum learning for non-autoregressive neural machine translation. arXiv preprint arXiv:1911.08717.
- Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. 2018. Layer-wise coordination between encoder and decoder for neural machine translation. In Advances in Neural Information Processing Systems, pages 7944–7954.
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Keith Ito. 2017. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset.
- Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al. 2019. A comparative study on transformer vs rnn in speech applications. arXiv preprint arXiv:1909.06317.
- Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.
- Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, and M Zhou. 2019a. Neural speech synthesis with transformer network. AAAI.
- Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. 2017. Learning from noisy labels with distillation. In ICCV, pages 1928– 1936.
- Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019b. Hint-based training for non-autoregressive machine translation. arXiv preprint arXiv:1909.06708.
- Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.
- Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019b. End-to-end speech translation with knowledge distillation. arXiv preprint arXiv:1904.08075.
- Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. 2017. Parallel wavenet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.
- Mary Phuong and Christoph Lampert. 2019. Towards understanding knowledge distillation. In International Conference on Machine Learning, pages 5142–5151.
- Ryan Prenger, Rafael Valle, and Bryan Catanzaro. 2019. Waveglow: A flow-based generative network for speech synthesis. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE.
- Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Fastspeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE.
- Hao Sun, Xu Tan, Jun-Wei Gan, Hongzhi Liu, Sheng Zhao, Tao Qin, and Tie-Yan Liu. 2019. Token-level ensemble distillation for grapheme-to-phoneme conversion. In INTERSPEECH.
- Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and TieYan Liu. 2019. Multilingual neural machine translation with knowledge distillation. arXiv preprint arXiv:1902.10461.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
- Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019. Non-autoregressive machine translation with auxiliary regularization. In AAAI.
- Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan Yuille. 2018. Knowledge distillation in generations: More tolerant teachers educate better students. arXiv preprint arXiv:1805.05551.
- Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. 2019. Revisit knowledge distillation: a teacher-free framework. arXiv preprint arXiv:1909.11723.
- Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. 2019. LibriTTS: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882.
- Chunting Zhou, Graham Neubig, and Jiatao Gu. 2019. Understanding knowledge distillation in non-autoregressive machine translation. arXiv preprint arXiv:1911.02727.