SimulSpeech: End-to-End Simultaneous Speech to Text Translation
ACL, pp. 3787–3796 (2020)
1 Introduction
In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in the source language to text in the target language concurrently. SimulSpeech consists of a speech encoder, a speech segmenter and a text decoder, where the segmenter builds upon the encoder.
- The authors develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in the source language to text in the target language concurrently.
- Previous works (Berard et al., 2016; Weiss et al., 2017; Liu et al., 2019) on speech to text translation focus on full-sentence translation, where the full source speech can be seen when predicting each target token.
- As shown in Figure 2b, the authors introduce the CTC loss for the training of the speech segmenter, and attention-level and data-level knowledge distillation for the training of the overall SimulSpeech model.
- To better train the SimulSpeech model, we propose a novel attention-level knowledge distillation that is specially designed for speech to text translation.
- All the sentences are first tokenized with the Moses tokenizer and segmented into subword symbols using Byte Pair Encoding (BPE) (Sennrich et al., 2016), except for the labels used to train the speech segmenter.
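BPE segmentation works by greedily applying a list of learned symbol merges to each word. The sketch below is a toy illustration with a hypothetical hand-written merge list; in practice the merges are learned from corpus statistics with a tool such as subword-nmt.

```python
# Toy sketch of applying learned BPE merges to a tokenized word.
# The merge list below is hypothetical; real merges are learned from
# corpus statistics (Sennrich et al., 2016).

def apply_bpe(word, merges):
    """Greedily apply merge rules (in priority order) to a word's symbols."""
    symbols = list(word) + ["</w>"]  # end-of-word marker
    merge_rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        # Find the highest-priority adjacent pair present in the word.
        pairs = [(merge_rank.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, idx = min(pairs, default=(float("inf"), -1))
        if rank == float("inf"):
            break  # no applicable merges left
        symbols[idx:idx + 2] = [symbols[idx] + symbols[idx + 1]]
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
print(apply_bpe("lower", merges))  # ['lower', '</w>']
```

A word whose merges are all in the table collapses back to a single token, while unseen words fall apart into smaller subword pieces, which is what makes BPE robust to rare words.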
- We extend the average proportion and average latency metrics, originally calculated on word sequences, to speech sequences for the simultaneous speech to text translation task.
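One such latency metric, average lagging from the wait-k literature (Ma et al., 2018), can be sketched as below. The read schedule `g` and the toy lengths are illustrative; extending the "unit" from words to speech frames or segments gives the speech-level variant described in this bullet.

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (Ma et al., 2018) for a read schedule g, where
    g[t] is the number of source units read before emitting target
    unit t+1 (0-indexed)."""
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: first target step at which the full source has been read
    tau = next(t for t, gt in enumerate(g) if gt >= src_len) + 1
    return sum(g[t] - t / gamma for t in range(tau)) / tau

# wait-k read schedule: read k units, then alternate read/write
k, src_len, tgt_len = 3, 6, 6
g = [min(k + t, src_len) for t in range(tgt_len)]
print(average_lagging(g, src_len, tgt_len))  # 3.0
```

For equal-length source and target, the wait-k schedule yields an average lagging of exactly k units, which is a useful sanity check.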
- The accuracy of the SimulSpeech model is always better than that of test-time wait-k, which demonstrates the effectiveness of SimulSpeech.
- We further introduced several techniques including data-level and attention-level knowledge distillation to boost the accuracy of SimulSpeech
- The authors use best-path decoding (Graves et al., 2006) to decide the word boundary without seeing subsequent speech frames, which is consistent with the masked self-attention in the speech encoder, i.e., the output of the segmenter for position i depends only on the inputs at positions preceding i.
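Best-path (greedy) CTC decoding takes the argmax label at every frame, collapses repeats, and drops blanks; since each frame's decision needs no future frames, it suits the streaming segmenter described above. A minimal sketch with toy posteriors:

```python
def ctc_best_path(frame_probs, blank=0):
    """Greedy (best-path) CTC decoding: argmax label per frame,
    collapse consecutive repeats, then drop blanks."""
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Toy posteriors over {blank=0, word-boundary=1, other=2} for 6 frames
probs = [
    [0.7, 0.1, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.8, 0.1],  # repeated '1' collapses into one boundary
    [0.9, 0.05, 0.05],
    [0.3, 0.6, 0.1],
    [0.8, 0.1, 0.1],
]
print(ctc_best_path(probs))  # [1, 1] -> two word boundaries detected
```

Best-path decoding is an approximation to the most probable labelling (which would require summing over alignments), but it is cheap and, crucially here, causal.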
- In order to obtain the attention weights of simultaneous ASR and NMT, the authors add auxiliary simultaneous ASR and NMT tasks which share the same encoder or decoder with SimulSpeech model respectively, as shown in Figure 2b.
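Under our reading of the paper, the attention-level distillation target chains the NMT attention (target token → source token) with the ASR attention (source token → speech frame), and the student's speech-translation attention is regressed toward this matrix product. The sketch below uses toy dimensions and random row-stochastic matrices as stand-ins for real attention weights:

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def random_attention(rows, cols, rng):
    """Random row-stochastic matrix standing in for attention weights."""
    m = [[rng.random() for _ in range(cols)] for _ in range(rows)]
    return [[v / sum(row) for v in row] for row in m]

rng = random.Random(0)
attn_nmt = random_attention(4, 5, rng)    # target text -> source text
attn_asr = random_attention(5, 12, rng)   # source text -> speech frames
attn_st  = random_attention(4, 12, rng)   # student: target text -> speech

# Distillation target: chained teacher attention (rows still sum to 1)
teacher = matmul(attn_nmt, attn_asr)
# Simple L2 distillation loss between student and chained teacher
kd_loss = sum((s - t) ** 2
              for srow, trow in zip(attn_st, teacher)
              for s, t in zip(srow, trow)) / (4 * 12)
```

A convenient property is that the product of two row-stochastic matrices is itself row-stochastic, so the chained teacher remains a valid attention distribution over speech frames.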
- The authors first train a full-sentence NMT teacher model and use it to generate target text from the source text that is paired with the source speech x.
- The authors then train the student SimulSpeech model on the generated target text, which is paired with the source speech x.
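The two bullets above describe data-level (sequence-level) knowledge distillation (Kim and Rush, 2016). A minimal sketch of the corpus-rewriting step, with a hypothetical `teacher_nmt` callable standing in for a real beam-search NMT model:

```python
# Sketch of the data-level KD pipeline; `teacher_nmt` is a hypothetical
# stand-in for a trained full-sentence NMT model.

def build_kd_corpus(corpus, teacher_nmt):
    """Replace each reference translation with the full-sentence NMT
    teacher's output, keeping the pairing with the source speech."""
    kd_corpus = []
    for speech, src_text, _ref in corpus:
        teacher_hyp = teacher_nmt(src_text)  # e.g. beam-search translation
        kd_corpus.append((speech, teacher_hyp))
    return kd_corpus

# Toy stand-ins to make the sketch runnable
toy_teacher = lambda s: s.upper()  # pretend "translation"
corpus = [("speech_001.wav", "hello world", "hola mundo")]
print(build_kd_corpus(corpus, toy_teacher))
```

The student then trains on the (speech, teacher output) pairs; the intuition is that teacher outputs are simpler and more monotonic than human references, which eases simultaneous end-to-end training.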
- Comparison with cascaded models: the authors implement a cascaded simultaneous speech to text translation pipeline and compare the accuracy of SimulSpeech with it under the same translation latency.
- From the BLEU scores in Row 2 and Row 3, it can be seen that the translation accuracy with different wait-k can be boosted by adding auxiliary task to naive simultaneous speech to text translation model.
- Sperber et al. (2019) proposed an attention-passing model for end-to-end speech to text translation and achieved accuracy comparable to the cascaded models.
- Ma et al. (2018) introduced a very simple but effective wait-k strategy for simultaneous NMT based on a prefix-to-prefix framework, which predicts the target word conditioned on the partial source sequence the model has seen, instead of the full source sequence.
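The wait-k policy is purely a read/write schedule: read k source units first, then alternate one write with one read until the source is exhausted, then write out the rest. A small sketch:

```python
def wait_k_schedule(k, src_len, tgt_len):
    """Return (action, step) pairs for the wait-k policy: read k source
    units first, then alternate WRITE/READ until the source runs out,
    after which only WRITE remains."""
    read, written, actions = 0, 0, []
    while written < tgt_len:
        if read < min(k + written, src_len):
            read += 1
            actions.append(("READ", read))
        else:
            written += 1
            actions.append(("WRITE", written))
    return actions

acts = wait_k_schedule(k=2, src_len=4, tgt_len=4)
print([a for a, _ in acts])
# ['READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'WRITE']
```

Setting k = ∞ (or k ≥ src_len) recovers full-sentence translation, which is why the FS baseline in Table 2 can be described as training with k = ∞.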
- Early works on speech to text translation relied on a two-stage method with cascaded ASR and NMT models.
- The authors developed SimulSpeech, an end-to-end simultaneous speech to text translation system that directly translates source speech into target text concurrently.
- SimulSpeech consists of a speech encoder, a speech segmenter, and a text decoder with wait-k strategy for simultaneous translation.
- Table 1: The number of sentences and the duration of speech in each dataset
- Table 2: The BLEU scores of SimulSpeech on the test sets of the MuST-C En→Es and En→De datasets. FS denotes full-sentence training (k = ∞)
- Table 3: The comparison between the two-stage cascaded method and SimulSpeech under different wait-k on the En→Es dataset
- Table 4: The ablation studies on the En→Es dataset. The baseline model (Naive S2T) is the naive simultaneous speech to text translation model with the wait-k policy; we gradually add our techniques to it to evaluate their effectiveness
- Table 5: The BLEU scores of SimulSpeech on En→Es using our speech segmentation method and ground-truth segmentation
6.1 Speech to Text Translation
Speech to text translation has recently been a hot research topic in the field of artificial intelligence (Berard et al., 2016; Weiss et al., 2017; Liu et al., 2019). Early works on speech to text translation relied on a two-stage method with cascaded ASR and NMT models. Berard et al. (2016) proposed an end-to-end speech to text translation system, which does not leverage source language text during training or inference. Weiss et al. (2017) further leveraged an auxiliary ASR model sharing an encoder with the speech to text model, regarding it as a multi-task problem. Vila et al. (2018) applied the Transformer (Vaswani et al., 2017b) architecture to this task and achieved good accuracy. Bansal et al. (2018) explored speech to text translation in the low-resource setting where both data and computation are limited. Sperber et al. (2019) proposed a novel attention-passing model for end-to-end speech to text translation and achieved accuracy comparable to the cascaded models.
- This work was supported in part by the National Key R&D Program of China (Grant No. 2018AAA0100603), the Zhejiang Natural Science Foundation (LR19F020006), and the National Natural Science Foundation of China (Grant Nos. 61836002, U1611461, and 61751209)
- This work was also partially funded by Microsoft Research Asia
- Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Low-resource speech-to-text translation. arXiv preprint arXiv:1803.09164.
- Alexandre Berard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.
- Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.
- Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation. arXiv preprint arXiv:1806.03661.
- Mattia Antonino Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In NAACL-HLT, Minneapolis, MN, USA.
- Christian Fügen, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine Translation, 21(4):209–252.
- Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In ICML, pages 1243–1252. JMLR.org.
- Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM.
- Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor OK Li. 2016. Learning to translate in realtime with neural machine translation. arXiv preprint arXiv:1610.00388.
- Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
- Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-end speech translation with knowledge distillation. arXiv preprint arXiv:1904.08075.
- Mingbo Ma, Liang Huang, Hao Xiong, Kaibo Liu, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, and Haifeng Wang. 2018. STACL: Simultaneous translation with integrated anticipation and controllable latency. arXiv preprint arXiv:1810.08398.
- Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In ACL, pages 551–556.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318.
- Kanishka Rao, Hasim Sak, and Rohit Prabhavalkar. 2017. Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer. In ASRU, pages 193–199. IEEE.
- Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Fastspeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
- Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. In ICASSP, pages 4779–4783. IEEE.
- Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. TACL, 7:313–325.
- Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and TieYan Liu. 2019. Multilingual neural machine translation with knowledge distillation. arXiv preprint arXiv:1902.10461.
- Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. In NIPS 2017, pages 6000–6010.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. In NIPS, pages 5998–6008.
- Laura Cross Vila, Carlos Escolano, Jose AR Fonollosa, and Marta R Costa-jussa. 2018. End-to-end speech translation with the transformer. In IberSPEECH, pages 60–63.
- Ron J Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. arXiv preprint arXiv:1703.08581.
- Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019. Simultaneous translation with flexible policy via restricted imitation learning. arXiv preprint arXiv:1906.01135.
- Chunting Zhou, Graham Neubig, and Jiatao Gu. 2019. Understanding knowledge distillation in nonautoregressive machine translation. arXiv preprint arXiv:1911.02727.