SimulSpeech: End-to-End Simultaneous Speech to Text Translation

ACL 2020, pp. 3787–3796


Abstract

In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in a source language to text in a target language concurrently. SimulSpeech consists of a speech encoder, a speech segmenter and a text decoder, where 1) the segmenter builds upon the encoder and leverages a …

Introduction
  • The authors develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in a source language to text in a target language concurrently.
  • Previous works (Berard et al., 2016; Weiss et al., 2017; Liu et al., 2019) on speech to text translation focus on full-sentence translation, where the full source speech can be seen when predicting each target token.
  • As shown in Figure 2b, the authors introduce the CTC loss for the training of the speech segmenter, and attention-level and data-level knowledge distillation for the training of the overall SimulSpeech model (a minimal segmenter sketch follows this list).
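
The page gives no code, but the CTC-trained segmenter and its streaming decoding are standard enough to sketch. Below is a minimal PyTorch sketch under stated assumptions: the tensor shapes, `hidden_dim`, `num_labels`, and the `segmenter_head` layer are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes; the paper excerpt does not specify them.
hidden_dim, num_labels = 256, 1000   # num_labels: segmenter label vocabulary

# The segmenter builds on the (masked) speech encoder output and predicts a
# per-frame label distribution; CTC handles the unknown frame-to-label alignment.
segmenter_head = nn.Linear(hidden_dim, num_labels + 1)   # +1 for the CTC blank
ctc = nn.CTCLoss(blank=num_labels, zero_infinity=True)

def segmenter_ctc_loss(encoder_out, targets, input_lens, target_lens):
    """encoder_out: (T, B, H) frame features; targets: (B, S) label ids."""
    log_probs = segmenter_head(encoder_out).log_softmax(dim=-1)  # (T, B, L+1)
    return ctc(log_probs, targets, input_lens, target_lens)

def best_path_boundaries(encoder_out):
    """Greedy (best path) CTC decoding: each frame's decision uses only the
    frames seen so far, matching the masked self-attention in the encoder.
    A boundary is flagged whenever a frame emits a non-blank label (a full
    CTC decoder would additionally collapse repeated labels)."""
    log_probs = segmenter_head(encoder_out).log_softmax(dim=-1)
    labels = log_probs.argmax(dim=-1)    # (T, B) per-frame best labels
    return labels != num_labels          # True where a non-blank label fires
```
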
Highlights
  • In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in a source language to text in a target language concurrently
  • To better train the SimulSpeech model, we propose a novel attention-level knowledge distillation that is specially designed for speech to text translation, as well as data-level knowledge distillation
  • All the sentences are first tokenized with the Moses tokenizer and segmented into subword symbols using Byte Pair Encoding (BPE) (Sennrich et al., 2016), except for the labels used to train the speech segmenter
  • We extend the average proportion and average latency metrics, which are originally calculated on word sequences, to speech sequences for the simultaneous speech to text translation task (see the metric sketch after this list)
  • The accuracy of the SimulSpeech model is always better than that of test-time wait-k, which demonstrates the effectiveness of SimulSpeech
  • We further introduce several techniques, including data-level and attention-level knowledge distillation, to boost the accuracy of SimulSpeech
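
As a concrete reference for the latency metrics mentioned above, here is a small Python sketch of average proportion (AP) (Cho and Esipova, 2016) and the average lagging (AL) latency of Ma et al. (2018), written over a generic delay sequence g; treating g as counts of speech frames rather than source words is the extension the paper describes. The function names and the toy example are ours, not the paper's.

```python
def average_proportion(g, src_len, tgt_len):
    """AP (Cho & Esipova, 2016): mean fraction of the source consumed per
    target token. g[i] = number of source units (words, or speech frames)
    read when the (i+1)-th target token is emitted."""
    return sum(g) / (src_len * tgt_len)

def average_lagging(g, src_len, tgt_len):
    """AL (Ma et al., 2018): how many source units the system lags behind an
    ideal fully-simultaneous translator, averaged up to the first step tau
    at which the whole source has been read."""
    lam = tgt_len / src_len                                   # length ratio
    tau = next(i for i, gi in enumerate(g) if gi >= src_len) + 1
    return sum(g[i] - i / lam for i in range(tau)) / tau

# Toy wait-2 run: 6 source units, 6 target tokens, g[i] = min(i + k, 6), k = 2.
g = [2, 3, 4, 5, 6, 6]
print(average_proportion(g, 6, 6))   # 0.722...
print(average_lagging(g, 6, 6))      # 2.0, i.e. AL = k for wait-k when lam = 1
```
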
Results
  • The authors use best path decoding (Graves et al., 2006) to decide the word boundary without seeing subsequent speech frames, which is consistent with the masked self-attention in the speech encoder, i.e., the output of the segmenter for position i depends only on the inputs at positions preceding i (cf. the segmenter sketch above).
  • In order to obtain the attention weights of simultaneous ASR and NMT, the authors add auxiliary simultaneous ASR and NMT tasks which share the same encoder or decoder with the SimulSpeech model respectively, as shown in Figure 2b.
  • To better train the SimulSpeech model, the authors propose a novel attention-level knowledge distillation that is specially designed for speech to text translation, as well as data-level knowledge distillation.
  • The authors first train a full-sentence NMT teacher model and use it to generate target text from the source text that is paired with the source speech x.
  • The authors then train the student SimulSpeech model with the generated target text, which is paired with the source speech x (see the distillation sketch after this list).
  • To compare with cascaded models, the authors implement the cascaded simultaneous speech to text translation pipeline and compare the accuracy of SimulSpeech with it under the same translation latency.
  • From the BLEU scores in Row 2 and Row 3, it can be seen that the translation accuracy with different wait-k can be boosted by adding an auxiliary task to the naive simultaneous speech to text translation model.
  • Ma et al. (2018) introduced a simple but effective wait-k strategy for simultaneous NMT based on a prefix-to-prefix framework, which predicts the target word conditioned on the partial source sequence the model has seen, instead of the full source sequence.
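
The data-level knowledge distillation described in the items above reduces to a simple data pipeline in the style of Kim and Rush (2016): a full-sentence NMT teacher re-translates the source-side text, and the student trains on the teacher's outputs in place of the original references. A minimal sketch, where `teacher_nmt.translate` and the (speech, transcript, target) triples are hypothetical interfaces rather than the authors' code:

```python
def build_distilled_dataset(dataset, teacher_nmt):
    """Data-level knowledge distillation: replace each reference translation
    with the full-sentence NMT teacher's output, keeping the pairing with
    the source speech intact. `dataset` yields hypothetical
    (source_speech, source_transcript, target_text) triples."""
    distilled = []
    for speech, transcript, _ in dataset:
        teacher_target = teacher_nmt.translate(transcript)  # distilled target
        distilled.append((speech, transcript, teacher_target))
    return distilled

# The student SimulSpeech model is then trained on the distilled triples
# exactly as on the original data; teacher outputs are typically simpler and
# more regular than human references, which is the usual motivation for
# sequence-level distillation.
```
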
Conclusion
  • Previous works on speech to text translation rely on a two-stage method with cascaded ASR and NMT models.
  • The authors developed SimulSpeech, an end-to-end simultaneous speech to text translation system that directly translates source speech into target text concurrently.
  • SimulSpeech consists of a speech encoder, a speech segmenter, and a text decoder with a wait-k strategy for simultaneous translation (a decoding sketch follows this list).
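
For concreteness, the following is a small sketch of the wait-k prefix-to-prefix policy the decoder follows (Ma et al., 2018): read k source segments first, then alternate one write per read until the source ends. The `read_segment` and `write_token` callables are placeholders for the segmenter-driven input stream and the decoder step; they are not part of the paper.

```python
def wait_k_decode(k, read_segment, write_token):
    """Prefix-to-prefix wait-k decoding (Ma et al., 2018).
    `read_segment()` returns the next detected source segment, or None once
    the stream ends; `write_token(prefix)` returns the next target token
    given the (source, target) prefix, or None at end-of-sentence."""
    source, target = [], []
    finished_reading = False
    while True:
        if not finished_reading and len(source) < len(target) + k:
            seg = read_segment()             # READ action
            if seg is None:
                finished_reading = True      # source exhausted: finish with
            else:                            # full-sentence-style decoding
                source.append(seg)
                continue
        tok = write_token((source, target))  # WRITE action on the prefix
        if tok is None:
            return target                    # end of sentence
        target.append(tok)
```
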
Tables
  • Table 1: The number of sentences and the duration of speech data in the datasets
  • Table 2: The BLEU scores of SimulSpeech on the test sets of the MuST-C En→Es and En→De datasets. FS denotes full-sentence training (k = ∞)
  • Table 3: The comparison between the two-stage cascaded method and SimulSpeech under different wait-k on the En→Es dataset
  • Table 4: The ablation studies on the En→Es dataset. The baseline model (Naive S2T) is the naive simultaneous speech to text translation model with the wait-k policy. We gradually add our techniques to it to evaluate their effectiveness
  • Table 5: The BLEU scores of SimulSpeech on En→Es using our speech segmentation method and the ground-truth segmentation
Related work

    Speech to Text Translation

    Speech to text translation has recently been a hot research topic in the field of artificial intelligence (Berard et al., 2016; Weiss et al., 2017; Liu et al., 2019). Early works on speech to text translation rely on a two-stage method with cascaded ASR and NMT models. Berard et al. (2016) proposed an end-to-end speech to text translation system, which does not leverage source language text during training or inference. Weiss et al. (2017) further leveraged an auxiliary ASR model with a shared encoder with the speech to text model, regarding it as a multi-task problem. Vila et al. (2018) applied the Transformer (Vaswani et al., 2017b) architecture to this task and achieved good accuracy. Bansal et al. (2018) explored speech to text translation in the low-resource setting where both data and computation are limited. Sperber et al. (2019) proposed a novel attention-passing model for end-to-end speech to text translation and achieved comparable accuracy to the cascaded models.
Funding
  • This work was supported in part by the National Key R&D Program of China (Grant No. 2018AAA0100603), the Zhejiang Natural Science Foundation (LR19F020006), and the National Natural Science Foundation of China (Grant Nos. 61836002, U1611461, and 61751209)
  • This work was also partially funded by Microsoft Research Asia
References
  • Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Low-resource speech-to-text translation. arXiv preprint arXiv:1803.09164.
  • Alexandre Berard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.
  • Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.
  • Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation. arXiv preprint arXiv:1806.03661.
  • Mattia Antonino Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a multilingual speech translation corpus. In NAACL-HLT, Minneapolis, MN, USA.
  • Christian Fügen, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine Translation, 21(4):209–252.
  • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In ICML, pages 1243–1252. JMLR.org.
  • Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM.
  • Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O. K. Li. 2016. Learning to translate in real-time with neural machine translation. arXiv preprint arXiv:1610.00388.
  • Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
  • Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-end speech translation with knowledge distillation. arXiv preprint arXiv:1904.08075.
  • Mingbo Ma, Liang Huang, Hao Xiong, Kaibo Liu, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, and Haifeng Wang. 2018. STACL: Simultaneous translation with integrated anticipation and controllable latency. arXiv preprint arXiv:1810.08398.
  • Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In ACL, pages 551–556.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318.
  • Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar. 2017. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In ASRU, pages 193–199. IEEE.
  • Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
  • Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP, pages 4779–4783. IEEE.
  • Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. TACL, 7:313–325.
  • Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. arXiv preprint arXiv:1902.10461.
  • Ashish Vaswani, Samy Bengio, Eugene Brevdo, François Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. CoRR, abs/1803.07416.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. In NIPS 2017, pages 6000–6010.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. In NIPS, pages 5998–6008.
  • Laura Cross Vila, Carlos Escolano, José A. R. Fonollosa, and Marta R. Costa-jussà. 2018. End-to-end speech translation with the Transformer. In IberSPEECH, pages 60–63.
  • Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. arXiv preprint arXiv:1703.08581.
  • Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019. Simultaneous translation with flexible policy via restricted imitation learning. arXiv preprint arXiv:1906.01135.
  • Chunting Zhou, Graham Neubig, and Jiatao Gu. 2019. Understanding knowledge distillation in non-autoregressive machine translation. arXiv preprint arXiv:1911.02727.