UWSpeech: Speech to Speech Translation for Unwritten Languages
Existing speech to speech translation systems heavily rely on the text of the target language: they usually translate the source language either to target text and then synthesize target speech from that text, or directly to target speech with target text for auxiliary training. However, those methods cannot be applied to unwritten target languages, ...
- Speech to speech translation (Lavie et al. 1997; Nakamura et al. 2006; Wahlster 2013; Jia et al. 2019) is important for understanding cross-lingual spoken conversations and lectures, and has been used in scenarios such as international travel and conferences.
- The authors focus on the most difficult setting: speech to speech translation for unwritten languages
- In this way, the authors cannot leverage any source or target text in auxiliary tasks, as is done in Jia et al. (2019).
- The authors' method can also be applied to written target languages whose text or phonetic transcripts are not available in the training data
- We develop UWSpeech (UW is short for UnWritten), a translation system for unwritten languages with three key components: 1) a converter that transforms unwritten target speech into discrete tokens, 2) a translator that translates source-language speech into target-language discrete tokens, and 3) an inverter that converts the translated discrete tokens back to unwritten target speech
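The converter-translator-inverter pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' API: all function names are hypothetical placeholders, and in UWSpeech each stage is a neural network rather than a toy function.

```python
# Minimal sketch of the UWSpeech three-stage pipeline.
# All names are hypothetical placeholders, not the authors' implementation.

def translate_speech(source_speech, translator, inverter):
    """Source-language speech -> target-language speech via discrete tokens.

    The converter is needed at training time, where it defines the discrete
    token vocabulary that the translator is trained to emit; at inference,
    only the translator and inverter are composed.
    """
    target_tokens = translator(source_speech)   # speech -> discrete tokens
    return inverter(target_tokens)              # discrete tokens -> speech

# Toy stand-ins that make the sketch runnable:
toy_translator = lambda speech: list(range(len(speech)))  # fake token ids
toy_inverter = lambda tokens: [float(t) for t in tokens]  # fake waveform

print(translate_speech(["frame1", "frame2"], toy_translator, toy_inverter))
# -> [0.0, 1.0]
```

The key design point is that the discrete tokens act as a pivot that replaces the target-language text a conventional system would use.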
- We develop UWSpeech, a speech to speech translation system for unwritten languages, and design a novel XL-VAE to jointly train the converter and inverter in UWSpeech for discrete speech representations
- We report the experiment results of UWSpeech
- We developed UWSpeech, a speech to speech translation system for unwritten target languages, and designed XL-VAE, an enhanced version of vector quantized variational autoencoder (VQ-VAE) based on cross-lingual speech recognition, to jointly train the converter and inverter to discretize and reconstruct the unwritten speech in UWSpeech
- Experiments on the Fisher Spanish-English dataset demonstrate that UWSpeech equipped with XL-VAE achieves significant improvements in translation accuracy over the direct translation and VQ-VAE baselines
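XL-VAE inherits its discretization from VQ-VAE, enhancing it with cross-lingual speech recognition supervision. The plain VQ step that both share, a nearest-neighbour lookup of encoder outputs in a learned codebook, can be sketched as below. This is a simplified view only: the codebook and frame values here are random placeholders, and XL-VAE's cross-lingual training objective is not shown.

```python
import numpy as np

# Core vector-quantization step shared by VQ-VAE and XL-VAE (simplified;
# codebook entries would normally be learned, not random).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 discrete tokens, 4-dim code vectors
frames = rng.normal(size=(5, 4))     # encoder outputs for 5 speech frames

# Quantize: replace each frame by the index of its closest code vector.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)        # discrete token sequence, shape (5,)
reconstruct = codebook[tokens]       # what the inverter decodes back to speech
print(tokens)
```

The translator then only has to predict these token indices, turning speech-to-speech translation into a sequence-to-sequence problem over a small discrete vocabulary.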
- Method Analyses
In this subsection, the authors conduct some experimental analyses on the proposed UWSpeech.
- Analyses of Written Languages in XL-VAE. The authors study the influence of the written languages used in XL-VAE on translation accuracy, mainly from two perspectives: 1) the data amount of the written languages, and 2) the similarity between the written and unwritten languages
- Comparing setting #4 with #3, the authors can find that further adding other languages (French and Chinese) to increase the total data amount can improve translation accuracy.
- The authors first introduce the experimental setup and report the results of UWSpeech for speech to speech translation.
- The authors further conduct some analyses of UWSpeech.
- The authors apply UWSpeech to text to speech translation and speech to text translation settings.
- The authors report the experiment results of UWSpeech.
- The authors compare UWSpeech mainly with two baselines: 1) Direct Translation, which directly translates the source speech into target speech with an encoder-attention-decoder model, without any text as auxiliary training data or pivots; and 2) Discretization with VQ-VAE, which follows the translation pipeline in UWSpeech but replaces XL-VAE with the original VQ-VAE for speech discretization
- Footnotes: 4) https://github.com/tensorflow/tensor2tensor 5) https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl 6) https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
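Translation quality in these comparisons is measured with BLEU, computed in practice with the multi-bleu.perl script cited in the footnotes. For intuition, a minimal unsmoothed corpus-level BLEU (illustrative only, not the authors' evaluation script) can be sketched as:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: clipped n-gram precisions + brevity penalty."""
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            match[n - 1] += sum((h & r).values())  # clipped match counts
            total[n - 1] += sum(h.values())
    if 0 in match:          # no smoothing: any empty precision zeroes BLEU
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

hyp = ["the cat sat on the mat".split()]
ref = ["the cat sat on the mat".split()]
print(round(bleu(hyp, ref) * 100, 1))  # perfect match -> 100.0
```

Real evaluation scripts differ in tokenization and smoothing details, which is why the paper fixes a single script (multi-bleu.perl) for all systems.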
- The authors developed UWSpeech, a speech to speech translation system for unwritten target languages, and designed XL-VAE, an enhanced version of VQ-VAE based on cross-lingual speech recognition, to jointly train the converter and inverter to discretize and reconstruct the unwritten speech in UWSpeech.
- The authors will enhance XL-VAE with domain adversarial training to better transfer the speech recognition ability from written languages to unwritten languages.
- Going beyond the proof-of-concept experiments in this work, the authors will apply UWSpeech to truly unwritten languages for speech to speech translation
- Table1: The BLEU scores of Spanish to English speech to speech translation, where English is taken as the unwritten language
- Table2: The BLEU scores of English to Spanish speech to speech translation, where Spanish is taken as the unwritten language
- Table3: The BLEU scores of Spanish to English speech to speech translation, combined with multi-task training in different ways
- Table4: The BLEU scores of Spanish to English speech to speech translation with different written languages as well as different data amounts for XL-VAE. We denote German as De, French as Fr and Chinese as Zh
- Table5: The BLEU scores of Spanish to English translation with different discrete token embedding sizes and downsampling ratios
- Table6: Some translation cases in Spanish to English speech to speech translation
- Table7: The BLEU scores of the text to speech and speech to text settings on Spanish to English translation, where English and Spanish are taken as the unwritten language in the text to speech and speech to text settings, respectively
- This work was supported by the Key Project of National Science Foundation of Zhejiang Province (No LZ19F020002)
- This work was also partially funded by Microsoft Research Asia
- Adams, O. 2017. Automatic understanding of unwritten languages. Ph.D. thesis.
- Association, I. P.; Staff, I. P. A.; et al. 1999. Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press.
- Baevski, A.; Schneider, S.; and Auli, M. 2019. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. arXiv preprint arXiv:1910.05453.
- Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Berard, A.; Pietquin, O.; Servan, C.; and Besacier, L. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.
- Chen, H.; Leung, C.-C.; Xie, L.; Ma, B.; and Li, H. 2015. Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. In Sixteenth Annual Conference of the International Speech Communication Association.
- Chorowski, J.; Weiss, R. J.; Bengio, S.; and van den Oord, A. 2019. Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM transactions on audio, speech, and language processing 27(12): 2041–2053.
- Dumoulin, V.; and Visin, F. 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.
- Dunbar, E.; Algayres, R.; Karadayi, J.; Bernard, M.; Benjumea, J.; Cao, X.-N.; Miskic, L.; Dugrain, C.; Ondel, L.; Black, A. W.; et al. 2019. The Zero Resource Speech Challenge 2019: TTS Without T. Interspeech 2019. doi: 10.21437/interspeech.2019-2904. URL http://dx.doi.org/10.21437/interspeech.2019-2904.
- Dunbar, E.; Cao, X. N.; Benjumea, J.; Karadayi, J.; Bernard, M.; Besacier, L.; Anguera, X.; and Dupoux, E. 2017. The zero resource speech challenge 2017. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 323–330. IEEE.
- Duong, L.; Anastasopoulos, A.; Chiang, D.; Bird, S.; and Cohn, T. 2016. An attentional model for speech translation without transcription. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 949–959.
- Eloff, R.; Nortje, A.; van Niekerk, B.; Govender, A.; Nortje, L.; Pretorius, A.; van Biljon, E.; van der Westhuizen, E.; van Staden, L.; and Kamper, H. 2019. Unsupervised acoustic unit discovery for speech synthesis using discrete latentvariable neural networks.
- Godard, P.; Adda, G.; Adda-Decker, M.; Benjumea, J.; Besacier, L.; Cooper-Leavitt, J.; Kouarata, G.-N.; Lamel, L.; Maynard, H.; Muller, M.; et al. 2017. A very low resource language speech corpus for computational language documentation experiments. arXiv preprint arXiv:1710.03501.
- Graves, A.; Fernandez, S.; Gomez, F.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, 369–376. ACM.
- Griffin, D.; and Lim, J. 1984. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2): 236–243.
- Bu, H.; Du, J.; Na, X.; Wu, B.; and Zheng, H. 2017. AIShell-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline. In Oriental COCOSDA 2017.
- Jia, Y.; Weiss, R. J.; Biadsy, F.; Macherey, W.; Johnson, M.; Chen, Z.; and Wu, Y. 2019. Direct speech-to-speech translation with a sequence-to-sequence model.
- Kamper, H.; Livescu, K.; and Goldwater, S. 2017. An embedded segmental k-means model for unsupervised segmentation and clustering of speech. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 719–726. IEEE.
- Kuhl, P. K.; Conboy, B. T.; Coffey-Corina, S.; Padden, D.; Rivera-Gaxiola, M.; and Nelson, T. 2008. Phonetic learning as a pathway to language: new data and native language magnet theory expanded (NLM-e). Philosophical Transactions of the Royal Society B: Biological Sciences 363(1493): 979–1000.
- Lample, G.; Conneau, A.; Denoyer, L.; and Ranzato, M. 2018. Unsupervised Machine Translation Using Monolingual Corpora Only.
- Lavie, A.; Waibel, A.; Levin, L.; Finke, M.; Gates, D.; Gavalda, M.; Zeppenfeld, T.; and Zhan, P. 1997. JANUSIII: Speech-to-speech translation in multiple languages. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, 99–102. IEEE.
- Lewis, M. P.; Simons, G. F.; and Fennig, C. D. (eds.). 2015. Ethnologue: Languages of the World. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com.
- Li, X.; Dalmia, S.; Mortensen, D. R.; Li, J.; Black, A. W.; and Metze, F. 2020. Towards zero-shot learning for automatic phonemic transcription. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
- Liu, A. H.; Tu, T.; Lee, H.-y.; and Lee, L.-s. 2019. Towards Unsupervised Speech Recognition and Synthesis with
- Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1412–1421.
- Matusov, E.; Kanthak, S.; and Ney, H. 2005. On the integration of speech recognition and statistical machine translation. In Ninth European Conference on Speech Communication and Technology.
- Muthukumar, P. K.; and Black, A. W. 2014. Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2594–2598. IEEE.
- Nakamura, S.; Markov, K.; Nakaiwa, H.; Kikui, G.-i.; Kawai, H.; Jitsuhiro, T.; Zhang, J.-S.; Yamamoto, H.; Sumita, E.; and Yamamoto, S. 2006. The ATR multilingual speech-to-speech translation system. IEEE Transactions on Audio, Speech, and Language Processing 14(2): 365–376.
- Ney, H. 1999. Speech translation: Coupling of recognition and translation. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), volume 1, 517–520. IEEE.
- Oord, A. v. d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; and Kavukcuoglu, K. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
- Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, 311–318. Association for Computational Linguistics.
- Post, M.; Kumar, G.; Lopez, A.; Karakos, D.; Callison-Burch, C.; and Khudanpur, S. 2013. Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus. In Proc. IWSLT.
- Ren, Y.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; and Liu, T.-Y. 2019. Almost Unsupervised Text to Speech and Automatic Speech Recognition. In International Conference on Machine Learning, 5410–5419.
- Salesky, E.; Sperber, M.; and Black, A. W. 2019. Exploring phoneme-level speech representations for end-to-end speech translation. arXiv preprint arXiv:1906.01199.
- Scharenborg, O.; Besacier, L.; Black, A.; Hasegawa-Johnson, M.; Metze, F.; Neubig, G.; Stuker, S.; Godard, P.; Muller, M.; Ondel, L.; et al. 2020. Speech technology for unwritten languages. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28: 964–975.
- Song, K.; Tan, X.; Qin, T.; Lu, J.; and Liu, T.-Y. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. arXiv preprint arXiv:1905.02450.
- Sperber, M.; Neubig, G.; Niehues, J.; and Waibel, A. 2019. Attention-Passing Models for Robust and Data-Efficient End-to-End Speech Translation. TACL 7: 313–325.
- Tjandra, A.; Sakti, S.; and Nakamura, S. 2019. Speech-to-speech Translation between Untranscribed Unknown Languages. arXiv preprint arXiv:1910.00795.
- Tjandra, A.; Sisman, B.; Zhang, M.; Sakti, S.; Li, H.; and Nakamura, S. 2019. VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019.
- van den Oord, A.; Vinyals, O.; et al. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 6306–6315.
- Vaswani, A.; Bengio, S.; Brevdo, E.; Chollet, F.; Gomez, A.; Gouws, S.; Jones, L.; Kaiser, Ł.; Kalchbrenner, N.; Parmar, N.; et al. 2018. Tensor2Tensor for Neural Machine Translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), 193–199.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS, 5998–6008.
- Versteegh, M.; Anguera, X.; Jansen, A.; and Dupoux, E. 2016. The zero resource speech challenge 2015: Proposed approaches and results. Procedia Computer Science 81: 67– 72.
- Vigliocco, G.; Vinson, D. P.; Lewis, W.; and Garrett, M. F. 2004. Representing the meanings of object and action words: The featural and unitary semantic space hypothesis. Cognitive psychology 48(4): 422–488.
- Vila, L. C.; Escolano, C.; Fonollosa, J. A.; and Costa-jussa, M. R. 2018. End-to-End Speech Translation with the Transformer. In IberSPEECH, 60–63.
- Wahlster, W. 2013. Verbmobil: foundations of speech-to-speech translation. Springer Science & Business Media.
- Weiss, R. J.; Chorowski, J.; Jaitly, N.; Wu, Y.; and Chen, Z. 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech.
- Wilkinson, A.; Zhao, T.; and Black, A. W. 2016. Deriving Phonetic Transcriptions and Discovering Word Segmentations for Speech-to-Speech Translation in Low-Resource Settings. In INTERSPEECH, 3086–3090.
- Wind, J. 1989. The evolutionary history of the human speech organs. Studies in language origins 1: 173–197.
- Yallop, C.; and Fletcher, J. 2007. An introduction to phonetics and phonology.