UWSpeech: Speech to Speech Translation for Unwritten Languages

Abstract

Existing speech to speech translation systems heavily rely on the text of the target language: they usually translate the source language either to target text and then synthesize target speech from that text, or directly to target speech with target text for auxiliary training. However, those methods cannot be applied to unwritten target languages...

Introduction
  • Speech to speech translation (Lavie et al. 1997; Nakamura et al. 2006; Wahlster 2013; Jia et al. 2019) is important for understanding cross-lingual spoken conversations and lectures, and has been used in scenarios such as international travel and conferences.
  • The authors focus on the most difficult setting: speech to speech translation for unwritten languages.
  • In this setting, the authors cannot leverage any source or target text in auxiliary tasks, as in Jia et al. (2019).
  • The authors' method can also be applied to written target languages whose text or phonetic transcripts are not available in the training data.
Highlights
  • Speech to speech translation (Lavie et al. 1997; Nakamura et al. 2006; Wahlster 2013; Jia et al. 2019) is important for understanding cross-lingual spoken conversations and lectures, and has been used in scenarios such as international travel and conferences
  • We develop UWSpeech (UW is short for UnWritten), a translation system for unwritten languages with three key components: 1) a converter that transforms unwritten target speech into discrete tokens, 2) a translator that translates source-language speech into target-language discrete tokens, and 3) an inverter that converts the translated discrete tokens back into unwritten target speech (see the pipeline sketch after this list)
  • We develop UWSpeech, a speech to speech translation system for unwritten languages, and design a novel XL-VAE to jointly train the converter and inverter in UWSpeech for discrete speech representations
  • We report the experimental results of UWSpeech
  • We developed UWSpeech, a speech to speech translation system for unwritten target languages, and designed XL-VAE, an enhanced version of the vector quantized variational autoencoder (VQ-VAE) based on cross-lingual speech recognition, to jointly train the converter and inverter to discretize and reconstruct the unwritten speech in UWSpeech
  • Experiments on the Fisher Spanish-English dataset demonstrate that UWSpeech equipped with XL-VAE achieves significant improvements in translation accuracy over the direct translation and VQ-VAE baselines
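
The three-component pipeline above can be summarized in code. The following is a minimal sketch, not the authors' implementation; the class names and method signatures are hypothetical stand-ins for trained neural models.

```python
# Hypothetical sketch of the UWSpeech converter/translator/inverter pipeline.
from typing import List
import numpy as np


class Converter:
    """Discretizes unwritten target speech into token IDs (trained via XL-VAE)."""
    def encode(self, speech: np.ndarray) -> List[int]:
        raise NotImplementedError  # speech frames -> discrete token IDs


class Translator:
    """Encoder-attention-decoder model: source speech -> target discrete tokens."""
    def translate(self, source_speech: np.ndarray) -> List[int]:
        raise NotImplementedError


class Inverter:
    """Reconstructs target speech from discrete tokens (the XL-VAE decoder)."""
    def decode(self, tokens: List[int]) -> np.ndarray:
        raise NotImplementedError  # token IDs -> speech frames


def translate_speech(source_speech: np.ndarray,
                     translator: Translator,
                     inverter: Inverter) -> np.ndarray:
    """At inference time only the translator and inverter run; the converter
    is used during training to produce discrete targets for the translator."""
    return inverter.decode(translator.translate(source_speech))
```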
Methods
  • Method Analyses: In this subsection, the authors conduct experimental analyses of the proposed UWSpeech.
  • Analyses of Written Languages in XL-VAE: The authors study the influence of the written languages in XL-VAE on translation accuracy from two perspectives: 1) the data amount of the written languages, and 2) the similarity between the written and unwritten languages (a sketch of the XL-VAE training objective follows this list).
  • Comparing setting #4 with #3, the authors find that further adding other languages (French and Chinese) to increase the total data amount improves translation accuracy.
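
To make the role of the written languages concrete, here is a hedged sketch of an XL-VAE-style training step. It assumes, per the paper's description, that the VQ-VAE encoder is additionally trained with a phoneme-recognition loss on written-language speech; the module names, the use of CTC, and the loss weight `alpha` are illustrative assumptions, not the authors' exact formulation.

```python
# Hedged sketch (not the authors' code): VQ-VAE reconstruction on the
# unwritten language, plus an auxiliary cross-lingual phoneme-recognition
# loss (CTC here, an assumption) on written-language speech sharing the
# same encoder, so the codebook absorbs phonetic structure.
import torch
import torch.nn.functional as F

ctc_loss = torch.nn.CTCLoss(blank=0)

def xl_vae_step(encoder, quantizer, decoder, phoneme_head,
                unwritten_speech, written_speech,
                written_phonemes, frame_lens, phoneme_lens,
                alpha=1.0):
    # VQ-VAE path: encode, quantize to the nearest code vectors, reconstruct.
    z = encoder(unwritten_speech)                  # (B, T, D)
    z_q, vq_loss = quantizer(z)                    # codebook lookup + commitment loss
    recon_loss = F.mse_loss(decoder(z_q), unwritten_speech)

    # Cross-lingual recognition path: the same encoder feeds a phoneme
    # classifier supervised on the written languages.
    logits = phoneme_head(encoder(written_speech))  # (T, B, n_phonemes)
    rec_loss = ctc_loss(logits.log_softmax(-1), written_phonemes,
                        frame_lens, phoneme_lens)

    # Total objective: reconstruction + codebook losses + weighted recognition.
    return recon_loss + vq_loss + alpha * rec_loss
```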
Results
  • The authors first introduce the experimental setup and report the results of UWSpeech for speech to speech translation.
  • The authors further conduct some analyses of UWSpeech.
  • The authors apply UWSpeech to text to speech translation and speech to text translation settings.
  • The authors report the experimental results of UWSpeech.
  • The authors compare UWSpeech mainly with two baselines: 1) Direct Translation, which directly translates the source speech into target speech with an encoder-attention-decoder model, without any text as auxiliary training data or pivots; and 2) Discretization with VQ-VAE, which follows the translation pipeline in UWSpeech but replaces XL-VAE with the original VQ-VAE for speech discretization (a minimal sketch of this quantization step follows this list).
  • Toolkit footnotes: 4) https://github.com/tensorflow/tensor2tensor 5) https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl 6) https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
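
The VQ-VAE baseline discretizes speech by nearest-neighbor lookup in a learned codebook (van den Oord et al. 2017). Below is a minimal NumPy sketch of that lookup, illustrative only; a real implementation also trains the codebook with commitment and straight-through gradient terms.

```python
import numpy as np

def vq_discretize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each encoded frame the ID of its nearest codebook vector.

    frames:   (T, D) encoder outputs
    codebook: (K, D) learned code vectors
    returns:  (T,)   discrete token IDs
    """
    # Squared Euclidean distance from every frame to every code vector.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy usage: 5 frames of 8-dim features against a 4-entry codebook.
rng = np.random.default_rng(0)
print(vq_discretize(rng.normal(size=(5, 8)), rng.normal(size=(4, 8))))
```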
Conclusion
  • The authors developed UWSpeech, a speech to speech translation system for unwritten target languages, and designed XL-VAE, an enhanced version of VQ-VAE based on cross-lingual speech recognition, to jointly train the converter and inverter to discretize and reconstruct the unwritten speech in UWSpeech.
  • The authors will enhance XL-VAE with domain adversarial training to better transfer the speech recognition ability from written languages to unwritten languages.
  • Going beyond the proof-of-concept experiments in this work, the authors will apply UWSpeech to truly unwritten languages for speech to speech translation
Tables
  • Table 1: The BLEU scores of Spanish to English speech to speech translation, where English is taken as the unwritten language
  • Table 2: The BLEU scores of English to Spanish speech to speech translation, where Spanish is taken as the unwritten language
  • Table 3: The BLEU scores of Spanish to English speech to speech translation, combined with multi-task training in different ways
  • Table 4: The BLEU scores of Spanish to English speech to speech translation with different written languages as well as different data amounts for XL-VAE. We denote German as De, French as Fr, and Chinese as Zh
  • Table 5: The BLEU scores of Spanish to English translation with different discrete token embedding sizes and downsampling ratios (a sketch of the downsampling trade-off follows this list)
  • Table 6: Some translation cases in Spanish to English speech to speech translation
  • Table 7: The BLEU scores of the text to speech and speech to text settings on Spanish to English translation, where English and Spanish are taken as the unwritten language in the text to speech and speech to text settings, respectively
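
For intuition about the downsampling ratio varied in Table 5: each discrete token summarizes a fixed number of consecutive speech frames, so the ratio trades the length of the translator's target sequence against the granularity of the acoustic units. A small illustrative helper (the ceiling behavior for a partial final window is an assumption):

```python
def num_tokens(num_frames: int, downsampling_ratio: int) -> int:
    """Length of the converter's discrete token sequence: each token
    covers `downsampling_ratio` consecutive frames (ceiling division
    accounts for a partial final window)."""
    return -(-num_frames // downsampling_ratio)

# A 1,000-frame utterance: larger ratios give the translator shorter
# target sequences but coarser acoustic units.
for ratio in (2, 4, 8):
    print(ratio, num_tokens(1000, ratio))
```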
Funding
  • This work was supported by the Key Project of National Science Foundation of Zhejiang Province (No. LZ19F020002)
  • This work was also partially funded by Microsoft Research Asia
References
  • Adams, O. 2017. Automatic understanding of unwritten languages. Ph.D. thesis.
  • Association, I. P.; Staff, I. P. A.; et al. 1999. Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press.
  • Baevski, A.; Schneider, S.; and Auli, M. 2019. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. arXiv preprint arXiv:1910.05453.
  • Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Berard, A.; Pietquin, O.; Servan, C.; and Besacier, L. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.
  • Chen, H.; Leung, C.-C.; Xie, L.; Ma, B.; and Li, H. 2015. Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. In Sixteenth Annual Conference of the International Speech Communication Association.
  • Chorowski, J.; Weiss, R. J.; Bengio, S.; and van den Oord, A. 2019. Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27(12): 2041–2053.
  • Dumoulin, V.; and Visin, F. 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.
  • Dunbar, E.; Algayres, R.; Karadayi, J.; Bernard, M.; Benjumea, J.; Cao, X.-N.; Miskic, L.; Dugrain, C.; Ondel, L.; Black, A. W.; et al. 2019. The Zero Resource Speech Challenge 2019: TTS without T. In Interspeech 2019. doi:10.21437/interspeech.2019-2904.
  • Dunbar, E.; Cao, X. N.; Benjumea, J.; Karadayi, J.; Bernard, M.; Besacier, L.; Anguera, X.; and Dupoux, E. 2017. The zero resource speech challenge 2017. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 323–330. IEEE.
  • Duong, L.; Anastasopoulos, A.; Chiang, D.; Bird, S.; and Cohn, T. 2016. An attentional model for speech translation without transcription. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 949–959.
  • Eloff, R.; Nortje, A.; van Niekerk, B.; Govender, A.; Nortje, L.; Pretorius, A.; van Biljon, E.; van der Westhuizen, E.; van Staden, L.; and Kamper, H. 2019. Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks.
  • Godard, P.; Adda, G.; Adda-Decker, M.; Benjumea, J.; Besacier, L.; Cooper-Leavitt, J.; Kouarata, G.-N.; Lamel, L.; Maynard, H.; Muller, M.; et al. 2017. A very low resource language speech corpus for computational language documentation experiments. arXiv preprint arXiv:1710.03501.
  • Graves, A.; Fernandez, S.; Gomez, F.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, 369–376. ACM.
  • Griffin, D.; and Lim, J. 1984. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2): 236–243.
  • Bu, H.; Du, J.; Na, X.; Wang, B.; and Zheng, H. 2017. AIShell-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline. In Oriental COCOSDA 2017.
  • Jia, Y.; Weiss, R. J.; Biadsy, F.; Macherey, W.; Johnson, M.; Chen, Z.; and Wu, Y. 2019. Direct speech-to-speech translation with a sequence-to-sequence model.
  • Kamper, H.; Livescu, K.; and Goldwater, S. 2017. An embedded segmental k-means model for unsupervised segmentation and clustering of speech. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 719–726. IEEE.
  • Kuhl, P. K.; Conboy, B. T.; Coffey-Corina, S.; Padden, D.; Rivera-Gaxiola, M.; and Nelson, T. 2008. Phonetic learning as a pathway to language: new data and native language magnet theory expanded (NLM-e). Philosophical Transactions of the Royal Society B: Biological Sciences 363(1493): 979–1000.
  • Lample, G.; Conneau, A.; Denoyer, L.; and Ranzato, M. 2018. Unsupervised Machine Translation Using Monolingual Corpora Only.
  • Lavie, A.; Waibel, A.; Levin, L.; Finke, M.; Gates, D.; Gavalda, M.; Zeppenfeld, T.; and Zhan, P. 1997. JANUS-III: Speech-to-speech translation in multiple languages. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, 99–102. IEEE.
  • Lewis, M. P.; Simons, G. F.; and Fennig, C. D. (eds.). 2015. Ethnologue: Languages of the World. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com.
  • Li, X.; Dalmia, S.; Mortensen, D. R.; Li, J.; Black, A. W.; and Metze, F. 2020. Towards zero-shot learning for automatic phonemic transcription. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
  • Liu, A. H.; Tu, T.; Lee, H.-y.; and Lee, L.-s. 2019. Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning.
  • Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1412–1421.
  • Matusov, E.; Kanthak, S.; and Ney, H. 2005. On the integration of speech recognition and statistical machine translation. In Ninth European Conference on Speech Communication and Technology.
  • Muthukumar, P. K.; and Black, A. W. 2014. Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2594–2598. IEEE.
  • Nakamura, S.; Markov, K.; Nakaiwa, H.; Kikui, G.-i.; Kawai, H.; Jitsuhiro, T.; Zhang, J.-S.; Yamamoto, H.; Sumita, E.; and Yamamoto, S. 2006. The ATR multilingual speech-to-speech translation system. IEEE Transactions on Audio, Speech, and Language Processing 14(2): 365–376.
  • Ney, H. 1999. Speech translation: Coupling of recognition and translation. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, 517–520. IEEE.
  • Oord, A. v. d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
  • Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. Association for Computational Linguistics.
  • Post, M.; Kumar, G.; Lopez, A.; Karakos, D.; Callison-Burch, C.; and Khudanpur, S. 2013. Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus. In Proc. IWSLT.
  • Ren, Y.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; and Liu, T.-Y. 2019. Almost Unsupervised Text to Speech and Automatic Speech Recognition. In International Conference on Machine Learning, 5410–5419.
  • Salesky, E.; Sperber, M.; and Black, A. W. 2019. Exploring phoneme-level speech representations for end-to-end speech translation. arXiv preprint arXiv:1906.01199.
  • Scharenborg, O.; Besacier, L.; Black, A.; Hasegawa-Johnson, M.; Metze, F.; Neubig, G.; Stuker, S.; Godard, P.; Muller, M.; Ondel, L.; et al. 2020. Speech technology for unwritten languages. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28: 964–975.
  • Song, K.; Tan, X.; Qin, T.; Lu, J.; and Liu, T.-Y. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. arXiv preprint arXiv:1905.02450.
  • Sperber, M.; Neubig, G.; Niehues, J.; and Waibel, A. 2019. Attention-Passing Models for Robust and Data-Efficient End-to-End Speech Translation. TACL 7: 313–325.
  • Tjandra, A.; Sakti, S.; and Nakamura, S. 2019. Speech-to-speech Translation between Untranscribed Unknown Languages. arXiv preprint arXiv:1910.00795.
  • Tjandra, A.; Sisman, B.; Zhang, M.; Sakti, S.; Li, H.; and Nakamura, S. 2019. VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019.
  • van den Oord, A.; Vinyals, O.; et al. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 6306–6315.
  • Vaswani, A.; Bengio, S.; Brevdo, E.; Chollet, F.; Gomez, A.; Gouws, S.; Jones, L.; Kaiser, Ł.; Kalchbrenner, N.; Parmar, N.; et al. 2018. Tensor2Tensor for Neural Machine Translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), 193–199.
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS, 5998–6008.
  • Versteegh, M.; Anguera, X.; Jansen, A.; and Dupoux, E. 2016. The zero resource speech challenge 2015: Proposed approaches and results. Procedia Computer Science 81: 67–72.
  • Vigliocco, G.; Vinson, D. P.; Lewis, W.; and Garrett, M. F. 2004. Representing the meanings of object and action words: The featural and unitary semantic space hypothesis. Cognitive Psychology 48(4): 422–488.
  • Vila, L. C.; Escolano, C.; Fonollosa, J. A.; and Costa-jussa, M. R. 2018. End-to-End Speech Translation with the Transformer. In IberSPEECH, 60–63.
  • Wahlster, W. 2013. Verbmobil: Foundations of speech-to-speech translation. Springer Science & Business Media.
  • Weiss, R. J.; Chorowski, J.; Jaitly, N.; Wu, Y.; and Chen, Z. 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech.
  • Wilkinson, A.; Zhao, T.; and Black, A. W. 2016. Deriving Phonetic Transcriptions and Discovering Word Segmentations for Speech-to-Speech Translation in Low-Resource Settings. In INTERSPEECH, 3086–3090.
  • Wind, J. 1989. The evolutionary history of the human speech organs. Studies in Language Origins 1: 173–197.
  • Yallop, C.; and Fletcher, J. 2007. An introduction to phonetics and phonology.