Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

NeurIPS 2020

Abstract

Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantages, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner.
Introduction
  • Text-to-Speech (TTS) is the task of generating speech from text, and deep-learning-based TTS models have succeeded in producing natural speech indistinguishable from human speech.
  • Among neural TTS models, autoregressive models such as Tacotron 2 (Shen et al., 2018) and Transformer TTS (Li et al., 2019) show state-of-the-art performance.
  • Based on these autoregressive models, there have been many advances in generating diverse speech, in terms of modelling different speaking styles or various prosodies (Wang et al., 2018; Skerry-Ryan et al., 2018; Jia et al., 2018).
  • However, when the input text includes repeated words, autoregressive TTS models often produce serious attention errors.
Highlights
  • Text-to-Speech (TTS) is the task of generating speech from text, and deep-learning-based TTS models have succeeded in producing natural speech indistinguishable from human speech
  • In order to eliminate any dependency on other networks, we introduce Monotonic Alignment Search (MAS), a novel method to search for the most probable monotonic alignment using only the text and the latent representation of speech (a minimal sketch of the underlying dynamic program follows this list)
  • We measure the mean opinion score (MOS) to compare the quality of all audio samples, including ground truth (GT) and our synthesized samples, via Amazon Mechanical Turk (AMT); the results are shown in Table 1
  • We propose Glow-TTS, a new type of parallel TTS model, which provides fast and high-quality speech synthesis
  • Glow-TTS is a flow-based generative model that is trained directly with maximum likelihood estimation and generates a mel-spectrogram from text in parallel
  • We present Glow-TTS as an alternative to existing TTS models
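A minimal NumPy sketch of the dynamic program behind MAS, written from the description above rather than taken from the released code: it assumes the prior log-likelihoods log N(z_j; mu_i, sigma_i) have already been computed into a matrix log_p of shape (text_len, mel_len), and all names are illustrative.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Most probable monotonic alignment between text states and mel frames,
    given log_p[i, j] = log N(z_j; mu_i, sigma_i).

    Returns an array of length mel_len mapping each mel frame j to a text index i.
    """
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)  # best cumulative log-likelihood ending at (i, j)
    Q[0, 0] = log_p[0, 0]

    # Forward pass: at each frame the alignment either stays on the same
    # text state or advances to the next one (monotonic, no skips).
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = max(stay, advance) + log_p[i, j]

    # Backtracking: recover the path that achieves Q[T_text - 1, T_mel - 1].
    alignment = np.zeros(T_mel, dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        alignment[j] = i
        if i > 0 and j > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return alignment
```

Because the recursion only compares two neighboring cells per entry, the search is cheap enough to run on CPU during training, which is consistent with the per-iteration cost reported under Funding below.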
Methods
  • Glow-TTS was trained for 240K iterations using the Adam optimizer with the same learning rate schedule as in Vaswani et al. (2017); a sketch of the schedule follows this list.
  • This required only 3 days with mixed-precision training on 2 NVIDIA V100 GPUs. To train Glow-TTS in a multi-speaker setting, the authors add a speaker embedding and increase all hidden dimensions of the text encoder and the decoder.
  • The authors trained Glow-TTS for 480K iterations until convergence.
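The learning rate schedule referenced in the first item above is the warmup/inverse-square-root rule of Vaswani et al. (2017). A small sketch of that rule; the d_model and warmup_steps values here are illustrative placeholders, not confirmed settings from the paper.

```python
def noam_lr(step, d_model=192, warmup_steps=4000, base_lr=1.0):
    """Learning rate from Vaswani et al. (2017): linear warmup followed by
    inverse-square-root decay with respect to the training step."""
    step = max(step, 1)
    return base_lr * (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
```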
Results
  • The authors measure the mean opinion score (MOS) to compare the quality of all audio samples, including the ground truth (GT) and the synthesized samples, via Amazon Mechanical Turk (AMT); the results are shown in Table 1.
  • The authors measure the performance of Glow-TTS for various standard deviations (temperatures) of the prior distribution; a temperature of 0.333 shows the best performance (see the sampling sketch after this list).
  • Glow-TTS shows performance comparable to the strong autoregressive baseline, Tacotron 2.
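The "standard deviation of the prior distribution" above acts as a sampling temperature at inference time. A minimal sketch of the idea, assuming the text encoder has produced per-frame prior means mu and log-standard-deviations log_sigma; the names and the exact scaling convention are assumptions, not the released implementation.

```python
import numpy as np

def sample_latent(mu, log_sigma, temperature=0.333, rng=None):
    """Draw a latent z from the prior with its standard deviation scaled by
    `temperature`; lower temperatures yield less variable speech, and 0.333
    gave the best-rated samples according to the summary above."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * temperature * eps
```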
Conclusion
  • The authors propose Glow-TTS, a new type of parallel TTS model, which provides fast and high-quality speech synthesis.
  • The authors demonstrate additional advantages of Glow-TTS, such as controlling the speaking rate or the pitch of synthesized speech, robustness to long utterances, and extensibility to a multi-speaker setting; a duration-scaling sketch for rate control follows this list.
  • Thanks to these advantages, the authors present Glow-TTS as an alternative to existing TTS models.
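In a duration-based parallel model such as Glow-TTS, controlling the speaking rate amounts to scaling the predicted per-token durations before the encoder states are expanded to frame level (pitch variation, by contrast, comes from the prior temperature shown earlier). A hedged sketch of that duration-scaling step; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def expand_states(encoder_states, durations, length_scale=1.0):
    """Repeat each encoder state according to its scaled predicted duration.

    encoder_states: array of shape (num_tokens, hidden_dim)
    durations:      predicted duration (in frames) per token
    length_scale:   > 1 slows speech down, < 1 speeds it up
    """
    scaled = np.maximum(np.round(np.asarray(durations) * length_scale), 1).astype(np.int64)
    return np.repeat(encoder_states, scaled, axis=0)
```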
Tables
  • Table 1: Mean opinion score (MOS) of single-speaker TTS models with 95% confidence intervals
  • Table 2: Mean opinion score (MOS) of multi-speaker TTS models with 95% confidence intervals
  • Table 3: Hyper-parameters of Glow-TTS; the total number of parameters is lower than that of FastSpeech (30.1M)
  • Table 4: Attention error counts of TTS models on 100 test sentences
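For reference, a 95% confidence interval on a mean opinion score is conventionally computed as a normal-approximation interval over the individual listener ratings. The sketch below shows that convention; it is an assumption about methodology, since this summary does not state how the intervals in Tables 1 and 2 were obtained.

```python
import numpy as np

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and half-width of a 95% normal-approximation
    confidence interval over individual listener ratings."""
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    half_width = z * ratings.std(ddof=1) / np.sqrt(len(ratings))
    return mean, half_width
```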
Related work
Funding
  • Even though our method is difficult to parallelize, it runs fast on CPU without the need for GPU execution. In our experiments, it takes less than 20 ms per iteration, which amounts to less than 2% of the total training time.
Study subjects and analysis
samples: 12500
For single-speaker TTS, we train our model on the widely used single-female-speaker dataset LJSpeech (Ito, 2017), which consists of 13100 short audio clips with a total duration of approximately 24 hours. We randomly split the dataset into a training set (12500 samples), a validation set (100 samples), and a test set (500 samples); a minimal split sketch is given at the end of this section. For multi-speaker TTS, we use the train-clean-100 subset of the LibriTTS corpus (Zen et al., 2019), which consists of about 54 hours of audio recorded by 247 speakers.

datasets: 3
For multi-speaker TTS, we use the train-clean-100 subset of the LibriTTS corpus (Zen et al., 2019), which consists of about 54 hours of audio recorded by 247 speakers. We first trim the leading and trailing silence of all the audio clips in the data, then filter out all samples whose text length exceeds 190, and split the rest into three sets for training (29181 samples), validation (88 samples), and test (442 samples). Additionally, we collect out-of-distribution text data for the robustness test.
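A minimal sketch of the LJSpeech split described above; the random seed and function name are assumptions, as the text only specifies the 12500/100/500 split sizes.

```python
import random

def split_ljspeech(clip_ids, seed=0):
    """Randomly split the 13100 LJSpeech clip IDs into train / validation / test
    sets of 12500 / 100 / 500 samples."""
    assert len(clip_ids) == 13100, "LJSpeech contains 13100 short clips"
    rng = random.Random(seed)
    shuffled = list(clip_ids)
    rng.shuffle(shuffled)
    return shuffled[:12500], shuffled[12500:12600], shuffled[12600:]
```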

Reference
  • Battenberg, E., Skerry-Ryan, R., Mariooryad, S., Stanton, D., Kao, D., Shannon, M., and Bagby, T. Location-relative attention mechanisms for robust long-form speech synthesis. arXiv preprint arXiv:1910.10288, 2019.
  • Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
  • Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. In Advances in Neural Information Processing Systems, pp. 7509–7520, 2019.
  • Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., Raiman, J., and Zhou, Y. Deep Voice 2: Multi-speaker neural text-to-speech. In Advances in Neural Information Processing Systems, pp. 2962–2970, 2017.
  • Gu, J., Bradbury, J., Xiong, C., Li, V. O., and Socher, R. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.
  • Hoogeboom, E., Berg, R. v. d., and Welling, M. Emerging convolutions for generative normalizing flows. arXiv preprint arXiv:1901.11137, 2019.
  • Ito, K. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
  • Jia, Y., Zhang, Y., Weiss, R., Wang, Q., Shen, J., Ren, F., Nguyen, P., Pang, R., Moreno, I. L., Wu, Y., et al. Transfer learning from speaker verification to multi-speaker text-to-speech synthesis. In Advances in Neural Information Processing Systems, pp. 4480–4490, 2018.
  • Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., Oord, A. v. d., Dieleman, S., and Kavukcuoglu, K. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435, 2018.
  • Kim, S., Lee, S.-g., Song, J., Kim, J., and Yoon, S. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
  • Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.
  • Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.
  • Li, N., Liu, S., Liu, Y., Zhao, S., and Liu, M. Neural speech synthesis with Transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6706–6713, 2019.
  • Ma, M., Zheng, B., Liu, K., Zheng, R., Liu, H., Peng, K., Church, K., and Huang, L. Incremental text-to-speech synthesis with prefix-to-prefix framework. arXiv preprint arXiv:1911.02750, 2019.
  • Miao, C., Liang, S., Chen, M., Ma, J., Wang, S., and Xiao, J. Flow-TTS: A non-autoregressive network for text to speech based on flow. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7209–7213. IEEE, 2020.
  • Oord, A. v. d., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017.
  • Peng, K., Ping, W., Song, Z., and Zhao, K. Parallel neural text-to-speech. arXiv preprint arXiv:1905.08459, 2019.
  • Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., and Miller, J. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654, 2017.
  • Prenger, R., Valle, R., and Catanzaro, B. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. IEEE, 2019.
  • Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. FastSpeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pp. 3165–3174, 2019.
  • Serra, J., Pascual, S., and Perales, C. S. Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion. In Advances in Neural Information Processing Systems, pp. 6790–6800, 2019.
  • Shaw, P., Uszkoreit, J., and Vaswani, A. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
  • Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. IEEE, 2018.
  • Skerry-Ryan, R., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., Shor, J., Weiss, R. J., Clark, R., and Saurous, R. A. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. arXiv preprint arXiv:1803.09047, 2018.
  • Valle, R. Tacotron 2. https://github.com/NVIDIA/tacotron2, 2018.
  • Valle, R. WaveGlow. https://github.com/NVIDIA/waveglow, 2019.
  • Valle, R., Shih, K., Prenger, R., and Catanzaro, B. Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis. arXiv preprint arXiv:2005.05957, 2020.
  • Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. In SSW, pp. 125, 2016.
  • Van Den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G. v. d., Lockhart, E., Cobo, L. C., Stimberg, F., et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
  • Wang, Y., Stanton, D., Zhang, Y., Skerry-Ryan, R., Battenberg, E., Shor, J., Xiao, Y., Ren, F., Jia, Y., and Saurous, R. A. Style Tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017, 2018.
  • Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., and Wu, Y. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882, 2019.