Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
NeurIPS 2020
Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantages, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative …
- Text-to-Speech (TTS) is the task of generating speech from text, and deep-learning-based TTS models have succeeded in producing natural speech nearly indistinguishable from human speech.
- Among neural TTS models, autoregressive models such as Tacotron 2 (Shen et al, 2018) and Transformer TTS (Li et al, 2019) show state-of-the-art performance.
- Based on these autoregressive models, there have been many advances in generating diverse speech in terms of modelling different speaking styles or various prosodies (Wang et al, 2018; Skerry-Ryan et al, 2018; Jia et al, 2018).
- When the input text includes repeated words, autoregressive TTS models often produce serious attention errors.
- In order to eliminate any dependency on other networks, we introduce Monotonic Alignment Search (MAS), a novel method that searches for the most probable monotonic alignment using only the text and the latent representation of speech.
- We measure the mean opinion score (MOS) to compare the quality of all the audios including ground truth (GT), and our synthesized samples via Amazon Mechanical Turk (AMT); the results are shown in Table 1
- We propose Glow-TTS, a new type of parallel TTS model, which provides fast and high quality speech synthesis
- Glow-TTS is a flow-based generative model that is trained directly with maximum likelihood estimation and generates a mel-spectrogram from text in parallel.
- We present Glow-TTS as an alternative to existing TTS models
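The MAS procedure highlighted above can be sketched as a Viterbi-style dynamic program: each latent frame is assigned to a text token so that assignments never move backwards, maximizing the total log-likelihood. The NumPy version below is an illustrative sketch, not the paper's implementation; it assumes the frame-by-token log-likelihood matrix has already been computed from the token-wise Gaussian priors.

```python
import numpy as np

def monotonic_alignment_search(log_probs):
    """Illustrative MAS sketch (not the paper's code).

    log_probs: (T_frames, N_tokens) array; log_probs[j, i] is the
    log-likelihood of latent frame j under token i's prior.
    Returns an int array of length T_frames mapping frames to tokens,
    monotonically non-decreasing from token 0 to token N-1.
    """
    T, N = log_probs.shape
    Q = np.full((T, N), -np.inf)  # best cumulative log-likelihood
    Q[0, 0] = log_probs[0, 0]
    for j in range(1, T):
        # Frame j can be aligned at most to token j (one new token per frame).
        for i in range(min(j + 1, N)):
            stay = Q[j - 1, i]                         # repeat current token
            move = Q[j - 1, i - 1] if i > 0 else -np.inf  # advance one token
            Q[j, i] = max(stay, move) + log_probs[j, i]
    # Backtrack from the last token at the last frame.
    align = np.zeros(T, dtype=int)
    i = N - 1
    for j in range(T - 1, -1, -1):
        align[j] = i
        if j > 0 and i > 0 and Q[j - 1, i - 1] >= Q[j - 1, i]:
            i -= 1
    return align
```

The per-token durations used to train the duration predictor then fall out of the alignment by counting how many frames each token received.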
- Glow-TTS was trained for 240K iterations using the Adam optimizer with the same learning rate schedule as in (Vaswani et al, 2017).
- This required only 3 days with mixed precision training on 2 NVIDIA V100 GPUs. To train Glow-TTS in a multi-speaker setting, the authors add a speaker embedding and increase all hidden dimensions of the text encoder and the decoder.
- The authors trained Glow-TTS for 480K iterations for convergence.
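The learning rate schedule from (Vaswani et al, 2017) referenced above is the "Noam" schedule: linear warmup followed by inverse-square-root decay. A minimal sketch; the hidden size of 192 and warmup of 4000 steps are assumed values for illustration, not figures reported in the summary above.

```python
def noam_lr(step, d_model=192, warmup=4000):
    """Transformer learning-rate schedule (Vaswani et al, 2017):
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    d_model=192 and warmup=4000 are assumptions, not paper values."""
    step = max(step, 1)  # avoid step=0 blowing up the decay branch
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches meet exactly at the warmup step, which is where the learning rate peaks.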
- The authors measure the mean opinion score (MOS) to compare the quality of all the audios including ground truth (GT), and the synthesized samples via Amazon Mechanical Turk (AMT); the results are shown in Table 1.
- The authors measure the performance of Glow-TTS for various standard deviations of the prior distribution; the temperature of 0.333 shows the best performance.
- The authors' Glow-TTS shows comparable performance to the strong autoregressive baseline, Tacotron 2
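A mean opinion score with a 95% confidence interval, as reported in Tables 1 and 2, can be computed from per-rater scores roughly as follows. The normal-approximation interval (mean ± 1.96 · standard error) is an assumption for illustration; the exact CI computation is not described in the summary above.

```python
import math

def mos_with_ci(scores):
    """Mean opinion score and 95% CI half-width from raw rater scores,
    using the normal approximation (an assumption, not the paper's
    documented procedure)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)                # 1.96 ≈ z for 95%
    return mean, half_width
```

A table cell like "4.25 ± 0.49" would then correspond to the returned mean and half-width.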
- The authors propose Glow-TTS, a new type of parallel TTS model, which provides fast and high quality speech synthesis.
- The authors demonstrate additional advantages of Glow-TTS, such as controlling the speaking rate or the pitch of synthesized speech, robustness to long utterances, and extensibility to a multi-speaker setting.
- Thanks to these advantages, the authors present Glow-TTS as an alternative to existing TTS models
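The controllability summarized above comes from the prior: sampling with a temperature scales the prior's standard deviation (0.333 worked best per the results above), and the speaking rate can be changed by scaling the predicted per-token durations. A hedged sketch; the function names and rounding details are assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, sigma, temperature=0.333):
    """Sample z = mu + T * sigma * eps from the prior; a lower
    temperature trades sample diversity for stability."""
    return mu + temperature * sigma * rng.standard_normal(mu.shape)

def scale_durations(durations, rate=1.0):
    """Control speaking rate by scaling predicted token durations;
    rate > 1 speeds speech up, rate < 1 slows it down. The rounding
    and minimum-duration clamp are illustrative assumptions."""
    return np.maximum(1, np.round(np.asarray(durations) / rate)).astype(int)
```

Pitch can similarly be shifted by adding a constant to the latent mean before decoding, since the flow decoder is deterministic given z.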
- Table1: The Mean Opinion Score (MOS) of single speaker TTS models with 95% confidence intervals
- Table2: The Mean Opinion Score (MOS) of a multi-speaker TTS with 95% confidence intervals
- Table3: Hyper-parameters of Glow-TTS. The total number of parameters is lower than that of FastSpeech (30.1M)
- Table4: Attention error counts for TTS models for 100 test sentences
- Text-to-Speech (TTS) Models. TTS models are a family of generative models that synthesize speech from text. One subclass of TTS models, including Tacotron 2 (Shen et al, 2018), Deep Voice 3 (Ping et al, 2017) and Transformer TTS (Li et al, 2019), generates a mel-spectrogram, a compressed representation of audio, from text. They produce natural speech comparable to the human voice. Another subclass, known as vocoders, has been developed to transform mel-spectrograms into high-fidelity audio waveforms (Shen et al, 2018; Van Den Oord et al, 2016) with fast synthesis speed (Kalchbrenner et al, 2018; Van Den Oord et al, 2017; Prenger et al, 2019). Enhancing the expressiveness of TTS models has also been studied: auxiliary embedding methods have been proposed to generate diverse speech by controlling factors such as intonation and rhythm (Skerry-Ryan et al, 2018; Wang et al, 2018), and some studies have aimed at synthesizing speech in the voices of various speakers (Jia et al, 2018; Gibiansky et al, 2017).
- Even though our method is difficult to parallelize, it runs fast on a CPU without the need for GPU execution. In our experiments, it takes less than 20 ms per iteration, which amounts to less than 2% of the total training time.
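Flow-based models like those discussed above are trained by exact maximum likelihood through the change-of-variables formula: log p(x) = log p(z) + log|det dz/dx|. Below is a minimal NumPy sketch of one affine coupling step (RealNVP/Glow style), where the log-determinant is simply the sum of the log-scales; this illustrates the family of transforms, not Glow-TTS's actual decoder architecture.

```python
import numpy as np

def affine_coupling_forward(x, log_s, t):
    """One affine coupling step: the first half of x passes through
    unchanged, the second half is scaled and shifted elementwise.
    log_s and t would normally be predicted by a network from xa."""
    xa, xb = np.split(x, 2)
    za = xa
    zb = xb * np.exp(log_s) + t
    logdet = np.sum(log_s)  # Jacobian is triangular with diag exp(log_s)
    return np.concatenate([za, zb]), logdet

def log_likelihood(z, logdet):
    """Exact log-likelihood under a standard normal prior on z."""
    d = z.size
    log_pz = -0.5 * (d * np.log(2 * np.pi) + np.sum(z ** 2))
    return log_pz + logdet
```

Because each coupling step is invertible in closed form, the same stack maps noise back to data at synthesis time, which is what enables parallel generation.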
Study subjects and analysis
For single speaker TTS, we train our model on the widely used single female speaker dataset LJSpeech (Ito, 2017), which consists of 13100 short audio clips with a total duration of approximately 24 hours. We randomly split the dataset into a training set (12500 samples), a validation set (100 samples), and a test set (500 samples).
For multi-speaker TTS, we use the train-clean-100 subset of the LibriTTS corpus (Zen et al, 2019), which consists of about 54 hours of audio recorded by 247 speakers. We first trim the beginning and ending silence of all the audio clips, then filter out samples whose text length exceeds 190, and split the rest into training (29181 samples), validation (88 samples), and test (442 samples) sets. Additionally, we collect out-of-distribution text data for a robustness test.
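The LibriTTS preprocessing described above (length filtering, then a fixed split) can be sketched as follows. This is an illustrative reconstruction: the silence trimming step is omitted, and the exact split procedure and random seed used by the authors are unknown.

```python
import random

def split_dataset(items, n_val=88, n_test=442, max_text_len=190, seed=0):
    """Sketch of the described preprocessing: drop samples whose text
    exceeds max_text_len characters, shuffle, then carve off fixed-size
    validation and test sets. Seed and shuffle order are assumptions."""
    kept = [it for it in items if len(it["text"]) <= max_text_len]
    random.Random(seed).shuffle(kept)
    val = kept[:n_val]
    test = kept[n_val:n_val + n_test]
    train = kept[n_val + n_test:]
    return train, val, test
```

With the real train-clean-100 metadata this would yield the 29181/88/442 split quoted above, assuming the same filter threshold.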
- Battenberg, E., Skerry-Ryan, R., Mariooryad, S., Stanton, D., Kao, D., Shannon, M., and Bagby, T. Locationrelative attention mechanisms for robust long-form speech synthesis. arXiv preprint arXiv:1910.10288, 2019.
- Dinh, L., Krueger, D., and Bengio, Y. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
- Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
- Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. In Advances in Neural Information Processing Systems, pp. 7509–7520, 2019.
- Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., Raiman, J., and Zhou, Y. Deep voice 2: Multispeaker neural text-to-speech. In Advances in neural information processing systems, pp. 2962–2970, 2017.
- Gu, J., Bradbury, J., Xiong, C., Li, V. O., and Socher, R. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.
- Hoogeboom, E., Berg, R. v. d., and Welling, M. Emerging convolutions for generative normalizing flows. arXiv preprint arXiv:1901.11137, 2019.
- Ito, K. The lj speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
- Jia, Y., Zhang, Y., Weiss, R., Wang, Q., Shen, J., Ren, F., Nguyen, P., Pang, R., Moreno, I. L., Wu, Y., et al. Transfer learning from speaker verification to multispeaker textto-speech synthesis. In Advances in neural information processing systems, pp. 4480–4490, 2018.
- Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., Oord, A. v. d., Dieleman, S., and Kavukcuoglu, K. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435, 2018.
- Kim, S., Lee, S.-g., Song, J., Kim, J., and Yoon, S. Flowavenet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
- Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.
- Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.
- Li, N., Liu, S., Liu, Y., Zhao, S., and Liu, M. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6706–6713, 2019.
- Ma, M., Zheng, B., Liu, K., Zheng, R., Liu, H., Peng, K., Church, K., and Huang, L. Incremental text-to-speech synthesis with prefix-to-prefix framework. arXiv preprint arXiv:1911.02750, 2019.
- Miao, C., Liang, S., Chen, M., Ma, J., Wang, S., and Xiao, J. Flow-tts: A non-autoregressive network for text to speech based on flow. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7209–7213. IEEE, 2020.
- Oord, A. v. d., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017.
- Peng, K., Ping, W., Song, Z., and Zhao, K. Parallel neural text-to-speech. arXiv preprint arXiv:1905.08459, 2019.
- Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., and Miller, J. Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654, 2017.
- Prenger, R., Valle, R., and Catanzaro, B. Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. IEEE, 2019.
- Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. Fastspeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pp. 3165–3174, 2019.
- Serra, J., Pascual, S., and Perales, C. S. Blow: a singlescale hyperconditioned flow for non-parallel raw-audio voice conversion. In Advances in Neural Information Processing Systems, pp. 6790–6800, 2019.
- Shaw, P., Uszkoreit, J., and Vaswani, A. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
- Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerrv-Ryan, R., et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. IEEE, 2018.
- Skerry-Ryan, R., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., Shor, J., Weiss, R. J., Clark, R., and Saurous, R. A. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. arXiv preprint arXiv:1803.09047, 2018.
- Valle, R. Tacotron 2. https://github.com/NVIDIA/tacotron2, 2018.
- Valle, R. Waveglow. https://github.com/NVIDIA/waveglow, 2019.
- Valle, R., Shih, K., Prenger, R., and Catanzaro, B. Flowtron: an autoregressive flow-based generative network for textto-speech synthesis. arXiv preprint arXiv:2005.05957, 2020.
- Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. In SSW, pp. 125, 2016.
- Van Den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G. v. d., Lockhart, E., Cobo, L. C., Stimberg, F., et al. Parallel wavenet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
- Wang, Y., Stanton, D., Zhang, Y., Skerry-Ryan, R., Battenberg, E., Shor, J., Xiao, Y., Ren, F., Jia, Y., and Saurous, R. A. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017, 2018.
- Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., and Wu, Y. Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882, 2019.