Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis
Abstract:
We describe a sequence-to-sequence neural network which can directly generate speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length frames, each one containing hundreds of samples. […]
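As a minimal sketch (not the authors' implementation) of the decoding scheme described above: the decoder runs autoregressively over non-overlapping fixed-length waveform frames, and each frame is produced by inverting a normalizing flow conditioned on the decoder state. The frame length `K` and the functions `decoder_step` and `inverse_flow` are hypothetical placeholders.

```python
import numpy as np

K = 960  # hypothetical frame length in samples (each frame holds hundreds of samples)

def synthesize(text_encoding, n_frames, decoder_step, inverse_flow):
    """Generate a waveform frame by frame; `decoder_step` and `inverse_flow`
    stand in for the trained attention decoder and the flow's inverse pass."""
    state = None
    prev_frame = np.zeros(K, dtype=np.float32)
    frames = []
    for _ in range(n_frames):
        # The attention-based decoder consumes the previous frame and the text
        # encoding, producing conditioning features for the flow.
        cond, state = decoder_step(prev_frame, text_encoding, state)
        # Sample latent noise and invert the flow to obtain all K waveform
        # samples of the frame in parallel.
        z = np.random.randn(K).astype(np.float32)
        prev_frame = inverse_flow(z, cond)
        frames.append(prev_frame)
    return np.concatenate(frames)
```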
Introduction
- Modern text-to-speech (TTS) synthesis systems, using deep neural networks trained on large quantities of data, are able to generate natural speech.
- A synthesis model predicts intermediate audio features from text, typically spectrograms [1, 2], vocoder features [3], or linguistic features [4], controlling the long-term structure of the generated speech
- This is followed by a neural vocoder which converts the features to time-domain waveform samples, filling in low-level signal detail (see the sketch after this list).
- Other approaches have eschewed probabilistic models and used GANs [10, 11], or carefully constructed spectral losses [12, 13, 14]
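To make the contrast concrete, here is a schematic sketch of the conventional two-stage pipeline versus the end-to-end approach of this paper; the function names are illustrative placeholders, not an actual API.

```python
def two_stage_tts(text, feature_model, vocoder):
    # Stage 1: a synthesis model predicts intermediate audio features
    # (e.g. a mel spectrogram) that set the long-term structure.
    features = feature_model(text)
    # Stage 2: a separately trained neural vocoder fills in low-level
    # signal detail and emits time-domain waveform samples.
    return vocoder(features)

def end_to_end_tts(text, wave_tacotron):
    # Wave-Tacotron: a single sequence-to-sequence model maps text
    # directly to waveform samples, with no fixed intermediate features.
    return wave_tacotron(text)
```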
Highlights
- Modern text-to-speech (TTS) synthesis systems, using deep neural networks trained on large quantities of data, are able to generate natural speech
- As the results below show, mel spectral distortion (MSD) and mel cepstral distortion (MCD) tend to correlate negatively with subjective mean opinion score (MOS), so we primarily focus on MOS (see the MCD sketch after this list)
- We have proposed a model for end-to-end text-to-speech waveform synthesis, incorporating a normalizing flow into the autoregressive Tacotron decoder loop
- The hybrid model structure combines the simplicity of attention-based TTS models with the parallel generation capabilities of a normalizing flow to generate waveform samples directly
- Exploring the feasibility of adapting the proposed model to fully parallel TTS generation remains an interesting direction for future work
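For reference, a minimal sketch of the MCD metric mentioned above, following the standard formulation [30]; it assumes the reference and synthesized cepstra are already time-aligned (the paper aligns them with dynamic time warping [31]) and excludes the 0th (energy) coefficient.

```python
import numpy as np

def mel_cepstral_distortion(mfcc_ref, mfcc_syn):
    """MCD in dB between two aligned cepstral sequences of shape
    (n_frames, n_coeffs), excluding the 0th coefficient."""
    diff = mfcc_ref - mfcc_syn
    # Per-frame distortion, then average over frames.
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```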
Methods
- The authors experiment with two single-speaker datasets: a proprietary dataset containing about 39 hours of speech from a professional female voice talent, sampled at 24 kHz, which was used in previous studies [4, 1, 2], and the public LJ Speech dataset [28].
- As a lower bound on performance, the authors jointly train a Tacotron model with a post-net (Tacotron-PN), consisting of a 20-layer non-causal WaveNet stack split into two dilation cycles
- This converts the mel spectrograms output by the decoder to full linear-frequency spectrograms, which are then inverted to waveform samples using 100 iterations of the Griffin-Lim algorithm [29], similar to [1]
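An illustrative sketch of this inversion step using librosa's Griffin-Lim implementation in place of the authors' own; the STFT parameters below are placeholders rather than the paper's settings.

```python
import librosa

def invert_linear_spectrogram(linear_mag, n_fft=2048, hop_length=300):
    """Recover a waveform from a predicted magnitude spectrogram of shape
    (1 + n_fft // 2, n_frames) using 100 Griffin-Lim iterations [29]."""
    return librosa.griffinlim(
        linear_mag,
        n_iter=100,
        n_fft=n_fft,
        hop_length=hop_length,
    )
```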
Conclusion
- The authors have proposed a model for end-to-end text-to-speech waveform synthesis, incorporating a normalizing flow into the autoregressive Tacotron decoder loop.
- Although the network structure exposes the output frame size K as a hyperparameter, the decoder remains fundamentally autoregressive, requiring sequential generation of output frames (see the arithmetic sketch after this list).
- This puts the approach at a disadvantage compared to recent advances in parallel TTS [32, 33], unless the output step size can be made very large.
- It would be interesting to explore more efficient alternatives to flows in a similar text-to-waveform setting, e.g., diffusion probabilistic models [34] or GANs [11, 19], which can be optimized in a mode-seeking fashion that is likely to be more efficient than modeling the full data distribution
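A back-of-the-envelope illustration of why the output step size matters: the number of sequential decoder steps shrinks linearly as the frame size K grows. The frame sizes and utterance length below are illustrative values only.

```python
import math

sample_rate = 24_000  # Hz, as for the proprietary dataset
duration_s = 5.0      # seconds of speech to generate

for K in (240, 960, 1920):  # hypothetical frame sizes in samples
    steps = math.ceil(duration_s * sample_rate / K)
    print(f"K={K:4d} samples -> {steps} sequential decoder steps")
```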
Tables
- Table1: TTS performance on the proprietary single speaker dataset
- Table2: TTS performance on LJ Speech with character inputs
- Table3: Generation speed in seconds, comparing one TPU v3 core to a 6-core Intel Xeon W-2135 CPU, generating 5 seconds of speech conditioned on 90 input tokens, batch size 1. Average of 500 trials
- Table4: Ablations on the proprietary dataset using phone inputs and a shallow decoder residual LSTM stack of 2 layers with 256 units. Unless otherwise specified, samples are generated using T = 0.8
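For context on Table 4, a sketch of what the sampling temperature T typically controls in a flow-based decoder (the convention used in Glow-style models [24]): the latent noise is drawn from a Gaussian prior whose standard deviation is scaled by T before the flow is inverted. `inverse_flow` is a hypothetical stand-in for the trained flow's inverse pass.

```python
import numpy as np

def sample_frame(inverse_flow, cond, frame_size, temperature=0.8, rng=None):
    """Draw one waveform frame: scale the Gaussian prior by the temperature,
    then map the noise through the inverse flow under conditioning `cond`."""
    rng = np.random.default_rng() if rng is None else rng
    z = temperature * rng.standard_normal(frame_size).astype(np.float32)
    return inverse_flow(z, cond)
```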
References
- Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: Towards end-to-end speech synthesis,” in Proc. Interspeech, 2017.
- J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, 2018, pp. 4779–4783.
- J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2Wav: End-to-end speech synthesis,” in Proc. International Conference on Learning Representations, 2017.
- A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, et al., “WaveNet: A generative model for raw audio,” CoRR abs/1609.03499, 2016.
- N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, et al., “Efficient neural audio synthesis,” in Proc. International Conference on Machine Learning, 2018.
- A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, et al., “Parallel WaveNet: Fast High-Fidelity Speech Synthesis,” in Proc. International Conference on Machine Learning, 2018.
- W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech,” in Proc. International Conference on Learning Representations, 2019.
- S. Kim, S.-G. Lee, J. Song, J. Kim, and S. Yoon, “FloWaveNet: A generative flow for raw audio,” in Proc. International Conference on Machine Learning, 2019.
- R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A Flow-based Generative Network for Speech Synthesis,” in Proc. ICASSP, 2019.
- K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, et al., “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” in Advances in Neural Information Processing Systems, 2019.
- M. Binkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo, and K. Simonyan, “High Fidelity Speech Synthesis with Adversarial Networks,” in Proc. International Conference on Learning Representations, 2020.
- S. O. Arık, H. Jun, and G. Diamos, “Fast spectrogram inversion using multi-head convolutional neural networks,” IEEE Signal Processing Letters, vol. 26, no. 1, pp. 94–98, 2018.
- X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” in Proc. ICASSP, 2019, pp. 5916–5920.
- A. A. Gritsenko, T. Salimans, R. v. d. Berg, J. Snoek, and N. Kalchbrenner, “A Spectral Energy Distance for Parallel Speech Synthesis,” arXiv preprint arXiv:2008.01160, 2020.
- C. Miao, S. Liang, M. Chen, J. Ma, S. Wang, and J. Xiao, “Flow-TTS: A non-autoregressive network for text to speech based on flow,” in Proc. ICASSP, 2020, pp. 7209–7213.
- R. Valle, K. Shih, R. Prenger, and B. Catanzaro, “Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis,” arXiv preprint arXiv:2005.05957, 2020.
- J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-TTS: A generative flow for text-to-speech via monotonic alignment search,” arXiv preprint arXiv:2005.11129, 2020.
- Y. Ren, C. Hu, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text-to-speech,” arXiv preprint arXiv:2006.04558, 2020.
- J. Donahue, S. Dieleman, M. Binkowski, E. Elsen, and K. Simonyan, “End-to-end adversarial text-to-speech,” arXiv preprint arXiv:2006.03575, 2020.
- D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Proc. International Conference on Machine Learning, 2015, pp. 1530–1538.
- J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015.
- E. Battenberg, R. Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, et al., “Location-relative attention mechanisms for robust long-form speech synthesis,” in Proc. ICASSP, 2020.
- L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” in Proc. International Conference on Learning Representations, 2015.
- D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in Advances in Neural Information Processing Systems, 2018, pp. 10215–10224.
- L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using Real NVP,” in Proc. International Conference on Learning Representations, 2017.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017.
- B. Uria, I. Murray, and H. Larochelle, “RNADE: The real-valued neural autoregressive density-estimator,” in Advances in Neural Information Processing Systems, 2013, pp. 2175–2183.
- K. Ito, “The LJ Speech Dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
- D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, 1984.
- R. Kubichek, “Mel-cepstral distance measure for objective speech quality assessment,” in Proc. IEEE Pacific Rim Conf. on Communications Computers and Signal Processing, 1993.
- D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series,” in Proc. International Conference on Knowledge Discovery and Data Mining, 1994, pp. 359–370.
- Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech: Fast, robust and controllable text to speech,” in Advances in Neural Information Processing Systems, 2019.
- C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, et al., “DurIAN: Duration informed attention network for multimodal synthesis,” arXiv preprint arXiv:1909.01700, 2019.
- N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” arXiv preprint arXiv:2009.00713, 2020.