Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

RJ Skerry-Ryan
Eric Battenberg
Soroosh Mariooryad
Diederik P. Kingma

Abstract:

We describe a sequence-to-sequence neural network which can directly generate speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length frames, each one containing hundreds of samples.
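
To make the decoder structure described in the abstract concrete, the following is a minimal NumPy sketch of the generation loop: at each autoregressive step a conditioning vector summarizes the text and preceding frames, a latent vector is drawn from a temperature-scaled Gaussian prior (T = 0.8 matches the sampling temperature mentioned in Table 4), and an invertible flow maps it to one frame of K waveform samples. The single affine "flow", the frame size K = 320, and all function names are illustrative assumptions for the sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 320       # waveform samples per non-overlapping output frame (illustrative value)
D_COND = 64   # size of the decoder conditioning vector (illustrative value)

# Toy stand-ins for the learned parameters of a single conditional affine transform.
# A real flow stacks many invertible layers (e.g., coupling layers as in [23, 24, 25]).
W_shift = rng.normal(scale=0.01, size=(D_COND, K))
W_scale = rng.normal(scale=0.01, size=(D_COND, K))

def flow_inverse(z, cond):
    """Map latent noise z to one frame of waveform samples, given the decoder state."""
    shift = cond @ W_shift
    log_scale = cond @ W_scale
    return z * np.exp(log_scale) + shift

def decoder_step(prev_frame, text_context):
    """Toy autoregressive state update: mix the text context with the previous frame."""
    return np.tanh(text_context + 0.1 * prev_frame.mean())

def synthesize(text_context, n_frames, temperature=0.8):
    frames, prev = [], np.zeros(K)
    for _ in range(n_frames):
        cond = decoder_step(prev, text_context)   # condition on preceding frames
        z = temperature * rng.normal(size=K)      # sample the flow's Gaussian prior
        prev = flow_inverse(z, cond)              # samples within a frame are generated together
        frames.append(prev)
    return np.concatenate(frames)                 # stitch non-overlapping frames into a waveform

waveform = synthesize(rng.normal(size=D_COND), n_frames=10)
print(waveform.shape)  # (3200,)
```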

Introduction
  • Modern text-to-speech (TTS) synthesis systems, using deep neural networks trained on large quantities of data, are able to generate natural speech.
  • A synthesis model predicts intermediate audio features from text, typically spectrograms [1, 2], vocoder features [3], or linguistic features [4], controlling the long-term structure of the generated speech.
  • This is followed by a neural vocoder which converts the features to time-domain waveform samples, filling in low-level signal detail (a minimal sketch of this two-stage cascade follows this list).
  • Other approaches have eschewed probabilistic models, instead using GANs [10, 11] or carefully constructed spectral losses [12, 13, 14].
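
For illustration only, the conventional cascade described above can be written as two separately trained stages; the type aliases and function below are assumptions for this sketch, not an API from any cited system.

```python
from typing import Callable
import numpy as np

# Stage 1: synthesis model, text -> intermediate features (e.g., a mel spectrogram),
# controlling the long-term structure of the speech.
SynthesisModel = Callable[[str], np.ndarray]
# Stage 2: neural vocoder, features -> time-domain waveform samples, filling in signal detail.
NeuralVocoder = Callable[[np.ndarray], np.ndarray]

def cascade_tts(text: str, synthesis_model: SynthesisModel, vocoder: NeuralVocoder) -> np.ndarray:
    """Conventional two-stage TTS pipeline; Wave-Tacotron collapses this into a single model."""
    features = synthesis_model(text)
    return vocoder(features)
```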
Highlights
  • Modern text-to-speech (TTS) synthesis systems, using deep neural networks trained on large quantities of data, are able to generate natural speech.
  • As the tables below show, mel spectral distortion (MSD) and mel cepstral distortion (MCD) tend to correlate negatively with subjective mean opinion score (MOS), so we primarily focus on MOS (a sketch of the MCD computation follows this list).
  • We have proposed a model for end-to-end text-to-speech waveform synthesis, incorporating a normalizing flow into the autoregressive Tacotron decoder loop.
  • The hybrid model structure combines the simplicity of attention-based TTS models with the parallel generation capabilities of a normalizing flow to generate waveform samples directly.
  • Exploring the feasibility of adapting the proposed model to fully parallel TTS generation remains an interesting direction for future work.
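
Since the objective metrics are only named above, here is a hedged sketch of how mel cepstral distortion could be computed between a reference and a synthesized waveform, using the Kubichek formula [30] and dynamic time warping alignment [31]. The MFCC settings (13 coefficients, default librosa framing, 24 kHz) are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np
import librosa

def mcd(ref_wav: np.ndarray, syn_wav: np.ndarray, sr: int = 24000, n_mfcc: int = 13) -> float:
    """Mel cepstral distortion between two waveforms, with a simple DTW alignment."""
    # Drop the 0th (energy) coefficient and keep c_1..c_{n_mfcc}; frames along axis 0.
    ref = librosa.feature.mfcc(y=ref_wav, sr=sr, n_mfcc=n_mfcc + 1)[1:].T
    syn = librosa.feature.mfcc(y=syn_wav, sr=sr, n_mfcc=n_mfcc + 1)[1:].T
    # Align frames with dynamic time warping [31] before averaging the distortion.
    _, path = librosa.sequence.dtw(X=ref.T, Y=syn.T, metric='euclidean')
    diffs = ref[path[:, 0]] - syn[path[:, 1]]
    # Per-frame MCD in dB following Kubichek [30].
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diffs ** 2, axis=1))
    return float(per_frame.mean())
```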
Methods
  • The authors experiment with two single-speaker datasets: a proprietary dataset containing about 39 hours of speech, sampled at 24 kHz, from a professional female voice talent, used in previous studies [4, 1, 2], and the public LJ Speech dataset [28].
  • As a lower bound on performance, the authors jointly train a Tacotron model with a post-net (Tacotron-PN), consisting of a 20-layer non-causal WaveNet stack split into two dilation cycles.
  • This post-net converts mel spectrograms output by the decoder to full linear-frequency spectrograms, which are then inverted to waveform samples using 100 iterations of the Griffin-Lim algorithm [29], similar to [1] (a sketch of this inversion step follows this list).
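
Below is a minimal sketch of the Tacotron-PN baseline's inversion step using librosa: a predicted mel power spectrogram is mapped to a linear-frequency magnitude spectrogram, then phase is recovered with 100 Griffin-Lim iterations [29]. Only the 100-iteration count and the 24 kHz rate come from the text above; the STFT parameters are assumptions for illustration.

```python
import numpy as np
import librosa

SR = 24000     # matches the 24 kHz proprietary data; the STFT settings below are assumptions
N_FFT = 2048
HOP = 300

def invert_mel(mel_power_spec: np.ndarray) -> np.ndarray:
    """Invert a (n_mels, frames) power mel spectrogram to a waveform.

    Mirrors the two steps of the Tacotron-PN baseline described above: map mel to a
    full linear-frequency spectrogram, then run 100 Griffin-Lim iterations.
    """
    linear_mag = librosa.feature.inverse.mel_to_stft(
        mel_power_spec, sr=SR, n_fft=N_FFT, power=2.0
    )
    return librosa.griffinlim(linear_mag, n_iter=100, hop_length=HOP, win_length=N_FFT)
```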
Conclusion
  • The authors have proposed a model for end-to-end text-to-speech waveform synthesis, incorporating a normalizing flow into the autoregressive Tacotron decoder loop.
  • Although the network structure exposes the output frame size K as a hyperparameter, the decoder remains fundamentally autoregressive, requiring sequential generation of output frames.
  • This puts the approach at a disadvantage compared to recent advances in parallel TTS [32, 33], unless the output step size can be made very large.
  • It would be interesting to explore more efficient alternatives to flows in a similar text-to-waveform setting, e.g., diffusion probabilistic models [34] or GANs [11, 19], which can be optimized in a mode-seeking fashion that is likely to be more efficient than modeling the full data distribution.
Tables
  • Table 1: TTS performance on the proprietary single-speaker dataset.
  • Table 2: TTS performance on LJ Speech with character inputs.
  • Table 3: Generation speed in seconds, comparing one TPU v3 core to a 6-core Intel Xeon W-2135 CPU, generating 5 seconds of speech conditioned on 90 input tokens with batch size 1. Average of 500 trials.
  • Table 4: Ablations on the proprietary dataset using phone inputs and a shallow decoder residual LSTM stack of 2 layers with 256 units. Unless otherwise specified, samples are generated using T = 0.8.
References
  • [1] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., "Tacotron: Towards end-to-end speech synthesis," in Proc. Interspeech, 2017.
  • [2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779–4783.
  • [3] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2Wav: End-to-end speech synthesis," in Proc. International Conference on Learning Representations, 2017.
  • [4] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, et al., "WaveNet: A generative model for raw audio," CoRR abs/1609.03499, 2016.
  • [5] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, et al., "Efficient neural audio synthesis," in Proc. International Conference on Machine Learning, 2018.
  • [6] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proc. International Conference on Machine Learning, 2018.
  • [7] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. International Conference on Learning Representations, 2019.
  • [8] S. Kim, S.-G. Lee, J. Song, J. Kim, and S. Yoon, "FloWaveNet: A generative flow for raw audio," in Proc. International Conference on Machine Learning, 2019.
  • [9] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in Proc. ICASSP, 2019.
  • [10] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, et al., "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Advances in Neural Information Processing Systems, 2019.
  • [11] M. Binkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo, and K. Simonyan, "High fidelity speech synthesis with adversarial networks," in Proc. International Conference on Learning Representations, 2020.
  • [12] S. O. Arık, H. Jun, and G. Diamos, "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Processing Letters, vol. 26, no. 1, pp. 94–98, 2018.
  • [13] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter-based waveform model for statistical parametric speech synthesis," in Proc. ICASSP, 2019, pp. 5916–5920.
  • [14] A. A. Gritsenko, T. Salimans, R. v. d. Berg, J. Snoek, and N. Kalchbrenner, "A spectral energy distance for parallel speech synthesis," arXiv preprint arXiv:2008.01160, 2020.
  • [15] C. Miao, S. Liang, M. Chen, J. Ma, S. Wang, and J. Xiao, "Flow-TTS: A non-autoregressive network for text to speech based on flow," in Proc. ICASSP, 2020, pp. 7209–7213.
  • [16] R. Valle, K. Shih, R. Prenger, and B. Catanzaro, "Flowtron: An autoregressive flow-based generative network for text-to-speech synthesis," arXiv preprint arXiv:2005.05957, 2020.
  • [17] J. Kim, S. Kim, J. Kong, and S. Yoon, "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search," arXiv preprint arXiv:2005.11129, 2020.
  • [18] Y. Ren, C. Hu, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," arXiv preprint arXiv:2006.04558, 2020.
  • [19] J. Donahue, S. Dieleman, M. Binkowski, E. Elsen, and K. Simonyan, "End-to-end adversarial text-to-speech," arXiv preprint arXiv:2006.03575, 2020.
  • [20] D. Rezende and S. Mohamed, "Variational inference with normalizing flows," in Proc. International Conference on Machine Learning, 2015, pp. 1530–1538.
  • [21] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015.
  • [22] E. Battenberg, R. Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, et al., "Location-relative attention mechanisms for robust long-form speech synthesis," in Proc. ICASSP, 2020.
  • [23] L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear independent components estimation," in Proc. International Conference on Learning Representations, 2015.
  • [24] D. P. Kingma and P. Dhariwal, "Glow: Generative flow with invertible 1x1 convolutions," in Advances in Neural Information Processing Systems, 2018, pp. 10215–10224.
  • [25] L. Dinh, J. Sohl-Dickstein, and S. Bengio, "Density estimation using Real NVP," in Proc. International Conference on Learning Representations, 2017.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017.
  • [27] B. Uria, I. Murray, and H. Larochelle, "RNADE: The real-valued neural autoregressive density-estimator," in Advances in Neural Information Processing Systems, 2013, pp. 2175–2183.
  • [28] K. Ito, "The LJ Speech Dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
  • [29] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, 1984.
  • [30] R. Kubichek, "Mel-cepstral distance measure for objective speech quality assessment," in Proc. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 1993.
  • [31] D. J. Berndt and J. Clifford, "Using dynamic time warping to find patterns in time series," in Proc. International Conference on Knowledge Discovery and Data Mining, 1994, pp. 359–370.
  • [32] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech: Fast, robust and controllable text to speech," in Advances in Neural Information Processing Systems, 2019.
  • [33] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, et al., "DurIAN: Duration informed attention network for multimodal synthesis," arXiv preprint arXiv:1909.01700, 2019.
  • [34] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, "WaveGrad: Estimating gradients for waveform generation," arXiv preprint arXiv:2009.00713, 2020.