Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

Sun Guangzhi
Rosenberg Andrew

ICASSP, pp. 6699-6703, 2020.

DOI: https://doi.org/10.1109/ICASSP40776.2020.9053436

Abstract:

Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard ...
