NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
arXiv (Cornell University) (2024)
Abstract
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering that speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate the attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
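To make the idea of factorized vector quantization concrete, here is a minimal sketch, assuming each attribute already has its own latent sequence and its own codebook. It is not the codec described in the abstract: the encoder, the disentanglement losses, and the diffusion model are omitted, and the names `quantize`, `factorized_quantize`, and `ATTRIBUTES`, as well as all sizes, are hypothetical choices for illustration only.

```python
import numpy as np

# Attribute names taken from the abstract; treating each one as a separate
# frame-level latent sequence is an assumption made for this sketch.
ATTRIBUTES = ("content", "prosody", "timbre", "acoustic_details")


def quantize(x, codebook):
    """Nearest-neighbour vector quantization.

    x:        (T, D) array of latent frames
    codebook: (K, D) array of code vectors
    Returns the chosen code indices (T,) and the quantized frames (T, D).
    """
    # Squared Euclidean distance between every frame and every code vector.
    dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    idx = dist.argmin(axis=1)
    return idx, codebook[idx]


def factorized_quantize(latents, codebooks):
    """Quantize each attribute subspace with its own codebook (hypothetical helper)."""
    return {name: quantize(latents[name], codebooks[name]) for name in ATTRIBUTES}


# Toy sizes and random data, purely illustrative.
rng = np.random.default_rng(0)
T, D, K = 5, 4, 8  # frames, latent dimension, codebook size
codebooks = {name: rng.normal(size=(K, D)) for name in ATTRIBUTES}
latents = {name: rng.normal(size=(T, D)) for name in ATTRIBUTES}

result = factorized_quantize(latents, codebooks)
for name, (idx, _) in result.items():
    print(name, idx)
```

Each attribute ends up with its own sequence of discrete code indices, which is the kind of per-subspace representation a downstream generator could model separately.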
Key words
Acoustic Modeling, Speaker Diarization, Speech Enhancement, Audio-Visual Speech Recognition, Noise Reduction