NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Computing Research Repository (CoRR), 2024
University of Science and Technology of China | Microsoft Research Asia | The Chinese University of Hong Kong | Microsoft | The University of Tokyo | Xi'an Jiaotong University | Zhejiang University | Peking University | Tsinghua University
The authors of this paper include Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiangyang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, and Jinyu Li. They are affiliated with institutions including the University of Science and Technology of China, the Chinese University of Hong Kong, Zhejiang University, and Microsoft Research Asia, with research interests spanning music generation, multimodal learning, speech recognition, speaker recognition, text-to-speech, machine translation, deep learning, natural language processing, and computer vision.
1. Abstract
- Propose NaturalSpeech 3, a zero-shot speech synthesis system based on factorized codec and diffusion models.
- Utilize factorized vector quantization (FVQ) to decompose speech waveforms into subspaces such as content, prosody, timbre, and acoustic details.
- Introduce a factorized diffusion model to generate attributes in each subspace based on the corresponding prompts.
- Outperforms existing TTS systems in terms of speech quality, similarity, prosody, and intelligibility.
2. Introduction
- Current TTS systems have shortcomings in speech quality, similarity, and prosody.
- Speech contains multiple attributes like content, prosody, timbre, and acoustic details, which are challenging to model.
- Decomposing speech into different attribute subspaces and generating them separately is a natural solution to this problem.
3. NaturalSpeech 3
3.1 Overall Architecture
- NaturalSpeech 3 consists of two parts:
- A neural speech codec (FACodec) for attribute decoupling.
- A factorized diffusion model for generating factorized speech attributes.
3.2 FACodec
- FACodec decomposes speech waveforms into subspaces such as content, prosody, timbre, and acoustic details.
- Implements techniques like information bottleneck, supervision, gradient reversal, and detail dropout for better attribute decoupling.
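Of these techniques, the gradient reversal layer is the simplest to illustrate: it is an identity in the forward pass, but it negates (and optionally scales) the gradient in the backward pass, so the encoder is trained to *remove* whatever the adversarial classifier can detect. A minimal sketch in plain NumPy, with a hypothetical scale factor `lam` (not taken from the paper):

```python
import numpy as np

def grad_reverse_forward(x):
    """Identity in the forward pass: features flow unchanged
    into the adversarial classifier."""
    return x

def grad_reverse_backward(grad_output, lam=1.0):
    """Backward pass: negate and scale the gradient, pushing the
    encoder AWAY from whatever the adversary has learned."""
    return -lam * grad_output

# Toy check: features are unchanged, gradients are flipped.
x = np.array([0.5, -1.2, 3.0])
assert np.allclose(grad_reverse_forward(x), x)
assert np.allclose(grad_reverse_backward(np.ones(3), lam=0.5), -0.5 * np.ones(3))
```

In an autograd framework this would be implemented as a custom function whose backward hook applies the negation; the sketch above only shows the two passes explicitly.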
3.3 Factorized Diffusion Model
- Generates attributes in each subspace based on the corresponding prompts.
- Uses discrete diffusion for generation and applies classifier-free guidance to improve quality.
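A common forward process for discrete diffusion over token sequences is progressive masking: at step t, each token is independently replaced by a special [MASK] token with a probability given by the noise schedule, and the model learns to reverse this. A toy sketch assuming a simple linear schedule (the paper's actual schedule may differ):

```python
import numpy as np

MASK = -1  # id of a special [MASK] token (illustrative choice)

def mask_tokens(tokens, t, T, rng):
    """Forward process of mask-based discrete diffusion: at step t,
    each token is independently replaced by [MASK] with probability
    t / T (linear schedule, assumed here for illustration)."""
    keep = rng.random(tokens.shape) >= t / T
    return np.where(keep, tokens, MASK)

rng = np.random.default_rng(0)
tokens = np.arange(12)
# t = 0 leaves everything intact; t = T masks everything.
assert (mask_tokens(tokens, 0, 10, rng) == tokens).all()
assert (mask_tokens(tokens, 10, 10, rng) == MASK).all()
```

The reverse process then iteratively predicts and fills in the masked positions, conditioned on the attribute prompts.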
3.4 Relation to the NaturalSpeech Series
- NaturalSpeech 3 is the latest version in the NaturalSpeech series aimed at generating natural speech.
- Compared to NaturalSpeech 1 and NaturalSpeech 2, NaturalSpeech 3 has improvements in architecture, speech representation, and generation methods.
4. Experiments and Results
- Evaluate the performance of NaturalSpeech 3 on the LibriSpeech and RAVDESS datasets.
- Results show that NaturalSpeech 3 outperforms existing TTS systems in terms of speech quality, similarity, prosody, and intelligibility.
5. Conclusion
- NaturalSpeech 3 is an effective zero-shot speech synthesis system that outperforms existing TTS systems in speech quality, similarity, prosody, and intelligibility.
- Future work will explore more attributes, data coverage, and neural speech codecs.
Q: What specific research methods were used in the paper?
1. Neural Speech Codec (FACodec)
- Subspace decomposition of content, prosody, timbre, and acoustic details: Utilizes factorized vector quantization (FVQ) to decompose speech waveforms into different subspaces, each representing a different attribute.
- Information bottleneck: Ensures that each code embedding contains less information by projecting the encoder output into a low-dimensional space and quantizing it, thus promoting information decoupling.
- Supervision: Introduces auxiliary tasks for each attribute, such as predicting pitch, phoneme labels, and speaker ID, to achieve high-quality speech decoupling.
- Gradient reversal: Uses adversarial classifiers and gradient reversal layers (GRL) to eliminate unwanted information in the latent space, such as content or prosody information.
- Detail dropout: Balances decoupling and reconstruction quality by randomly masking representations in the acoustic details subspace.
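The information-bottleneck step above can be sketched as a down-projection followed by nearest-neighbor codebook lookup; because the quantized space is low-dimensional and finite, each code can carry only limited information. A minimal NumPy sketch with illustrative dimensions (the projection matrix, codebook size, and dims are assumptions, not the paper's values):

```python
import numpy as np

def fvq_quantize(h, down_proj, codebook):
    """Information bottleneck: project an encoder frame h (D-dim)
    into a low-dimensional space, then snap it to the nearest
    codebook entry. Returns (code index, quantized vector)."""
    z = h @ down_proj                       # D -> d, with d << D
    d2 = ((codebook - z) ** 2).sum(axis=1)  # squared distance to each entry
    idx = int(d2.argmin())
    return idx, codebook[idx]

rng = np.random.default_rng(0)
D, d, K = 256, 8, 1024  # dims and codebook size are illustrative
down_proj = rng.standard_normal((D, d)) / np.sqrt(D)
codebook = rng.standard_normal((K, d))
idx, zq = fvq_quantize(rng.standard_normal(D), down_proj, codebook)
assert 0 <= idx < K and zq.shape == (d,)
```

Running one such quantizer per attribute subspace (content, prosody, acoustic details) yields the factorized code streams.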
2. Factorized Diffusion Model
- Discrete diffusion: Uses discrete diffusion models to generate representations for each factorized speech attribute, such as duration, content, prosody, and acoustic details.
- Conditional generation: Controls the generation process using prompts related to the corresponding attributes, such as duration prompts, content prompts, and acoustic detail prompts.
- Classifier-free guidance: prompts are randomly dropped during training; at inference, the unconditional and conditional predictions are combined to steer generation toward the prompt, thereby improving generation quality.
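The guidance trick above (train sometimes without prompts, combine both predictions at inference) is classifier-free guidance; at sampling time the model extrapolates from the unconditional output toward the conditional one. A minimal sketch on logits, with a hypothetical guidance weight `w`:

```python
import numpy as np

def guided_logits(logits_cond, logits_uncond, w=1.5):
    """Classifier-free guidance at inference: extrapolate from the
    unconditional prediction toward the conditional one. w = 1
    recovers the purely conditional logits; w > 1 strengthens the
    prompt's influence."""
    return logits_uncond + w * (logits_cond - logits_uncond)

lc = np.array([2.0, 0.5, -1.0])  # logits with the prompt
lu = np.array([1.0, 1.0, 1.0])   # logits without the prompt
assert np.allclose(guided_logits(lc, lu, w=1.0), lc)
```

The same combination applies per diffusion step, before the categorical sampling of each masked token.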
3. Speech Attribute Manipulation
- Attribute prompts: Controls speech attributes by selecting different attribute prompts, such as timbre, prosody, and duration.
- In-context learning: converts text prompts into speech attribute prompts using natural language processing models.
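Because the attributes are factorized into separate code streams, manipulation amounts to assembling the conditioning from different prompt utterances, e.g. one speaker's timbre with another utterance's prosody. A toy sketch with hypothetical code streams (names, shapes, and codebook size are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical factorized codes for two prompt utterances; in a
# factorized codec each attribute lives in its own token stream.
prompt_a = {"timbre": rng.integers(0, 1024, 1),    # global timbre code
            "prosody": rng.integers(0, 1024, 50)}  # per-frame prosody codes
prompt_b = {"timbre": rng.integers(0, 1024, 1),
            "prosody": rng.integers(0, 1024, 50)}

# Mix-and-match conditioning: A's voice with B's prosody.
condition = {"timbre": prompt_a["timbre"], "prosody": prompt_b["prosody"]}
assert (condition["timbre"] == prompt_a["timbre"]).all()
assert (condition["prosody"] == prompt_b["prosody"]).all()
```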
Q: What are the main research findings and achievements?
Speech Quality
- NaturalSpeech 3 achieved speech quality comparable to or better than real recordings on the LibriSpeech test set.
- NaturalSpeech 3 achieved significant improvements in speech quality compared to baseline models.
Speech Similarity
- NaturalSpeech 3 achieved a new SOTA in similarity between synthesized speech and prompt speech.
- NaturalSpeech 3 achieved significant improvements in speech similarity compared to baseline models.
Prosody
- NaturalSpeech 3 achieved significant prosody improvements over other TTS systems, reducing average MCD by 0.16 and raising SMOS by 0.21.
Controllability
- NaturalSpeech 3 can control speech attributes by selecting different attribute prompts, such as timbre, prosody, and duration.
- NaturalSpeech 3 has higher controllability compared to baseline models.
Q: What are the current limitations of this research?
Attribute Coverage
- The current model cannot extract attributes such as background noise.
- Future work will explore more attributes, such as energy and background noise.
Data Coverage
- The model was trained only on English corpora from LibriVox audiobooks.
- Future work will collect more diverse speech data to support multilingual TTS.
Neural Speech Codec
- Requires phoneme transcriptions for content supervision, which limits scalability.
- The decoupling ability has only been verified in zero-shot speech synthesis tasks.
- Future work will explore more general decoupling methods and additional tasks, such as zero-shot voice conversion and automatic speech recognition.