NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Computing Research Repository (CoRR), 2024
University of Science and Technology of China | Microsoft Research Asia | The Chinese University of Hong Kong | Microsoft | The University of Tokyo | Xi'an Jiaotong University | Zhejiang University | Peking University | Tsinghua University
The authors of this paper include Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiangyang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, and Jinyu Li. They are affiliated with institutions including the University of Science and Technology of China, the Chinese University of Hong Kong, Zhejiang University, and Microsoft Research Asia, with research interests spanning music generation, multimodal learning, speech recognition, speaker recognition, text-to-speech, machine translation, deep learning, natural language processing, and computer vision.
1. Abstract
- Propose NaturalSpeech 3, a zero-shot speech synthesis system based on factorized codec and diffusion models.
- Utilize factorized vector quantization (FVQ) to decompose speech waveforms into subspaces such as content, prosody, timbre, and acoustic details.
- Introduce a factorized diffusion model to generate attributes in each subspace based on the corresponding prompts.
- Outperforms existing TTS systems in terms of speech quality, similarity, prosody, and intelligibility.
2. Introduction
- Current TTS systems have shortcomings in speech quality, similarity, and prosody.
- Speech contains multiple attributes like content, prosody, timbre, and acoustic details, which are challenging to model.
- Decomposing speech into different attribute subspaces and generating them separately is a natural solution to this problem.
3. NaturalSpeech 3
3.1 Overall Architecture
- NaturalSpeech 3 consists of two parts:
- A neural speech codec (FACodec) for attribute decoupling.
- A factorized diffusion model for generating factorized speech attributes.
3.2 FACodec
- FACodec decomposes speech waveforms into subspaces such as content, prosody, timbre, and acoustic details.
- Implements techniques like information bottleneck, supervision, gradient reversal, and detail dropout for better attribute decoupling.
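Of these techniques, the gradient reversal layer is the simplest to illustrate: it is an identity in the forward pass, but it negates (and optionally scales) the gradient in the backward pass, so the encoder is trained to *remove* whatever the adversarial classifier can detect. A minimal sketch in plain NumPy, with a hypothetical scale factor `lam` (not taken from the paper):

```python
import numpy as np

def grad_reverse_forward(x):
    """Identity in the forward pass: features flow unchanged
    into the adversarial classifier."""
    return x

def grad_reverse_backward(grad_output, lam=1.0):
    """Backward pass: negate and scale the gradient, pushing the
    encoder AWAY from whatever the adversary has learned."""
    return -lam * grad_output

# Toy check: features are unchanged, gradients are flipped.
x = np.array([0.5, -1.2, 3.0])
assert np.allclose(grad_reverse_forward(x), x)
assert np.allclose(grad_reverse_backward(np.ones(3), lam=0.5), -0.5 * np.ones(3))
```

In an autograd framework this would be implemented as a custom function whose backward hook applies the negation; the sketch above only shows the two passes explicitly.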
3.3 Factorized Diffusion Model
- Generates attributes in each subspace based on the corresponding prompts.
- Uses discrete diffusion for generation and applies classifier-free guidance to improve quality.
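A common forward process for discrete diffusion over token sequences is progressive masking: at step t, each token is independently replaced by a special [MASK] token with a probability given by the noise schedule, and the model learns to reverse this. A toy sketch assuming a simple linear schedule (the paper's actual schedule may differ):

```python
import numpy as np

MASK = -1  # id of a special [MASK] token (illustrative choice)

def mask_tokens(tokens, t, T, rng):
    """Forward process of mask-based discrete diffusion: at step t,
    each token is independently replaced by [MASK] with probability
    t / T (linear schedule, assumed here for illustration)."""
    keep = rng.random(tokens.shape) >= t / T
    return np.where(keep, tokens, MASK)

rng = np.random.default_rng(0)
tokens = np.arange(12)
# t = 0 leaves everything intact; t = T masks everything.
assert (mask_tokens(tokens, 0, 10, rng) == tokens).all()
assert (mask_tokens(tokens, 10, 10, rng) == MASK).all()
```

The reverse process then iteratively predicts and fills in the masked positions, conditioned on the attribute prompts.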
3.4 Relation to the NaturalSpeech Series
- NaturalSpeech 3 is the latest version in the NaturalSpeech series aimed at generating natural speech.
- Compared to NaturalSpeech 1 and NaturalSpeech 2, NaturalSpeech 3 has improvements in architecture, speech representation, and generation methods.
4. Experiments and Results
- Evaluate the performance of NaturalSpeech 3 on the LibriSpeech and RAVDESS datasets.
- Results show that NaturalSpeech 3 outperforms existing TTS systems in terms of speech quality, similarity, prosody, and intelligibility.
5. Conclusion
- NaturalSpeech 3 is an effective zero-shot speech synthesis system that outperforms existing TTS systems in speech quality, similarity, prosody, and intelligibility.
- Future work will explore more attributes, data coverage, and neural speech codecs.
Q: What specific research methods were used in the paper?
1. Neural Speech Codec (FACodec)
- Subspace decomposition of content, prosody, timbre, and acoustic details: Utilizes factorized vector quantization (FVQ) to decompose speech waveforms into different subspaces, each representing a different attribute.
- Information bottleneck: Ensures that each code embedding contains less information by projecting the encoder output into a low-dimensional space and quantizing it, thus promoting information decoupling.
- Supervision: Introduces auxiliary tasks for each attribute, such as predicting pitch, phoneme labels, and speaker ID, to achieve high-quality speech decoupling.
- Gradient reversal: Uses adversarial classifiers and gradient reversal layers (GRL) to eliminate unwanted information in the latent space, such as content or prosody information.
- Detail dropout: Balances decoupling and reconstruction quality by randomly masking representations in the acoustic details subspace.
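The information-bottleneck step above can be sketched as a down-projection followed by nearest-neighbor codebook lookup; because the quantized space is low-dimensional and finite, each code can carry only limited information. A minimal NumPy sketch with illustrative dimensions (the projection matrix, codebook size, and dims are assumptions, not the paper's values):

```python
import numpy as np

def fvq_quantize(h, down_proj, codebook):
    """Information bottleneck: project an encoder frame h (D-dim)
    into a low-dimensional space, then snap it to the nearest
    codebook entry. Returns (code index, quantized vector)."""
    z = h @ down_proj                       # D -> d, with d << D
    d2 = ((codebook - z) ** 2).sum(axis=1)  # squared distance to each entry
    idx = int(d2.argmin())
    return idx, codebook[idx]

rng = np.random.default_rng(0)
D, d, K = 256, 8, 1024  # dims and codebook size are illustrative
down_proj = rng.standard_normal((D, d)) / np.sqrt(D)
codebook = rng.standard_normal((K, d))
idx, zq = fvq_quantize(rng.standard_normal(D), down_proj, codebook)
assert 0 <= idx < K and zq.shape == (d,)
```

Running one such quantizer per attribute subspace (content, prosody, acoustic details) yields the factorized code streams.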
2. Factorized Diffusion Model
- Discrete diffusion: Uses discrete diffusion models to generate representations for each factorized speech attribute, such as duration, content, prosody, and acoustic details.
- Conditional generation: Controls the generation process using prompts related to the corresponding attributes, such as duration prompts, content prompts, and acoustic detail prompts.
- Classifier-free guidance: prompts are randomly dropped during training; at inference, the unconditional and conditional predictions are combined to steer generation toward the prompt, thereby improving generation quality.
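The guidance trick above (train sometimes without prompts, combine both predictions at inference) is classifier-free guidance; at sampling time the model extrapolates from the unconditional output toward the conditional one. A minimal sketch on logits, with a hypothetical guidance weight `w`:

```python
import numpy as np

def guided_logits(logits_cond, logits_uncond, w=1.5):
    """Classifier-free guidance at inference: extrapolate from the
    unconditional prediction toward the conditional one. w = 1
    recovers the purely conditional logits; w > 1 strengthens the
    prompt's influence."""
    return logits_uncond + w * (logits_cond - logits_uncond)

lc = np.array([2.0, 0.5, -1.0])  # logits with the prompt
lu = np.array([1.0, 1.0, 1.0])   # logits without the prompt
assert np.allclose(guided_logits(lc, lu, w=1.0), lc)
```

The same combination applies per diffusion step, before the categorical sampling of each masked token.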
3. Speech Attribute Manipulation
- Attribute prompts: Controls speech attributes by selecting different attribute prompts, such as timbre, prosody, and duration.
- In-context learning: converts text prompts into speech attribute prompts using natural language processing models.
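Because the attributes are factorized into separate code streams, manipulation amounts to assembling the conditioning from different prompt utterances, e.g. one speaker's timbre with another utterance's prosody. A toy sketch with hypothetical code streams (names, shapes, and codebook size are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical factorized codes for two prompt utterances; in a
# factorized codec each attribute lives in its own token stream.
prompt_a = {"timbre": rng.integers(0, 1024, 1),    # global timbre code
            "prosody": rng.integers(0, 1024, 50)}  # per-frame prosody codes
prompt_b = {"timbre": rng.integers(0, 1024, 1),
            "prosody": rng.integers(0, 1024, 50)}

# Mix-and-match conditioning: A's voice with B's prosody.
condition = {"timbre": prompt_a["timbre"], "prosody": prompt_b["prosody"]}
assert (condition["timbre"] == prompt_a["timbre"]).all()
assert (condition["prosody"] == prompt_b["prosody"]).all()
```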
Q: What are the main research findings and achievements?
Speech Quality
- NaturalSpeech 3 achieved speech quality comparable to or better than real recordings on the LibriSpeech test set.
- NaturalSpeech 3 achieved significant improvements in speech quality compared to baseline models.
Speech Similarity
- NaturalSpeech 3 achieved a new SOTA in similarity between synthesized speech and prompt speech.
- NaturalSpeech 3 achieved significant improvements in speech similarity compared to baseline models.
Prosody
- NaturalSpeech 3 achieved significant prosody improvements over other TTS systems, reducing average MCD by 0.16 and raising SMOS by 0.21.
Controllability
- NaturalSpeech 3 can control speech attributes by selecting different attribute prompts, such as timbre, prosody, and duration.
- NaturalSpeech 3 has higher controllability compared to baseline models.
Q: What are the current limitations of this research?
Attribute Coverage
- The current model cannot extract attributes such as background noise.
- Future work will explore more attributes, such as energy and background noise.
Data Coverage
- The model was trained only on English corpora from LibriVox audiobooks.
- Future work will collect more diverse speech data to support multilingual TTS.
Neural Speech Codec
- Requires phoneme transcriptions for content supervision, which limits scalability.
- The decoupling ability has only been verified in zero-shot speech synthesis tasks.
- Future work will explore more general decoupling methods and additional tasks, such as zero-shot voice conversion and automatic speech recognition.