
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Computing Research Repository (CoRR) (2024)

University of Science and Technology of China | Microsoft Research | The Chinese University of Hong Kong | Microsoft | The University of Tokyo | Xi'an Jiaotong University | Microsoft Research Asia | Zhejiang University | Peking University | Tsinghua University

Abstract
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model the intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
Key words
Acoustic Modeling, Speaker Diarization, Speech Enhancement, Audio-Visual Speech Recognition, Noise Reduction
Chat Paper

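[Key points]: This paper proposes NaturalSpeech 3, a zero-shot speech synthesis system that combines factorized diffusion models with a factorized codec to improve speech quality, similarity, prosody, and intelligibility.

[Method]: The system consists of a neural codec with factorized vector quantization and a factorized diffusion model. The codec disentangles the speech waveform into subspaces of content, prosody, timbre, and acoustic details, and the diffusion model generates the attributes in each subspace following its corresponding prompt.

[Experiments]: The results show that NaturalSpeech 3 outperforms state-of-the-art text-to-speech systems on speech quality, similarity, prosody, and intelligibility. Scaling to 1B parameters and 200K hours of training data improves performance further. The datasets used are not mentioned in the text.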
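
The idea of quantizing each attribute subspace with its own codebook can be illustrated with a small sketch. This is not the NaturalSpeech 3 codec: the page gives no implementation details, so the equal-width slicing of the latent, the random codebooks, the dimensions, and the names `make_codebooks`, `quantize`, and `factorized_quantize` are all assumptions made purely for illustration. It only shows nearest-neighbor vector quantization applied independently to four subspaces.

```python
import numpy as np

# The four attribute subspaces named in the abstract.
ATTRIBUTES = ("content", "prosody", "timbre", "acoustic_details")


def make_codebooks(num_codes, dim, seed=0):
    """Create one random codebook per attribute (hypothetical; a real codec learns them)."""
    rng = np.random.default_rng(seed)
    return {name: rng.normal(size=(num_codes, dim)) for name in ATTRIBUTES}


def quantize(frames, codebook):
    """Map each frame (T, dim) to its nearest codebook entry (K, dim)."""
    # Squared Euclidean distance between every frame and every code: shape (T, K).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


def factorized_quantize(latent, codebooks):
    """Quantize each attribute slice of the latent with its own codebook.

    Assumption: the latent (T, D) is split into equal-width slices, one per
    attribute. The actual system may separate the subspaces differently.
    """
    if latent.shape[1] % len(ATTRIBUTES) != 0:
        raise ValueError("latent width must be divisible by the number of attributes")
    dim = latent.shape[1] // len(ATTRIBUTES)
    tokens, parts = {}, []
    for i, name in enumerate(ATTRIBUTES):
        sub = latent[:, i * dim:(i + 1) * dim]
        idx, quantized = quantize(sub, codebooks[name])
        tokens[name] = idx          # discrete token sequence for this attribute
        parts.append(quantized)     # quantized vectors for this attribute
    return tokens, np.concatenate(parts, axis=1)
```

Calling `factorized_quantize(latent, make_codebooks(16, 8))` on a `(T, 32)` array returns one token sequence per attribute plus the concatenated quantized latent. In this toy setup, a generative model could then predict each attribute's tokens separately, which is the divide-and-conquer intuition the abstract describes.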