VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics
IEEE/ACM Transactions on Audio, Speech, and Language Processing (2020)
Abstract
In this paper, we propose a non-parallel any-to-many voice conversion (VC)
method termed VoiceGrad. Inspired by WaveGrad, a recently introduced
waveform generation method, VoiceGrad is based on the concepts of score
matching and Langevin dynamics. It uses weighted denoising score matching to
train a score approximator, a fully convolutional network with a U-Net
structure designed to predict the gradient of the log density of the speech
feature sequences of multiple speakers. It then performs VC by using annealed
Langevin dynamics to iteratively update an input feature sequence towards the
nearest stationary point of the target distribution based on the trained score
approximator network. Thanks to the nature of this concept, VoiceGrad enables
any-to-many VC, a VC scenario in which the speaker of input speech can be
arbitrary, and allows for non-parallel training, which requires no parallel
utterances or transcriptions.
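
To make the sampling idea concrete, the following is a minimal sketch of a generic annealed Langevin dynamics loop, written in the form common in the score-based modelling literature. It is not the authors' implementation: the abstract does not give the exact update rule, step-size schedule, or noise levels, so `eps`, `steps_per_level`, and the geometric `sigmas` schedule are assumptions, and `toy_score` is a hypothetical analytic stand-in for the trained U-Net score approximator (it returns the exact score of a 1-D Gaussian target after noise perturbation).

```python
import numpy as np


def toy_score(x, sigma, mu=3.0, s=0.1):
    """Hypothetical stand-in for a trained score network.

    Returns the exact gradient of the log density of N(mu, s^2)
    after it has been perturbed with Gaussian noise of scale sigma,
    i.e. the score of N(mu, s^2 + sigma^2).
    """
    return -(x - mu) / (s**2 + sigma**2)


def annealed_langevin(x0, score_fn, sigmas, eps=2e-5, steps_per_level=100, rng=None):
    """Generic annealed Langevin dynamics (assumed form, not VoiceGrad-specific).

    sigmas must be ordered from the largest to the smallest noise level.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for sigma in sigmas:
        # Step size shrinks with the noise level (assumed schedule).
        alpha = eps * sigma**2 / sigmas[-1] ** 2
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            # Move along the estimated score, plus injected Gaussian noise.
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(alpha) * z
    return x


# Start far from the target (all zeros) and anneal towards it.
sigmas = np.geomspace(1.0, 0.01, num=10)
x0 = np.zeros(2000)
samples = annealed_langevin(x0, toy_score, sigmas, rng=np.random.default_rng(0))
```

In a VC setting, `x0` would play the role of the input speaker's feature sequence and `score_fn` would be the trained network conditioned on the target speaker; both of those roles are replaced here by toy placeholders so that the loop can be run on its own.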
Keywords
Voice conversion (VC), non-parallel VC, any-to-many VC, score matching, Langevin dynamics, diffusion models