HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts
arXiv (2024)
Abstract
We introduce HybridVC, a voice conversion (VC) framework built upon a
pre-trained conditional variational autoencoder (CVAE) that combines the
strengths of a latent model with contrastive learning. HybridVC supports text
and audio prompts, enabling more flexible voice style conversion. HybridVC
models a latent distribution conditioned on speaker embeddings obtained from a
pre-trained speaker encoder and, in parallel, optimises style text embeddings
to align with the speaker style information through contrastive learning.
Because it builds on pre-trained components and trains these two objectives in
parallel, HybridVC can be trained efficiently under limited computational
resources. Our experiments demonstrate HybridVC's superior training efficiency
and its capability for advanced multi-modal voice style conversion. This
underscores its potential for widespread applications such as user-defined
personalised voice in various social media platforms. A comprehensive ablation
study further validates the effectiveness of our method.
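
The abstract does not specify which contrastive objective aligns the style text embeddings with the speaker embeddings. As a rough illustration only, the sketch below assumes a symmetric InfoNCE loss of the kind used in CLIP-style training, where the i-th text embedding and the i-th speaker embedding in a batch form the positive pair. The function names, the temperature value of 0.07, and the plain-list representation of embeddings are all hypothetical choices made for this example, not details taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_alignment_loss(text_emb, speaker_emb, temperature=0.07):
    # Assumed objective (not confirmed by the source): symmetric InfoNCE.
    # text_emb[i] and speaker_emb[i] are treated as a matching pair.
    t = [l2_normalize(v) for v in text_emb]
    s = [l2_normalize(v) for v in speaker_emb]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(ti, sj)) / temperature for sj in s]
        for ti in t
    ]
    logits_t = [list(col) for col in zip(*logits)]  # speaker-to-text direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

Under this assumed loss, a batch in which each text embedding points in the same direction as its paired speaker embedding yields a value near zero, while a batch with the pairs swapped yields a large value, which is the behaviour an alignment objective needs.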