FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio
CoRR (2024)
Abstract
In this paper, we abstract the process by which people hear speech, extract
meaningful cues, and create various dynamically audio-consistent talking
faces, termed Listening and Imagining, into the task of high-fidelity, diverse
talking face generation from a single audio input. Specifically, it involves two
critical challenges: one is to effectively decouple identity, content, and
emotion from entangled audio, and the other is to maintain intra-video
diversity and inter-video consistency. To tackle these issues, we first examine
the intricate relationships among facial factors to simplify the decoupling
process, tailoring a Progressive Audio Disentanglement for accurate facial
geometry and semantics learning, where each stage incorporates a customized
training module responsible for a specific factor. Second, to achieve
visually diverse and audio-synchronized animation solely from input audio
within a single model, we introduce Controllable Coherent Frame generation,
which involves the flexible integration of three trainable adapters with frozen
Latent Diffusion Models (LDMs) to focus on maintaining facial geometry and
semantics, as well as texture and temporal coherence between frames. In this
way, we inherit high-quality diverse generation from LDMs while significantly
improving their controllability at a low training cost. Extensive experiments
demonstrate the flexibility and effectiveness of our method in handling this
paradigm. The code will be released at
https://github.com/modelscope/facechain.