EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis
arXiv (2024)
Abstract
Achieving disentangled control over multiple facial motions and accommodating
diverse input modalities greatly enhances the application and entertainment
value of talking head generation. This necessitates a deep exploration of the
decoupling space for facial features, ensuring that they a) operate
independently without mutual interference and b) can be preserved and shared
across different input modalities, two aspects often neglected in existing
methods. To
address this gap, this paper proposes a novel Efficient Disentanglement
framework for Talking head generation (EDTalk). Our framework enables
individual manipulation of mouth shape, head pose, and emotional expression,
conditioned on video or audio inputs. Specifically, we employ three lightweight
modules to decompose the facial dynamics into three distinct latent spaces
representing mouth, pose, and expression, respectively. Each space is
characterized by a set of learnable bases whose linear combinations define
specific motions. To ensure independence and accelerate training, we enforce
orthogonality among bases and devise an efficient training strategy to allocate
motion responsibilities to each space without relying on external knowledge.
The learned bases are then stored in corresponding banks, enabling shared
visual priors with audio input. Furthermore, considering the properties of each
space, we propose an Audio-to-Motion module for audio-driven talking head
synthesis. Experiments are conducted to demonstrate the effectiveness of
EDTalk. We recommend watching the project website:
https://tanshuai0219.github.io/EDTalk/
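The core mechanism the abstract describes, a bank of learnable bases per latent space, with motions formed as linear combinations and an orthogonality constraint keeping the bases independent, can be sketched as follows. This is a minimal illustrative example, not the EDTalk implementation; all names, shapes, and the specific penalty form are assumptions.

```python
import numpy as np

# Hypothetical sketch of one "bank" of learnable motion bases: each latent
# space (mouth / pose / expression) holds K bases of dimension D. A specific
# motion is a linear combination of the bases, and an orthogonality penalty
# encourages the bases to stay mutually independent during training.
# Shapes and the Frobenius-norm penalty are illustrative choices.

rng = np.random.default_rng(0)
K, D = 8, 16                        # number of bases, latent dimension

bases = rng.standard_normal((K, D))   # learnable bank B (one per space)
weights = rng.standard_normal(K)      # predicted combination coefficients w

motion = weights @ bases              # latent motion = sum_k w_k * b_k

# Orthogonality penalty: || B B^T - I ||_F^2, driven toward zero so that
# each basis captures a distinct direction of facial dynamics.
gram = bases @ bases.T
ortho_loss = np.sum((gram - np.eye(K)) ** 2)

print(motion.shape)                   # combined motion vector in R^D
```

In the paper's setting, three such banks (mouth, pose, expression) are trained jointly, and because the banks store the learned bases, the same visual priors can be reused when the combination weights are predicted from audio instead of video.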