Speech driven video editing via an audio-conditioned diffusion model

IMAGE AND VISION COMPUTING (2024)

Abstract
Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person and a separate auditory speech recording, the lip and jaw motions are re-synchronised without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel-spectral features to generate synchronised facial motion. Proof-of-concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the CREMA-D audiovisual dataset. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven video editing. All code, datasets, and models used as part of this work are made publicly available here: https://danbigioi.github.io/DiffusionVideoEditing/.
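As a rough illustration of the approach described in the abstract, the sketch below shows how a denoising diffusion model can be conditioned on mel-spectrogram audio features to predict the noise added to a video frame. This is a minimal PyTorch sketch, not the authors' implementation: the `AudioConditionedDenoiser` architecture, tensor shapes, noise schedule, and per-frame mel features are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): DDPM-style training step
# for a denoiser conditioned on mel-spectrogram audio features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioConditionedDenoiser(nn.Module):
    """Toy stand-in for the denoising network: predicts the noise added to a
    frame, conditioned on per-frame mel features and the diffusion timestep."""

    def __init__(self, frame_channels=3, mel_bins=80, hidden=64):
        super().__init__()
        self.audio_proj = nn.Linear(mel_bins, hidden)   # embed mel features
        self.time_proj = nn.Linear(1, hidden)           # embed diffusion timestep
        self.in_conv = nn.Conv2d(frame_channels, hidden, 3, padding=1)
        self.mid_conv = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.out_conv = nn.Conv2d(hidden, frame_channels, 3, padding=1)

    def forward(self, noisy_frame, mel, t):
        # noisy_frame: (B, C, H, W); mel: (B, mel_bins); t: (B,) in [0, 1]
        cond = self.audio_proj(mel) + self.time_proj(t.unsqueeze(-1))  # (B, hidden)
        h = F.silu(self.in_conv(noisy_frame))
        h = h + cond[:, :, None, None]          # broadcast conditioning over space
        h = F.silu(self.mid_conv(h))
        return self.out_conv(h)                 # predicted noise, same shape as frame


def diffusion_training_step(model, frame, mel, num_steps=1000):
    """One training step: noise the frame at a random timestep, regress the noise."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    b = frame.shape[0]
    t = torch.randint(0, num_steps, (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)

    noise = torch.randn_like(frame)
    noisy = a_bar.sqrt() * frame + (1.0 - a_bar).sqrt() * noise

    pred = model(noisy, mel, t.float() / num_steps)
    return F.mse_loss(pred, noise)


if __name__ == "__main__":
    model = AudioConditionedDenoiser()
    frames = torch.randn(4, 3, 64, 64)   # dummy video frames
    mels = torch.randn(4, 80)            # dummy per-frame mel features
    loss = diffusion_training_step(model, frames, mels)
    loss.backward()
    print(f"toy loss: {loss.item():.4f}")
```

At sampling time, the same conditioning would be applied at every reverse-diffusion step so that the generated mouth region follows the new audio; details such as masking the edited region and conditioning on reference frames follow the paper and its public code, not this sketch.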
Keywords
Video editing, Talking head generation, Generative AI, Diffusion models, Dubbing