TA2V: Text-Audio Guided Video Generation

Minglu Zhao, Wenmin Wang, Tongbao Chen, Rui Zhang, Ruochen Li

IEEE Transactions on Multimedia (2024)

Abstract
Recent conditional and unconditional video generation tasks have been accomplished mainly with generative adversarial network (GAN), diffusion, and autoregressive models. However, in some circumstances, a single conditioning modality cannot provide enough semantic information. Therefore, in this paper, we propose text-audio to video (TA2V) generation, a new task for generating realistic videos from two guiding modalities, text and audio, which has not been explored much thus far. Compared to image generation, video generation is a harder task because of the complexity of processing higher-dimensional data and the scarcity of suitable datasets, especially for multimodal video generation. To overcome these limitations, (i) we propose the Text&Audio-guided-Video-Maker (TAgVM) model, which consists of two modules: a text-guided video generator and a text&audio-guided video modifier. (ii) This model uses a 3D VQ-GAN to compress high-dimensional video data into a low-dimensional discrete sequence, followed by an autoregressive model that performs text-conditional generation in the latent space. Then, we apply a text&audio-guided diffusion model to the generated video scenes, adding semantic details corresponding to the audio and text. (iii) We introduce a newly produced music performance video dataset, the University of Rochester Multimodal Music Performance with Video-Audio-Text (URMP-VAT), and a landscape dataset, Landscape with Video-Audio-Text (Landscape-VAT), both of which include three mutually aligned modalities (text, audio, and video). The results demonstrate that our model can create videos with satisfactory quality and semantic information. The source code and datasets are available at https://github.com/Minglu58/TA2V.
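The two-stage pipeline described in the abstract (3D VQ-GAN tokenization, autoregressive text-conditional generation in the latent space, then text&audio-guided diffusion refinement) can be sketched at the interface level as below. This is a minimal toy illustration, not the authors' implementation: every function name, the codebook size, and the "refinement" arithmetic are placeholder assumptions standing in for the real networks.

```python
# Hypothetical sketch of the TAgVM two-stage pipeline; all names and the
# toy arithmetic are illustrative assumptions, not the authors' API.

def vqgan_encode(video_frames):
    """Stage 1 (training-time): a 3D VQ-GAN compresses high-dimensional
    video into a low-dimensional discrete token sequence.
    Here: trivial quantization into a toy 16-entry codebook."""
    return [int(f) % 16 for f in video_frames]

def autoregressive_generate(text_tokens, seq_len):
    """Stage 1 (inference): an autoregressive model predicts video tokens
    in the latent space, conditioned on the text prompt and on the
    previously generated tokens (toy deterministic rule)."""
    tokens = []
    for t in range(seq_len):
        prev = tokens[-1] if tokens else 0
        tokens.append((prev + len(text_tokens) + t) % 16)
    return tokens

def diffusion_refine(video_tokens, text_tokens, audio_feats):
    """Stage 2: a text&audio-guided diffusion model modifies the generated
    video, injecting semantic detail from both modalities (toy shift)."""
    bias = (len(text_tokens) + len(audio_feats)) % 16
    return [(v + bias) % 16 for v in video_tokens]

# End-to-end flow: text -> latent video tokens -> text&audio-refined tokens
text = ["a", "violinist", "performing"]   # placeholder text tokens
audio = [0.1, 0.5, 0.9]                   # placeholder audio features
latent = autoregressive_generate(text, seq_len=8)
refined = diffusion_refine(latent, text, audio)
```

The key design point the sketch mirrors is that generation happens entirely in the discrete latent space produced by the 3D VQ-GAN, which is what makes the high-dimensional video tractable for both the autoregressive and diffusion stages.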
Keywords
Multimodal video generation, text-audio to video, VQ-GAN, diffusion, deep learning