AutoAD III: the Prequel -- Back to the Pixels

Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

Abstract

Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.

Keywords

Language Model, Raw Video, Narrative, Fine-tuned, Visual Features, Large-scale Datasets, Latent Space, Pronouns, Video For Instructions, Audio Files, Visual Scene, External Knowledge, Image Descriptors, Spatial Grid, Short Clips, Text Similarity, Movie Clips, Temporal Alignment, Character Naming, Video Captioning, YouTube, Alignment Pipeline, Training Details
