
AutoAD III: the Prequel -- Back to the Pixels

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Abstract
Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.
Keywords
Language Model, Raw Video, Narrative, Fine-tuned, Visual Features, Large-scale Datasets, Latent Space, Pronouns, Video For Instructions, Audio Files, Visual Scene, External Knowledge, Image Descriptors, Spatial Grid, Short Clips, Text Similarity, Movie Clips, Temporal Alignment, Character Naming, Video Captioning, YouTube, Alignment Pipeline, Training Details