Generating Video Descriptions with Latent Topic Guidance

IEEE Transactions on Multimedia(2019)

Abstract
Automatic video description generation (a.k.a. video captioning) is one of the ultimate goals of video understanding. Despite its wide range of applications, such as video indexing and retrieval, video captioning remains challenging because of the complexity and diversity of video content. First, open-domain videos cover a broad range of topics, which leads to highly variable vocabularies and expression styles for describing the content. Second, videos naturally contain multiple modalities, including image, motion, and acoustic media, and the information each modality contributes varies with the conditions. In this paper, we propose a novel topic-guided video captioning model to address these challenges. Our model consists of two joint tasks: latent topic generation and topic-guided caption generation. The topic generation task aims to automatically predict the latent topic of the video. Since there is no ground-truth topic information, we mine multimodal topics in an unsupervised fashion from video contents and annotated captions, and then distill the topic distribution into a topic prediction model. In the topic-guided generation task, we employ topic guidance for two purposes. The first is to narrow down the language complexity across topics: we propose a topic-aware decoder that leverages the latent topics to induce topic-related language models. The decoder is generic and can be integrated with a temporal attention mechanism. The second is to dynamically attend to important modalities by topic: we propose a flexible topic-guided multimodal ensemble framework and use a topic gating network to determine the attention weights. The two tasks are correlated and collaborate to generate more detailed and accurate video captions.
Our extensive experiments on two public benchmark datasets, MSR-VTT and Youtube2Text, demonstrate the effectiveness of the proposed topic-guided video captioning system, which achieves state-of-the-art performance on both datasets.
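
The topic-guided multimodal ensemble described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the gating network's form, so the linear gate followed by a softmax, the function names (`softmax`, `topic_gated_ensemble`), the parameter matrix `gate_weights`, and all toy numbers below are assumptions chosen for illustration. The sketch only shows the general idea that a predicted topic distribution produces attention weights over modalities, which then mix the per-modality next-word distributions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_gated_ensemble(topic_probs, gate_weights, modality_word_probs):
    """Mix per-modality word distributions using topic-dependent weights.

    topic_probs:         length-K predicted latent topic distribution
    gate_weights:        K x M matrix (hypothetical gating parameters)
    modality_word_probs: M x V matrix, one next-word distribution per modality
    """
    num_topics = len(topic_probs)
    num_modalities = len(modality_word_probs)
    vocab_size = len(modality_word_probs[0])

    # Assumed gate: a linear map of the topic distribution, then a softmax,
    # giving one attention weight per modality.
    scores = [
        sum(topic_probs[k] * gate_weights[k][m] for k in range(num_topics))
        for m in range(num_modalities)
    ]
    weights = softmax(scores)

    # Weighted average of the modality-specific word distributions.
    fused = [
        sum(weights[m] * modality_word_probs[m][v] for m in range(num_modalities))
        for v in range(vocab_size)
    ]
    return weights, fused


# Toy example: 2 topics, 3 modalities (image, motion, audio), 3-word vocabulary.
topic_probs = [0.9, 0.1]
gate_weights = [
    [2.0, 0.0, -1.0],   # topic 0 favors the image modality
    [-1.0, 0.0, 2.0],   # topic 1 favors the audio modality
]
modality_word_probs = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
]
weights, fused = topic_gated_ensemble(topic_probs, gate_weights, modality_word_probs)
```

Because the modality weights sum to one and each modality contributes a valid distribution, the fused output is itself a valid distribution over the vocabulary. In this toy setup the dominant topic shifts most of the weight onto the first modality.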
Keywords
Task analysis, Decoding, Predictive models, Sports, Acoustics, Visualization, Semantics