CAVAN: Commonsense Knowledge Anchored Video Captioning.

ICPR (2022)

Abstract
A video clip carries not merely an aggregation of static entities, but also a variety of interactions and relations among those entities. Challenges still remain for a video captioning system to generate descriptions that focus on the prominent interest and align with latent aspects beyond direct observation. In this work, we present a Commonsense knowledge Anchored Video cAptioNing (dubbed CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of a video captioning model through a novel paradigm for sentence-level semantic alignment. Specifically, we acquire commonsense knowledge complementing each training caption by querying a generic knowledge atlas (ATOMIC [1]), and form a commonsense-caption entailment corpus. A BERT [2] based language entailment model trained on this corpus then serves as a commonsense discriminator during the training of the video captioning model, penalizing it for generating semantically misaligned captions. Experimental results with ablations on the MSR-VTT [3], V2C [4] and VATEX [5] datasets validate the effectiveness of CAVAN and show that the use of commonsense knowledge benefits video caption generation.
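
The abstract describes a BERT-based entailment discriminator that scores whether a generated caption is entailed by the commonsense knowledge retrieved for the corresponding training caption. The sketch below is a minimal illustration of that idea, not the authors' implementation: the checkpoint name, the label convention (1 = entailed), and the weighting factor `lambda_cs` are assumptions for demonstration only.

```python
# Minimal sketch: a BERT entailment model used as a commonsense discriminator
# that penalizes semantically misaligned captions (hedged illustration).
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Assumed to be fine-tuned on the commonsense-caption entailment corpus,
# with label 1 = "entailed" and label 0 = "not entailed".
discriminator = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
discriminator.eval()


def commonsense_penalty(commonsense_facts, generated_captions):
    """Negative log-probability that each generated caption is entailed by
    the commonsense inference queried (e.g. from ATOMIC) for that caption."""
    inputs = tokenizer(
        commonsense_facts,
        generated_captions,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    with torch.no_grad():
        logits = discriminator(**inputs).logits
    p_entail = F.softmax(logits, dim=-1)[:, 1]
    return -torch.log(p_entail + 1e-8)


# Usage sketch: since captions are discrete text, this penalty would act as a
# sentence-level reward/penalty signal (e.g. in a self-critical setup) added
# to the usual cross-entropy captioning loss; lambda_cs is hypothetical.
# total_loss = caption_xent_loss + lambda_cs * commonsense_penalty(facts, captions).mean()
```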
Keywords
BERT [2] based language entailment model, CAVAN, commonsense discriminator, Commonsense knowledge Anchored Video cAptioNing, commonsense knowledge benefits video caption generation, commonsense-caption entailment corpus, generic knowledge atlas, inferential commonsense knowledge, semantically misaligned captions, training caption, video captioning model, video captioning system, video clip