CAVAN: Commonsense Knowledge Anchored Video Captioning.

ICPR (2022)

Abstract
A video clip carries not merely an aggregation of static entities, but also a variety of interactions and relations among those entities. Challenges still remain for a video captioning system to generate descriptions that focus on the prominent interest and align with latent aspects beyond direct observation. In this work, we present a Commonsense knowledge Anchored Video cAptioNing (dubbed CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of a video captioning model through a novel paradigm for sentence-level semantic alignment. Specifically, we acquire commonsense knowledge complementing each training caption by querying a generic knowledge atlas (ATOMIC [1]), and form a commonsense-caption entailment corpus. A BERT [2] based language entailment model trained on this corpus then serves as a commonsense discriminator during the training of the video captioning model, penalizing it for generating semantically misaligned captions. Experimental results with ablations on the MSR-VTT [3], V2C [4] and VATEX [5] datasets validate the effectiveness of CAVAN and show that the use of commonsense knowledge benefits video caption generation.
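
The abstract describes a BERT-based entailment discriminator that scores whether a generated caption is entailed by the commonsense knowledge retrieved for the corresponding training caption. The sketch below is a minimal illustration of that idea, not the authors' implementation: the checkpoint name, the label convention (1 = entailed), and the weighting factor `lambda_cs` are assumptions for demonstration only.

```python
# Minimal sketch: a BERT entailment model used as a commonsense discriminator
# that penalizes semantically misaligned captions (hedged illustration).
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Assumed to be fine-tuned on the commonsense-caption entailment corpus,
# with label 1 = "entailed" and label 0 = "not entailed".
discriminator = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
discriminator.eval()


def commonsense_penalty(commonsense_facts, generated_captions):
    """Negative log-probability that each generated caption is entailed by
    the commonsense inference queried (e.g. from ATOMIC) for that caption."""
    inputs = tokenizer(
        commonsense_facts,
        generated_captions,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    with torch.no_grad():
        logits = discriminator(**inputs).logits
    p_entail = F.softmax(logits, dim=-1)[:, 1]
    return -torch.log(p_entail + 1e-8)


# Usage sketch: since captions are discrete text, this penalty would act as a
# sentence-level reward/penalty signal (e.g. in a self-critical setup) added
# to the usual cross-entropy captioning loss; lambda_cs is hypothetical.
# total_loss = caption_xent_loss + lambda_cs * commonsense_penalty(facts, captions).mean()
```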
Keywords
BERT [2] based language entailment model, CAVAN, commonsense discriminator, Commonsense knowledge Anchored Video cAptioNing, commonsense knowledge benefits video caption generation, commonsense-caption entailment corpus, generic knowledge atlas, inferential commonsense knowledge, semantically misaligned captions, training caption, video captioning model, video captioning system, video clip