Saliency-Based Spatiotemporal Attention for Video Captioning

2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM)(2018)

Abstract
Most existing video captioning methods ignore the visual saliency information in videos. We hypothesize that saliency information can help generate more accurate video captions. We therefore propose a saliency-based spatiotemporal attention mechanism and integrate it into the encoder-decoder framework of a classical video captioning model. In particular, we design a residual block that uses saliency information to better extract visual features from video frames. We evaluate our method on the MSVD dataset, and the results show that exploiting visual saliency information improves video captioning performance. Specifically, compared with the traditional temporal attention method, our saliency-based temporal attention model improves the METEOR and CIDEr metrics by 3.4% and 22.5%, respectively; with the full saliency-based spatiotemporal attention mechanism, we further improve METEOR and CIDEr by 4.5% and 23.1%, respectively.
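The abstract gives no implementation details, but the idea of biasing temporal attention toward salient frames can be sketched as follows. This is a minimal illustration only: the per-frame saliency scores, the log-bias fusion rule, and all function names and shapes are assumptions for exposition, not the paper's actual formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def saliency_temporal_attention(frame_feats, saliency, h_dec, W):
    """Saliency-biased temporal attention (illustrative sketch).

    frame_feats: (T, D) per-frame visual features from the encoder
    saliency:    (T,) per-frame saliency scores in (0, 1]
                 (e.g., mean activation of a saliency map; assumed input)
    h_dec:       (D,) current decoder hidden state
    W:           (D, D) attention projection matrix
    """
    # Relevance of each frame to the current decoder state.
    scores = frame_feats @ (W @ h_dec)
    # Fold saliency in as an additive log-bias, so more salient
    # frames receive proportionally larger attention weights.
    scores = scores + np.log(saliency + 1e-8)
    alpha = softmax(scores)            # attention weights over time, sums to 1
    context = alpha @ frame_feats      # saliency-weighted context vector (D,)
    return context, alpha
```

The decoder would consume `context` at each step when predicting the next caption word; the spatial half of the paper's mechanism would apply an analogous saliency-weighted pooling within each frame before temporal attention.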
Keywords
Video Captioning, Spatiotemporal Attention, Visual Saliency