Saliency-Based Spatiotemporal Attention for Video Captioning

2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM)(2018)

Abstract
Most existing video captioning methods ignore the visual saliency information in videos. We hypothesize that saliency information can help generate more accurate video captions. We therefore propose a saliency-based spatiotemporal attention mechanism and integrate it into the encoder-decoder framework of a classical video captioning model. In particular, we design a residual block that uses saliency information to better extract visual features from video frames. We evaluate our method on the MSVD dataset, and the results show that exploiting visual saliency information improves video captioning performance. Specifically, compared with the traditional temporal attention method, our saliency-based temporal attention model improves the METEOR and CIDEr metrics by 3.4% and 22.5%, respectively; with the full saliency-based spatiotemporal attention mechanism, we further improve METEOR and CIDEr by 4.5% and 23.1%, respectively.
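The abstract gives no implementation details, but the idea of biasing temporal attention toward salient frames can be sketched as follows. This is a minimal illustration only: the per-frame saliency scores, the log-bias fusion rule, and all function names and shapes are assumptions for exposition, not the paper's actual formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def saliency_temporal_attention(frame_feats, saliency, h_dec, W):
    """Saliency-biased temporal attention (illustrative sketch).

    frame_feats: (T, D) per-frame visual features from the encoder
    saliency:    (T,) per-frame saliency scores in (0, 1]
                 (e.g., mean activation of a saliency map; assumed input)
    h_dec:       (D,) current decoder hidden state
    W:           (D, D) attention projection matrix
    """
    # Relevance of each frame to the current decoder state.
    scores = frame_feats @ (W @ h_dec)
    # Fold saliency in as an additive log-bias, so more salient
    # frames receive proportionally larger attention weights.
    scores = scores + np.log(saliency + 1e-8)
    alpha = softmax(scores)            # attention weights over time, sums to 1
    context = alpha @ frame_feats      # saliency-weighted context vector (D,)
    return context, alpha
```

The decoder would consume `context` at each step when predicting the next caption word; the spatial half of the paper's mechanism would apply an analogous saliency-weighted pooling within each frame before temporal attention.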
Keywords
Video Captioning, Spatiotemporal Attention, Visual Saliency