Exploring Language Prior for Mode-Sensitive Visual Attention Modeling

MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020

Cited by 4
Abstract
Modeling the human visual attention mechanism is a fundamental problem in the understanding of human vision, and attention has also been demonstrated to be an important module for multimedia applications such as image captioning and visual question answering. In this paper, we propose a new probabilistic framework for attention and introduce the concept of mode to model the flexibility and adaptability of attention modulation in complex environments. We characterize the correlations between the visual input, the activated mode, the saliency, and the spatial allocation of attention via a graphical model representation, based on which we exploit lingual guidance from captioning data to implement a mode-sensitive attention (MSA) model. The proposed framework explicitly justifies the use of a center bias for fixation prediction and can convert an arbitrary learning-based backbone attention model into a more robust multi-mode version. Experimental results on the York120, MIT1003, and PASCAL datasets demonstrate the effectiveness of the proposed method.
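As a rough illustration of the mode concept described in the abstract, the following is a minimal Python/NumPy sketch of attention as a mixture over latent modes, where the final fixation map marginalizes per-mode saliency maps by the mode probabilities. This is an assumption-laden reading of the graphical model, not the authors' implementation; all function names, shapes, and the two toy modes are hypothetical.

    # Hypothetical sketch: P(loc | image) = sum_m P(mode=m | image) * P(loc | mode=m, image).
    # Not the paper's code; names and shapes are illustrative assumptions.
    import numpy as np

    def mode_sensitive_saliency(mode_probs, per_mode_maps):
        """Combine per-mode saliency maps into a single attention map.

        mode_probs:    (M,) probabilities P(mode = m | image), summing to 1.
        per_mode_maps: (M, H, W) saliency maps P(loc | mode = m, image),
                       each normalized to sum to 1 over locations.
        Returns an (H, W) map P(loc | image).
        """
        mode_probs = np.asarray(mode_probs, dtype=float)
        per_mode_maps = np.asarray(per_mode_maps, dtype=float)
        # Marginalize out the latent mode variable.
        return np.tensordot(mode_probs, per_mode_maps, axes=1)

    # Toy usage: two modes over a 3x3 grid, one concentrated at the center
    # (a center-bias-like mode) and one uniform (an exploratory mode).
    center = np.zeros((3, 3)); center[1, 1] = 1.0
    uniform = np.full((3, 3), 1.0 / 9.0)
    attention = mode_sensitive_saliency([0.7, 0.3], [center, uniform])
    print(attention)

Under this reading, a dominant center-concentrated mode yields the center bias the paper says its framework justifies, while the mode probabilities (which the paper infers with lingual guidance from captioning data) let the model adapt the spatial allocation of attention to the input.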
Keywords
Language Prior, Caption Semantics, Multi-Mode Attention