Exploring Language Prior for Mode-Sensitive Visual Attention Modeling

MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020

Cited by 4
Abstract
Modeling the human visual attention mechanism is a fundamental problem in the understanding of human vision, and attention has also been demonstrated to be an important module for multimedia applications such as image captioning and visual question answering. In this paper, we propose a new probabilistic framework for attention and introduce the concept of mode to model the flexibility and adaptability of attention modulation in complex environments. We characterize the correlations between the visual input, the activated mode, the saliency, and the spatial allocation of attention via a graphical model representation, based on which we exploit lingual guidance from captioning data to implement a mode-sensitive attention (MSA) model. The proposed framework explicitly justifies the use of a center bias for fixation prediction and can convert an arbitrary learning-based backbone attention model into a more robust multi-mode version. Experimental results on the York120, MIT1003, and PASCAL datasets demonstrate the effectiveness of the proposed method.
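As a rough illustration of the mode concept described in the abstract, the following is a minimal Python/NumPy sketch of attention as a mixture over latent modes, where the final fixation map marginalizes per-mode saliency maps by the mode probabilities. This is an assumption-laden reading of the graphical model, not the authors' implementation; all function names, shapes, and the two toy modes are hypothetical.

    # Hypothetical sketch: P(loc | image) = sum_m P(mode=m | image) * P(loc | mode=m, image).
    # Not the paper's code; names and shapes are illustrative assumptions.
    import numpy as np

    def mode_sensitive_saliency(mode_probs, per_mode_maps):
        """Combine per-mode saliency maps into a single attention map.

        mode_probs:    (M,) probabilities P(mode = m | image), summing to 1.
        per_mode_maps: (M, H, W) saliency maps P(loc | mode = m, image),
                       each normalized to sum to 1 over locations.
        Returns an (H, W) map P(loc | image).
        """
        mode_probs = np.asarray(mode_probs, dtype=float)
        per_mode_maps = np.asarray(per_mode_maps, dtype=float)
        # Marginalize out the latent mode variable.
        return np.tensordot(mode_probs, per_mode_maps, axes=1)

    # Toy usage: two modes over a 3x3 grid, one concentrated at the center
    # (a center-bias-like mode) and one uniform (an exploratory mode).
    center = np.zeros((3, 3)); center[1, 1] = 1.0
    uniform = np.full((3, 3), 1.0 / 9.0)
    attention = mode_sensitive_saliency([0.7, 0.3], [center, uniform])
    print(attention)

Under this reading, a dominant center-concentrated mode yields the center bias the paper says its framework justifies, while the mode probabilities (which the paper infers with lingual guidance from captioning data) let the model adapt the spatial allocation of attention to the input.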
Keywords
Language Prior, Caption Semantics, Multi-Mode Attention