
Audiovisual saliency prediction via deep learning.

Neurocomputing (2021)

Cited by 11
Abstract
Neuroscience studies verify that synchronized audiovisual stimuli elicit a stronger visual-perception response than an independent stimulus. Much research shows that audio signals affect human gaze behavior when viewing natural video scenes. In this paper, we therefore propose a multi-sensory framework that combines audio and visual signals for video saliency prediction. It comprises four modules: auditory feature extraction, visual feature extraction, semantic interaction between the auditory and visual features, and feature fusion. Taking audio and visual signals as inputs, we present a deep-learning network architecture that carries out the tasks of these four modules. It is an end-to-end architecture that enables semantic interaction between the learned audio and visual features. Numerical and visual results show that our method achieves a significant improvement over eleven recent saliency models that disregard audio stimuli, even though some of them are state-of-the-art deep learning models.
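The abstract names the four modules but does not specify their layers, so the following is a minimal, hypothetical PyTorch sketch of such a pipeline. Every design choice here is an assumption for illustration, not the authors' architecture: the convolutional audio and visual branches, the channel-wise gating used to stand in for the semantic-interaction module, the convolutional fusion head, and all dimensions such as feat_dim.

```python
import torch
import torch.nn as nn

class AudioVisualSaliencyNet(nn.Module):
    """Hypothetical sketch of the four-module audiovisual saliency pipeline.
    Layer choices and sizes are assumptions, not the paper's architecture."""

    def __init__(self, feat_dim=128):
        super().__init__()
        # Module 1: auditory feature extraction (assumed: 2D convs over a
        # 1-channel log-mel spectrogram, pooled to a single feature vector).
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # -> (B, feat_dim, 1, 1)
        )
        # Module 2: visual feature extraction (assumed: 2D convs over a frame).
        self.visual_branch = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Module 3: semantic interaction (assumed: the audio vector gates the
        # visual feature map channel-wise, letting sound modulate vision).
        self.gate = nn.Linear(feat_dim, feat_dim)
        # Module 4: feature fusion down to a 1-channel saliency map.
        self.fuse = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),
        )

    def forward(self, frames, spectrogram):
        a = self.audio_branch(spectrogram).flatten(1)     # (B, feat_dim)
        v = self.visual_branch(frames)                    # (B, feat_dim, H', W')
        g = torch.sigmoid(self.gate(a))[:, :, None, None] # channel-wise gates
        return self.fuse(v * g)                           # (B, 1, H', W')


# Smoke test with toy shapes (all sizes are illustrative).
net = AudioVisualSaliencyNet()
frames = torch.randn(2, 3, 112, 112)   # batch of RGB frames
spec = torch.randn(2, 1, 64, 100)      # batch of log-mel spectrograms
print(net(frames, spec).shape)         # torch.Size([2, 1, 28, 28])
```

Channel-wise gating is only one simple way to realize audio-to-visual semantic interaction; the paper's actual interaction module may use a different mechanism, such as cross-modal attention or bilinear pooling.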
Keywords
Audiovisual saliency, Visual attention, Semantic interaction, Deep learning