
Audiovisual saliency prediction via deep learning.

Neurocomputing (2021)

Cited by 11
Abstract
Neuroscience studies verify that synchronized audiovisual stimuli elicit a stronger visual-perception response than an independent stimulus. Much research shows that audio signals affect human gaze behavior when viewing natural video scenes. In this paper, we therefore propose a multi-sensory framework that combines audio and visual signals for video saliency prediction. It comprises four modules: auditory feature extraction, visual feature extraction, semantic interaction between the auditory and visual features, and feature fusion. Taking audio and visual signals as inputs, we present a deep-learning network architecture that carries out the tasks of these four modules. It is an end-to-end architecture that enables semantic interaction between the learned audio and visual features. Numerical and visual results show that our method achieves a significant improvement over eleven recent saliency models that disregard audio stimuli, even though some of them are state-of-the-art deep learning models.
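The abstract names the four modules but does not specify their layers, so the following is a minimal, hypothetical PyTorch sketch of such a pipeline. Every design choice here is an assumption for illustration, not the authors' architecture: the convolutional audio and visual branches, the channel-wise gating used to stand in for the semantic-interaction module, the convolutional fusion head, and all dimensions such as feat_dim.

```python
import torch
import torch.nn as nn

class AudioVisualSaliencyNet(nn.Module):
    """Hypothetical sketch of the four-module audiovisual saliency pipeline.
    Layer choices and sizes are assumptions, not the paper's architecture."""

    def __init__(self, feat_dim=128):
        super().__init__()
        # Module 1: auditory feature extraction (assumed: 2D convs over a
        # 1-channel log-mel spectrogram, pooled to a single feature vector).
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # -> (B, feat_dim, 1, 1)
        )
        # Module 2: visual feature extraction (assumed: 2D convs over a frame).
        self.visual_branch = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Module 3: semantic interaction (assumed: the audio vector gates the
        # visual feature map channel-wise, letting sound modulate vision).
        self.gate = nn.Linear(feat_dim, feat_dim)
        # Module 4: feature fusion down to a 1-channel saliency map.
        self.fuse = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),
        )

    def forward(self, frames, spectrogram):
        a = self.audio_branch(spectrogram).flatten(1)     # (B, feat_dim)
        v = self.visual_branch(frames)                    # (B, feat_dim, H', W')
        g = torch.sigmoid(self.gate(a))[:, :, None, None] # channel-wise gates
        return self.fuse(v * g)                           # (B, 1, H', W')


# Smoke test with toy shapes (all sizes are illustrative).
net = AudioVisualSaliencyNet()
frames = torch.randn(2, 3, 112, 112)   # batch of RGB frames
spec = torch.randn(2, 1, 64, 100)      # batch of log-mel spectrograms
print(net(frames, spec).shape)         # torch.Size([2, 1, 28, 28])
```

Channel-wise gating is only one simple way to realize audio-to-visual semantic interaction; the paper's actual interaction module may use a different mechanism, such as cross-modal attention or bilinear pooling.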
Keywords
Audiovisual saliency, Visual attention, Semantic interaction, Deep learning