Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization

MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020

Cited by 79 | Viewed 257 times
Abstract
We address the challenging task of event localization, which requires a machine to localize an event and recognize its category in unconstrained videos. Most existing methods leverage only the visual information of a video while neglecting its audio information, which can be very helpful and important for event localization. For example, humans often recognize an event by reasoning over the visual and audio content simultaneously. Moreover, the audio information can guide the model to pay more attention to the informative regions of visual scenes, which helps reduce interference from the background. Motivated by these observations, in this paper we propose a relation-aware network that leverages both audio and visual information for accurate event localization. Specifically, to reduce background interference, we propose an audio-guided spatial-channel attention module that guides the model to focus on event-relevant visual regions. In addition, we build connections between the visual and audio modalities with a relation-aware module. In particular, we learn the representations of video and/or audio segments by aggregating information from the other modality according to the cross-modal relations. Finally, relying on the relation-aware representations, we conduct event localization by predicting an event-relevance score and a classification score. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art methods in both supervised and weakly-supervised AVE settings.
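To make the two mechanisms described above more concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: an audio vector gates the channels of a visual feature map and attends over its spatial locations (audio-guided spatial-channel attention), and each segment of one modality aggregates information from the other modality via learned cross-modal affinities (relation-aware aggregation). The module names (AudioGuidedAttention, CrossModalRelation), dimensions, and the two prediction heads in the usage snippet are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AudioGuidedAttention(nn.Module):
    """Sketch of audio-guided spatial-channel attention.

    The audio feature re-weights the channels of the visual feature map and
    produces a spatial attention map over its locations, emphasising
    event-relevant regions before pooling.
    """

    def __init__(self, vis_dim, aud_dim, hidden_dim=256):
        super().__init__()
        self.channel_fc = nn.Linear(aud_dim, vis_dim)      # audio -> per-channel gates
        self.query_fc = nn.Linear(aud_dim, hidden_dim)     # audio -> spatial query
        self.key_conv = nn.Conv2d(vis_dim, hidden_dim, 1)  # visual -> spatial keys

    def forward(self, vis, aud):
        # vis: (B, C, H, W) visual feature map; aud: (B, D_a) audio feature
        B, C, H, W = vis.shape
        # channel attention: gate each visual channel by the audio content
        ch_gate = torch.sigmoid(self.channel_fc(aud)).view(B, C, 1, 1)
        vis = vis * ch_gate
        # spatial attention: the audio query attends over the H*W locations
        q = self.query_fc(aud).unsqueeze(1)                 # (B, 1, hidden)
        k = self.key_conv(vis).flatten(2).transpose(1, 2)   # (B, H*W, hidden)
        scores = (k @ q.transpose(1, 2)).squeeze(-1) / k.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)                # (B, H*W)
        # attention-weighted pooling -> event-relevant visual vector
        v_flat = vis.flatten(2).transpose(1, 2)             # (B, H*W, C)
        return (attn.unsqueeze(-1) * v_flat).sum(dim=1)     # (B, C)


class CrossModalRelation(nn.Module):
    """Sketch of cross-modal relation-aware aggregation.

    Each segment of the query modality gathers information from all segments
    of the other modality, weighted by learned cross-modal affinities.
    """

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x, y):
        # x: (B, T, D) query modality; y: (B, T, D) the other modality
        attn = torch.softmax(
            self.q(x) @ self.k(y).transpose(1, 2) / x.size(-1) ** 0.5, dim=-1
        )
        return x + attn @ self.v(y)                         # residual cross-modal fusion


# Illustrative usage: per-segment relevance and video-level class scores.
B, T, C, H, W, Da = 2, 10, 512, 7, 7, 128
vis = torch.randn(B, T, C, H, W)   # visual feature maps for T segments
aud = torch.randn(B, T, Da)        # audio features for T segments

att = AudioGuidedAttention(C, Da)
rel = CrossModalRelation(C)
aud_proj = nn.Linear(Da, C)

v_seg = torch.stack([att(vis[:, t], aud[:, t]) for t in range(T)], dim=1)  # (B, T, C)
a_seg = aud_proj(aud)                                                      # (B, T, C)
v_rel = rel(v_seg, a_seg)          # visual segments enriched with audio relations

relevance = torch.sigmoid(nn.Linear(C, 1)(v_rel))   # per-segment event-relevance score
logits = nn.Linear(C, 28)(v_rel.mean(dim=1))        # video-level class scores (28 is a placeholder)
```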
Keywords
Audio-visual event localization, Cross-modality relation, Attention