A Cross-modal and Redundancy-reduced Network for Weakly-Supervised Audio-Visual Violence Detection

ACM Multimedia Asia (2023)

Abstract
Multimodal learning using audio and visual information has improved violence detection. However, previous studies overlook the gap between pre-trained networks and the final violence detection task, as well as the semantic inconsistency between audio and visual features. We regard the task-irrelevant information caused by the former and the semantic noise caused by the latter as redundancy that degrades overall detection performance. Moreover, the prevailing visual-centric approach, which uses audio features only as guidance, may be biased; we contend that both modalities are crucial for violence detection. To address these issues, we propose a Cross-modal and Redundancy-reduced Network for Weakly-Supervised Audio-Visual Violence Detection. Our framework integrates a relation-aware module with a bi-directional cross-modal attention mechanism to explore interactions between modalities. We then introduce a feature filter gate to reduce redundancy. Finally, a multi-branch classification module is proposed to better utilize both modalities. Extensive experiments demonstrate the effectiveness of our approach, which surpasses previous methods and achieves state-of-the-art performance in violence detection.
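To make the bi-directional cross-modal attention idea concrete, below is a minimal PyTorch sketch of a block in which audio snippet features attend over visual features and vice versa. It is not the authors' released code; the class name, feature dimensions, number of heads, and the residual/LayerNorm fusion are assumptions chosen for illustration.

```python
# Minimal sketch (not the authors' implementation) of a bi-directional
# cross-modal attention block: each modality is enriched by attending
# to the other. Shapes and hyperparameters are assumed for illustration.
import torch
import torch.nn as nn


class BiCrossModalAttention(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Audio queries attend over visual keys/values, and vice versa.
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio:  (batch, T, dim) audio snippet features
        # visual: (batch, T, dim) visual snippet features
        a_enh, _ = self.a2v(audio, visual, visual)   # audio enriched by vision
        v_enh, _ = self.v2a(visual, audio, audio)    # vision enriched by audio
        return self.norm_a(audio + a_enh), self.norm_v(visual + v_enh)


if __name__ == "__main__":
    audio = torch.randn(2, 32, 128)    # hypothetical 32 snippets per video
    visual = torch.randn(2, 32, 128)
    a, v = BiCrossModalAttention()(audio, visual)
    print(a.shape, v.shape)            # both (2, 32, 128)
```

In a full pipeline of the kind the abstract describes, the two enriched streams would then pass through a redundancy-reducing filter gate and a multi-branch classifier; those components are not sketched here.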