A Cross-modal and Redundancy-reduced Network for Weakly-Supervised Audio-Visual Violence Detection

ACM Multimedia Asia (2023)

Abstract
Multimodal learning using audio and visual information has improved violence detection. However, previous studies overlook the gap between pre-trained networks and the final violence detection task, as well as the semantic inconsistency between audio and visual features. We regard the task-irrelevant information caused by the former and the semantic noise caused by the latter as redundancy that degrades overall detection performance. Moreover, the prevailing visual-centric approach, which uses audio features only as guidance, may be biased; we contend that both modalities are crucial for violence detection. To address these issues, we propose a Cross-modal and Redundancy-reduced Network for Weakly-Supervised Audio-Visual Violence Detection. Our framework integrates a relation-aware module with a bi-directional cross-modal attention mechanism to explore interactions between modalities. We then introduce a feature filter gate to reduce redundancy. Finally, a multi-branch classification module is proposed to better utilize both modalities. Extensive experiments demonstrate the effectiveness of our approach, which surpasses previous methods and achieves state-of-the-art performance in violence detection.
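To make the bi-directional cross-modal attention idea concrete, below is a minimal PyTorch sketch of a block in which audio snippet features attend over visual features and vice versa. It is not the authors' released code; the class name, feature dimensions, number of heads, and the residual/LayerNorm fusion are assumptions chosen for illustration.

```python
# Minimal sketch (not the authors' implementation) of a bi-directional
# cross-modal attention block: each modality is enriched by attending
# to the other. Shapes and hyperparameters are assumed for illustration.
import torch
import torch.nn as nn


class BiCrossModalAttention(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Audio queries attend over visual keys/values, and vice versa.
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio:  (batch, T, dim) audio snippet features
        # visual: (batch, T, dim) visual snippet features
        a_enh, _ = self.a2v(audio, visual, visual)   # audio enriched by vision
        v_enh, _ = self.v2a(visual, audio, audio)    # vision enriched by audio
        return self.norm_a(audio + a_enh), self.norm_v(visual + v_enh)


if __name__ == "__main__":
    audio = torch.randn(2, 32, 128)    # hypothetical 32 snippets per video
    visual = torch.randn(2, 32, 128)
    a, v = BiCrossModalAttention()(audio, visual)
    print(a.shape, v.shape)            # both (2, 32, 128)
```

In a full pipeline of the kind the abstract describes, the two enriched streams would then pass through a redundancy-reducing filter gate and a multi-branch classifier; those components are not sketched here.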