CM-CS: Cross-Modal Common-Specific Feature Learning For Audio-Visual Video Parsing

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)(2023)

Abstract
The weakly-supervised audio-visual video parsing (AVVP) task aims to parse the duration and category of each snippet when only video-level event labels are provided. Most methods either leverage attention mechanisms to explore cross-modal and cross-video event semantics or alleviate label noise to improve performance. However, the distributional modality discrepancy caused by the heterogeneity of signals remains a significant challenge. To this end, we propose a novel cross-modal common-specific feature learning method (cm-CS) that maps modal features into modality-common and modality-specific subspaces. The former aims to capture similar high-level scene cues shared across modalities, while the latter captures modality-specific cues. The proposed method is applied both within the visual modality (across 2D and 3D features) and across the audio and visual modalities. In addition, we design a training strategy to strengthen the learning of similarities and differences across modalities. Experiments show a large improvement of our method over existing works on the Look, Listen, and Parse (LLP) dataset (e.g., from 58.9% to 62.9% on the video-level visual metric).
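The abstract does not include code; the following is a minimal sketch of the common-specific decomposition idea it describes, assuming simple linear projections per modality and cosine-based alignment and orthogonality losses. All dimensions, variable names, and loss forms here are illustrative assumptions, not the paper's actual cm-CS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(features, w_common, w_specific):
    """Map one modality's snippet features into common and specific subspaces."""
    return features @ w_common, features @ w_specific

def cosine(x, y):
    """Cosine similarity between two flattened feature tensors."""
    return float(np.sum(x * y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical sizes: 4 snippets, 128-d inputs, 64-d subspaces.
n, d_in, d_sub = 4, 128, 64
audio = rng.standard_normal((n, d_in))
visual = rng.standard_normal((n, d_in))

# One common and one specific projection per modality (randomly initialised here;
# in training these would be learned parameters).
wa_c, wa_s = rng.standard_normal((d_in, d_sub)), rng.standard_normal((d_in, d_sub))
wv_c, wv_s = rng.standard_normal((d_in, d_sub)), rng.standard_normal((d_in, d_sub))

a_common, a_specific = project(audio, wa_c, wa_s)
v_common, v_specific = project(visual, wv_c, wv_s)

# Training would MAXIMISE similarity between the two modalities' common
# features, and MINIMISE it between each modality's common and specific
# features, so the subspaces encode shared vs. modality-specific cues.
loss_align = 1.0 - cosine(a_common, v_common)
loss_ortho = cosine(a_common, a_specific) ** 2 + cosine(v_common, v_specific) ** 2
loss = loss_align + loss_ortho
```

The same decomposition could be applied within the visual stream (2D vs. 3D features), with the projected common/specific features concatenated or fused before the downstream parsing head.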
Keywords
modality discrepancy,common-specific feature encoding,in-visual,audio-visual