CM-CS: Cross-Modal Common-Specific Feature Learning For Audio-Visual Video Parsing

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)(2023)

Abstract
The weakly-supervised audio-visual video parsing (AVVP) task aims to parse the duration and category of each snippet when only video-level event labels are provided. Most methods either leverage attention mechanisms to explore cross-modal and cross-video event semantics or alleviate label noise to improve performance. However, the distributional modality discrepancy caused by the heterogeneity of signals remains a significant challenge. To this end, we propose a novel cross-modal common-specific feature learning method (cm-CS) that maps modal features into modality-common and modality-specific subspaces. The former aims to capture similar high-level scene cues shared across modalities, while the latter captures modality-specific cues. The proposed method is applied both within the visual modality (across 2D and 3D features) and across the audio and visual modalities. In addition, we design a training strategy to strengthen the learning of similarities and differences across modalities. Experiments show a large improvement of our method over existing works on the Look, Listen, and Parse (LLP) dataset (e.g., from 58.9% to 62.9% on the video-level visual metric).
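The abstract does not include code; the following is a minimal sketch of the common-specific decomposition idea it describes, assuming simple linear projections per modality and cosine-based alignment and orthogonality losses. All dimensions, variable names, and loss forms here are illustrative assumptions, not the paper's actual cm-CS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(features, w_common, w_specific):
    """Map one modality's snippet features into common and specific subspaces."""
    return features @ w_common, features @ w_specific

def cosine(x, y):
    """Cosine similarity between two flattened feature tensors."""
    return float(np.sum(x * y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical sizes: 4 snippets, 128-d inputs, 64-d subspaces.
n, d_in, d_sub = 4, 128, 64
audio = rng.standard_normal((n, d_in))
visual = rng.standard_normal((n, d_in))

# One common and one specific projection per modality (randomly initialised here;
# in training these would be learned parameters).
wa_c, wa_s = rng.standard_normal((d_in, d_sub)), rng.standard_normal((d_in, d_sub))
wv_c, wv_s = rng.standard_normal((d_in, d_sub)), rng.standard_normal((d_in, d_sub))

a_common, a_specific = project(audio, wa_c, wa_s)
v_common, v_specific = project(visual, wv_c, wv_s)

# Training would MAXIMISE similarity between the two modalities' common
# features, and MINIMISE it between each modality's common and specific
# features, so the subspaces encode shared vs. modality-specific cues.
loss_align = 1.0 - cosine(a_common, v_common)
loss_ortho = cosine(a_common, a_specific) ** 2 + cosine(v_common, v_specific) ** 2
loss = loss_align + loss_ortho
```

The same decomposition could be applied within the visual stream (2D vs. 3D features), with the projected common/specific features concatenated or fused before the downstream parsing head.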
Keywords
modality discrepancy,common-specific feature encoding,in-visual,audio-visual