CLIP-MSA: Incorporating Inter-Modal Dynamics and Common Knowledge to Multimodal Sentiment Analysis With CLIP

Qi Huang, Pingting Cai, Tanyue Nie, Jinshan Zeng

2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)

Abstract
Multimodal Sentiment Analysis (MSA) aims to infer the sentiment polarities of speakers in video streams from multiple modal features, such as textual, acoustic, and visual features, and has attracted considerable attention in recent years. Existing MSA models often derive unimodal embeddings from the associated modal features individually, overlooking the importance of inter-modal dynamics and common knowledge in the extraction of unimodal embeddings, which results in limited performance. In this paper, we propose a novel MSA model called CLIP-MSA, which incorporates inter-modal dynamics and common knowledge into the generation of unimodal representations via Contrastive Language-Image Pre-training (CLIP), and fuses the textual, acoustic, and visual representations with a hierarchical co-attention mechanism. Extensive experimental results on two benchmark datasets show that the proposed model outperforms existing state-of-the-art models on CMU-MOSI and achieves competitive performance on CMU-MOSEI in terms of four commonly used evaluation metrics.
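The abstract does not specify the fusion details, but a hierarchical co-attention fusion of three modality embeddings can be sketched as below. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the two-level structure (text–audio and text–video co-attention followed by a second-level co-attention), the embedding dimension, and the regression head are all assumptions for illustration.

```python
# Illustrative sketch (not the authors' code): hierarchical co-attention fusion
# of text, acoustic, and visual embeddings, assuming each modality has already
# been encoded into a sequence of fixed-size feature vectors.
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """Cross-attends a query modality over a context modality."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query, context, context)
        return self.norm(query + attended)  # residual connection


class HierarchicalCoAttentionFusion(nn.Module):
    """Fuses text with audio and text with video, then fuses the two results."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.text_audio = CoAttentionBlock(dim)
        self.text_video = CoAttentionBlock(dim)
        self.top = CoAttentionBlock(dim)
        self.head = nn.Linear(dim, 1)  # sentiment regression score (assumed head)

    def forward(self, t: torch.Tensor, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        ta = self.text_audio(t, a)            # text attended by audio
        tv = self.text_video(t, v)            # text attended by video
        fused = self.top(ta, tv)              # second-level co-attention
        return self.head(fused.mean(dim=1))   # pool over the sequence dimension


if __name__ == "__main__":
    B, L, D = 2, 20, 256  # hypothetical batch size, sequence length, embedding dim
    t, a, v = (torch.randn(B, L, D) for _ in range(3))
    score = HierarchicalCoAttentionFusion(D)(t, a, v)
    print(score.shape)  # torch.Size([2, 1])
```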
Keywords
Multimodal sentiment analysis, CLIP, Hierarchical co-attention, Representation learning