Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition

Mo Sun, Jian Li, Hui Feng, Wei Gou, Haifeng Shen, Jian Tang, Yi Yang, Jieping Ye

ICMI '20: International Conference on Multimodal Interaction, Virtual Event, Netherlands, October 2020

Abstract

This paper presents our approach to the Audio-video Group Emotion Recognition sub-challenge of EmotiW 2020. The task is to classify a video into one of three group emotion categories: positive, neutral, or negative. Our approach exploits two feature levels: spatio-temporal features and static features. At the spatio-temporal level, we feed multiple input modalities (RGB, RGB difference, optical flow, and warped optical flow) into multiple video classification networks to train the spatio-temporal models. At the static level, we crop all faces and bodies in an image using a state-of-the-art human pose estimation method and train several kinds of CNNs with the image-level group emotion labels. Finally, we fuse the results of all 14 models and achieve third place in this sub-challenge, with classification accuracies of 71.93% on the validation set and 70.77% on the test set.
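
The abstract does not say how the 14 model outputs are combined. As a rough illustration only, the sketch below assumes weighted late fusion: each model emits a probability per class, and the fused score is the weighted average of those probabilities. The names `fuse_scores`, `predict_label`, and `LABELS`, and the example weights, are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence

# Class order is an assumption made for this sketch.
LABELS = ("positive", "neutral", "negative")


def fuse_scores(
    model_probs: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-model class probabilities (late fusion)."""
    if not model_probs:
        raise ValueError("at least one model output is required")
    if weights is None:
        # Default to equal weights when none are supplied.
        weights = [1.0] * len(model_probs)
    if len(weights) != len(model_probs):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    n_classes = len(model_probs[0])
    fused = [0.0] * n_classes
    for probs, weight in zip(model_probs, weights):
        if len(probs) != n_classes:
            raise ValueError("all models must score the same classes")
        for i, p in enumerate(probs):
            fused[i] += weight * p / total
    return fused


def predict_label(
    model_probs: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> str:
    """Return the label with the highest fused score."""
    fused = fuse_scores(model_probs, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return LABELS[best]


if __name__ == "__main__":
    # Three made-up model outputs for one video (not data from the paper).
    outputs = [
        [0.6, 0.3, 0.1],
        [0.2, 0.5, 0.3],
        [0.5, 0.4, 0.1],
    ]
    print(predict_label(outputs))
    print(predict_label(outputs, weights=[1.0, 3.0, 1.0]))
```

In a setup like this, the weights would typically be tuned on the validation set, so that stronger models contribute more to the final decision.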
