Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild.
IEEE Trans. Circuits Syst. Video Technol. (2024)
Abstract
Dynamic facial expression recognition in the wild is challenging because of obstacles such as low-light conditions, non-frontal faces, and facial occlusion. Purely vision-based approaches may not suffice to capture the complexity of human emotions accurately. To address this issue, we propose a Transformer-based Multimodal Emotional Perception (T-MEP) framework that extracts multimodal information effectively and uses it to augment the visual representation. Specifically, we design three transformer-based encoders to extract modality-specific features from audio, image, and text sequences, respectively. Each encoder is designed to fit its corresponding modality. In addition, we design a transformer-based multimodal information fusion module to model cross-modal representations among these modalities. The combination of self-attention and cross-attention in this module makes the fused output features more robust in encoding emotion. By mapping the audio and textual features into the latent space of the visual features, this module aligns the semantics of the three modalities for cross-modal information augmentation. Finally, we evaluate our method through extensive experiments on three popular datasets (MAFW, DFEW, and AFEW), which demonstrate its state-of-the-art performance. This research offers a promising direction for future studies to improve emotion recognition accuracy by exploiting multimodal features.
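The fusion idea described in the abstract, self-attention over visual tokens followed by cross-attention that pulls audio and text information into the visual latent space, can be illustrated with a minimal sketch. This is not the T-MEP implementation: the function names `attention` and `fuse` are hypothetical, there are no learned projections, heads, or normalization layers, and all three modalities are assumed to have already been encoded to the same feature width.

```python
import math


def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    # Scaled dot-product attention on plain lists of vectors.
    # Assumption: queries and keys share the same width d.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def fuse(visual, audio, text):
    # Hypothetical fusion step (illustrative only, not the paper's module).
    # 1) Self-attention: visual tokens attend to each other.
    v_self = attention(visual, visual, visual)
    # 2) Cross-attention: visual tokens act as queries, so the audio and
    #    text information is expressed per visual token.
    from_audio = attention(v_self, audio, audio)
    from_text = attention(v_self, text, text)
    # 3) Residual-style sum keeps one fused vector per visual token.
    return [[a + b + c for a, b, c in zip(vs, fa, ft)]
            for vs, fa, ft in zip(v_self, from_audio, from_text)]


visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 frame tokens
audio = [[0.5, 0.5], [1.0, -1.0]]              # 2 audio tokens
text = [[0.2, 0.1]]                            # 1 text token
fused = fuse(visual, audio, text)
print(len(fused), len(fused[0]))
```

Because the visual tokens supply the queries in both cross-attention calls, the output always has one vector per visual token regardless of how many audio or text tokens are present, which is one simple way to read "mapping audio and textual features to the latent space of visual features".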
Key words
Dynamic facial expression recognition, Multimodal information fusion, Semantic alignment, Deep learning