Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition
CoRR (2024)
Abstract
Speaker embeddings carry valuable emotion-related information, which makes
them a promising resource for enhancing speech emotion recognition (SER),
especially with limited labeled data. Traditionally, it has been assumed that
emotion information is indirectly embedded within speaker embeddings, leading
to their under-utilization. Our study reveals a direct and useful link between
emotion and state-of-the-art speaker embeddings in the form of intra-speaker
clusters. By conducting a thorough clustering analysis, we demonstrate that
emotion information can be readily extracted from speaker embeddings. In order
to leverage this information, we introduce a novel contrastive pretraining
approach applied to emotion-unlabeled data for speech emotion recognition. The
proposed approach samples positive and negative examples
based on the intra-speaker clusters of speaker embeddings. The proposed
strategy, which leverages extensive emotion-unlabeled data, leads to a
significant improvement in SER performance, whether employed as a standalone
pretraining task or integrated into a multi-task pretraining setting.
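
As a rough illustration of the sampling idea described above, the following minimal sketch draws (anchor, positive, negative) triplets from intra-speaker cluster labels. It is an assumption-laden sketch, not the authors' implementation: the abstract does not specify the exact sampling rule, so here the positive is taken from the anchor's own cluster and the negative from a different cluster of the same speaker. The function name `sample_triplets` and its parameters are hypothetical, and the cluster labels are assumed to come from some per-speaker clustering of the speaker embeddings (for example k-means), which is not shown.

```python
import random
from collections import defaultdict


def sample_triplets(speaker_ids, cluster_ids, num_triplets, seed=0):
    """Sample (anchor, positive, negative) utterance indices.

    Assumption (not stated in the abstract): the positive shares the
    anchor's speaker and intra-speaker cluster, while the negative shares
    the speaker but belongs to a different cluster.
    """
    rng = random.Random(seed)

    # Group utterance indices by speaker, then by cluster within that speaker.
    groups = defaultdict(lambda: defaultdict(list))
    for idx, (spk, clu) in enumerate(zip(speaker_ids, cluster_ids)):
        groups[spk][clu].append(idx)

    # A speaker is usable if it has at least two clusters and at least one
    # cluster with two or more utterances (needed for an anchor/positive pair).
    usable = {}
    for spk, clusters in groups.items():
        anchor_clusters = [c for c, members in clusters.items() if len(members) >= 2]
        if anchor_clusters and len(clusters) >= 2:
            usable[spk] = anchor_clusters
    if not usable:
        raise ValueError("no speaker has enough clustered utterances to form a triplet")

    speakers = sorted(usable)
    triplets = []
    for _ in range(num_triplets):
        spk = rng.choice(speakers)
        pos_cluster = rng.choice(usable[spk])
        # Anchor and positive: two distinct utterances from the same cluster.
        anchor, positive = rng.sample(groups[spk][pos_cluster], 2)
        # Negative: an utterance from another cluster of the same speaker.
        neg_cluster = rng.choice([c for c in groups[spk] if c != pos_cluster])
        negative = rng.choice(groups[spk][neg_cluster])
        triplets.append((anchor, positive, negative))
    return triplets
```

In a pretraining loop, triplets like these could feed any standard contrastive or triplet objective on the SER encoder's outputs; no emotion labels are needed, only speaker identities and the cluster assignments.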
Keywords
Speech emotion recognition, speaker embeddings, clustering, contrastive learning, multi-task learning