Speech Emotion Recognition via an Attentive Time–Frequency Neural Network

arXiv (2023)

Abstract
The spectrogram is commonly used as the input feature of deep neural networks to learn high-level time–frequency patterns of speech signals for speech emotion recognition (SER). Generally, different emotions correspond to specific energy activations within both frequency bands and time frames of the spectrogram, which indicates that the frequency and time domains are both essential for representing emotion in SER. However, recent spectrogram-based works mainly focus on modeling long-term dependencies in the time domain, which leaves these methods with two issues: 1) they neglect the emotion-related correlations within the frequency domain during time–frequency joint learning, and 2) they fail to capture the specific frequency bands associated with emotions. To address these issues, we propose an attentive time–frequency neural network (ATFNN) for SER, consisting of a time–frequency neural network (TFNN) and time–frequency attention. Specifically, to address the first issue, we design a TFNN with a frequency-domain encoder (F-Encoder) based on the Transformer encoder and a time-domain encoder (T-Encoder) based on bidirectional long short-term memory (Bi-LSTM). The F-Encoder and T-Encoder model the correlations within frequency bands and time frames, respectively, and they are embedded into a time–frequency joint learning strategy to obtain the time–frequency patterns of speech emotions. Moreover, to handle the second issue, we adopt time–frequency attention with a frequency-attention network (F-Attention) and a time-attention network (T-Attention) to focus on the emotion-related long-range dependencies between frequency bands and across time frames, which enhances the emotional discrimination of speech features. Extensive experimental results on three public emotional databases, i.e., IEMOCAP, ABC, and CASIA, show that the proposed ATFNN outperforms state-of-the-art methods.
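To make the idea of attending over both axes of a spectrogram concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' F-Attention or T-Attention network: it uses plain dot-product attention with fixed, hypothetical query vectors (`freq_query`, `time_query`), whereas the paper's attention modules are learned. The function names and the toy spectrogram are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(vectors, query):
    """Score each vector against the query by dot product, then normalize."""
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in vectors]
    return softmax(scores)


def time_frequency_attention(spec, freq_query, time_query):
    """Reweight a spectrogram along both axes (illustrative only).

    spec: list of F frequency bands, each a list of T time-frame energies.
    freq_query: hypothetical query of length T, used to score each band.
    time_query: hypothetical query of length F, used to score each frame.
    """
    # Frequency attention: treat each band (row) as a vector over time.
    freq_weights = attention_weights(spec, freq_query)
    # Time attention: treat each frame (column) as a vector over frequency.
    frames = [list(column) for column in zip(*spec)]
    time_weights = attention_weights(frames, time_query)
    # Emphasize cells whose band and frame both received high weight.
    weighted = [
        [freq_weights[f] * time_weights[t] * value for t, value in enumerate(band)]
        for f, band in enumerate(spec)
    ]
    return freq_weights, time_weights, weighted


# Toy 3-band x 3-frame "spectrogram": the middle band carries most energy.
spec = [
    [0.1, 0.2, 0.1],
    [0.9, 1.0, 0.8],
    [0.2, 0.1, 0.3],
]
freq_weights, time_weights, weighted = time_frequency_attention(
    spec, freq_query=[1.0, 1.0, 1.0], time_query=[1.0, 1.0, 1.0]
)
```

With all-ones queries the scores reduce to per-band and per-frame energy sums, so the highest-energy band and frame receive the largest weights. In a trained model the queries (or a small scoring network) would instead be learned from emotion labels.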
Keywords
Attention mechanism, spectrogram, speech emotion recognition (SER), time–frequency neural network (TFNN)