CLAP4Emo: ChatGPT-Assisted Speech Emotion Retrieval with Natural Language Supervision

Wei-Cheng Lin, Shabnam Ghaffarzadegan, Luca Bondi, Abinaya Kumar, Samarjit Das, Ho-Hsiang Wu

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2024)

Abstract
Speech emotion retrieval is an important technique for large-scale, high-quality data collection. The conventional approach, which uses an ensemble of classification models, may limit the diversity of retrieved emotions and/or underperform in out-of-domain acoustic conditions. Natural language is diverse and agnostic to specific acoustic concepts, and so holds great potential for developing language-based speech emotion retrieval systems. In this paper, we introduce CLAP4Emo, a novel framework that retrieves emotional speech via natural language prompts, based on contrastive language-audio pretraining. To compensate for the absence of training captions in existing public datasets, we propose a systematic framework that applies ChatGPT to generate emotion captions. The experimental results demonstrate that our method can effectively improve the diversity of retrieved samples while maintaining high precision across five benchmark datasets. By leveraging large language models, we establish a connection between audio and language for emotion description, culminating in an intuitive and interactive retrieval system. We release the generated emotion captions at: https://github.com/boschresearch/soundsee-emo-caps
Keywords
speech emotion retrieval, contrastive language-audio pretraining, large language models, foundation model