BLSP-Emo: Towards Empathetic Large Speech-Language Models
CoRR (2024)
Abstract
The recent release of GPT-4o showcased the potential of end-to-end multimodal
models, not just in terms of low latency but also in their ability to
understand and generate expressive speech with rich emotions. While the details
are unknown to the open research community, it likely involves significant
amounts of curated data and compute, neither of which is readily accessible. In
this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with
Emotion support), a novel approach to developing an end-to-end speech-language
model capable of understanding both semantics and emotions in speech and
generating empathetic responses. BLSP-Emo utilizes existing speech recognition
(ASR) and speech emotion recognition (SER) datasets through a two-stage
process. The first stage focuses on semantic alignment, following recent work
on pretraining speech-language models using ASR data. The second stage performs
emotion alignment with the pretrained speech-language model on an emotion-aware
continuation task constructed from SER data. Our experiments demonstrate that
the BLSP-Emo model excels in comprehending speech and delivering empathetic
responses, both in instruction-following tasks and conversations.
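The second-stage "emotion-aware continuation task" can be pictured as follows: each SER example (a transcript paired with an emotion label) is turned into a text prompt that asks a language model to continue the utterance consistently with the speaker's emotion, and the speech-language model is then aligned to produce that same continuation from the raw speech. The sketch below is a minimal, hypothetical illustration of that data construction; the function name, prompt template, and example data are assumptions, not the paper's actual implementation.

```python
def build_continuation_prompt(transcript: str, emotion: str) -> str:
    """Build a text prompt asking an LLM to continue an utterance
    in a way that is consistent with the speaker's emotion.
    (Illustrative template only -- not the paper's exact wording.)"""
    return (
        f"The speaker sounds {emotion}. "
        f"Continue the following utterance naturally, responding to "
        f"both its content and its tone:\n{transcript}"
    )

# Toy SER-style examples: (transcript, emotion label) pairs.
ser_examples = [
    ("I finally got the job offer!", "happy"),
    ("My flight was cancelled again.", "frustrated"),
]

# Text-side targets; during emotion alignment, the speech-language
# model would be trained so that the speech input yields the same
# continuation that the text prompt elicits.
prompts = [build_continuation_prompt(t, e) for t, e in ser_examples]
for p in prompts:
    print(p)
    print("---")
```

The key design point this illustrates is that no new annotation is needed: the emotion labels already present in SER datasets are repurposed to condition text continuations, which then serve as alignment targets for the speech model.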