Cross Modal Evaluation Of High Quality Emotional Speech Synthesis With The Virtual Human Toolkit

Blaise Potard, Matthew P. Aylett, David A. Baude

INTELLIGENT VIRTUAL AGENTS, IVA 2016 (2016)

Abstract

Emotional expression is a key requirement for intelligent virtual agents. For an agent to produce dynamic spoken content, speech synthesis is required. However, despite substantial work with pre-recorded prompts, very little work has explored the combined effect of high-quality emotional speech synthesis and facial expression. In this paper we offer a baseline evaluation of the naturalness and emotional range available by combining the freely available SmartBody component of the Virtual Human Toolkit (VHTK) with the CereVoice text-to-speech (TTS) system. Results echo previous work using pre-recorded prompts: the visual modality is dominant and the modalities do not interact. This allows the speech synthesis to add gradual changes to the perceived emotion, in terms of both valence and activation. The reported naturalness is good: 3.54 on a 5-point mean opinion score (MOS) scale.

Keywords

Speech synthesis, Unit selection, Expressive speech synthesis, Emotion, Prosody, Facial animation
