Inferring Emphasis For Real Voice Data: An Attentive Multimodal Neural Network Approach

MULTIMEDIA MODELING (MMM 2020), PT II (2020)

Abstract
To understand speakers' attitudes and intentions in real Voice Dialogue Applications (VDAs), effective emphasis inference from users' queries may play an important role. However, VDAs serve a tremendous number of uncertain speakers with great diversity in dialects and expression preferences, which challenges traditional emphasis detection methods. In this paper, to better infer emphasis for real voice data, we propose an attentive multimodal neural network. Specifically, first, besides the acoustic features, extensive textual features are applied in modelling. Then, considering the independence of the features, we model the multimodal features using a multi-path convolutional neural network (MCNN). Furthermore, combining the high-level multimodal features, we train an emphasis classifier that attends on the textual features with an attention-based bidirectional long short-term memory network (ABLSTM), to comprehensively learn discriminative features from diverse users. Our experimental study, based on a real-world dataset collected from Sogou Voice Assistant (https://yy.sogou.com/), shows that our method outperforms alternative baselines by 1.0-15.5% in terms of F1 measure.
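To make the attention step concrete, the following is a minimal sketch of attention-weighted pooling over per-time-step hidden states, the kind of operation an attention-based BLSTM typically uses to produce one vector for classification. It is not the authors' implementation: the abstract does not specify the scoring function, so dot-product scoring is an assumption, and `attention_pool`, `states`, and `query` (standing in for a textual-feature representation) are hypothetical names and values chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Score each time step by its dot product with `query` (an assumed
    scoring function), normalise the scores with softmax, and return the
    weights together with the weighted sum of the hidden states."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, pooled


# Hypothetical hidden states for three time steps (two dimensions each)
# and a hypothetical query vector; none of these values come from the paper.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
weights, pooled = attention_pool(states, query)
```

In this toy input the third time step aligns best with the query, so it receives the largest weight and contributes most to the pooled vector that a classifier would consume.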
Keywords
Emphasis detection, Voice dialogue applications, Attention