A Study for Improving Device-Directed Speech Detection Toward Frictionless Human-Machine Interaction

INTERSPEECH (2019)

Abstract
In this paper, we extend our previous work on device-directed utterance detection, which aims to distinguish voice queries intended for a smart-home device from background speech. The task can be framed as a binary utterance-level classification problem, which we approach with a DNN-LSTM model that takes acoustic features and features from the automatic speech recognition (ASR) decoder as input. In this work, we study the performance of the model for different dialog types and for different categories of decoder features. To address different dialog types, we found that a model with a separate output branch for each dialog type outperforms a model with a shared output branch, achieving a relative equal error rate (EER) reduction of 12.5%. We also found the average number of arcs in a confusion network to be one of the most informative ASR decoder features. In addition, we explore different backpropagation frequencies for training the acoustic embedding, propagating the error every k frames (k=1,3,5,7), and compare mean and attention pooling methods for generating an utterance representation. We found that attention pooling provides the most discriminative utterance representation and outperforms mean pooling by a relative EER reduction of 4.97%.
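To make the difference between the two pooling methods concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the attention mechanism, so the sketch assumes a generic softmax attention driven by a single scoring vector. The names `mean_pool`, `attention_pool`, and `w` are hypothetical, and the frame embeddings are random placeholders rather than real acoustic embeddings.

```python
import math
import random


def mean_pool(frames):
    """Average T frame embeddings (each of dimension D) into one D-dim vector."""
    num_frames = len(frames)
    return [sum(column) / num_frames for column in zip(*frames)]


def attention_pool(frames, w):
    """Softmax-weighted average of frame embeddings.

    `w` is a hypothetical scoring vector (learned in a real model); each
    frame's score is its dot product with `w`. This is an assumed, generic
    form of attention, not necessarily the one used in the paper.
    """
    scores = [sum(x * wi for x, wi in zip(frame, w)) for frame in frames]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(w)
    pooled = [sum(a * frame[d] for a, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


# Placeholder data: 50 frames of 8-dimensional embeddings.
random.seed(0)
frames = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]
w = [random.gauss(0.0, 1.0) for _ in range(8)]

utt_mean = mean_pool(frames)
utt_att, att_weights = attention_pool(frames, w)
```

Under this assumption, mean pooling is the special case in which every frame receives the same weight (for example, when `w` is all zeros), whereas attention pooling lets the model emphasize the frames that are most informative for the device-directed decision.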
Keywords
speech recognition, human-computer interaction, computational paralinguistics