Direct modeling of raw audio with DNNS for wake word detection

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)(2017)

引用 29|浏览32
暂无评分
摘要
In this work, we develop a technique for training features directly from the single-channel speech waveform in order to improve wake word (WW) detection performance. Conventional speech recognition systems typically extract a compact feature representation based on prior knowledge such as log-mel filter bank energy (LFBE). Such a feature is then used for training a deep neural network (DNN) acoustic model (AM). In contrast, we directly train the WW DNN AM from the single-channel audio data in a stage-wise manner. We first build a feature extraction DNN with a small hidden bottleneck layer, and train this bottleneck feature representation using the same multi-task cross-entropy objective function as we use to train our WW DNNs. Then, the WW classification DNN is trained with input bottleneck features, keeping the feature extraction layers fixed. Finally, the feature extraction and classification DNNs are combined and then jointly optimized. We show the effectiveness of this stage-wise training technique through a set of experiments on real beam-formed far-field data. The experiment results show that the audioinput DNN provides significantly lower miss rates for a range of false alarm rates over the LFBE when a sufficient amount of training data is available, yielding approximately 12 % relative improvement in the area under the curve (AUC).
更多
查看译文
关键词
Keyword Spotting,Distant Speech,Raw Audio Input Acoustic Modeling
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要