A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

Signal Processing Systems (2017)

Abstract
We propose a novel speaker-dependent (SD) multi-condition (MC) training approach that jointly learns deep neural network (DNN) acoustic models and an explicit speech separation structure for recognizing multi-talker mixed speech in a single-channel setting. First, an MC acoustic modeling framework is established to train an SD-DNN model in multi-talker scenarios. Such a recognizer significantly reduces decoding complexity and improves recognition accuracy over recognizers that use speaker-independent DNN models with a complicated joint decoding structure, assuming the speaker identities in the mixed speech are known. In addition, an SD regression DNN that maps the acoustic features of mixed speech to the speech features of a target speaker is jointly trained with the SD-DNN based acoustic models. Experimental results on the Speech Separation Challenge (SSC) small-vocabulary recognition task show that the proposed approach under multi-condition training achieves an average word error rate (WER) of 3.8%, a relative WER reduction of 65.1% from a top-performing, DNN-based pre-processing-only approach that we proposed earlier under clean-condition training (Tu et al. 2016). Furthermore, the proposed joint training DNN framework yields a relative WER reduction of 13.2% from state-of-the-art systems under multi-condition training. Finally, the effectiveness of the proposed approach is also verified on the Wall Street Journal (WSJ0) task, a medium-vocabulary continuous speech recognition task, in a simulated multi-talker setting.
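
The relative WER reductions quoted above follow the usual definition, (baseline WER - new WER) / baseline WER. The short Python sketch below is only an illustration of that arithmetic; the function names `relative_wer_reduction` and `implied_baseline_wer` are hypothetical and do not come from the paper. The baseline value it derives from the 3.8% and 65.1% figures is implied by the arithmetic alone and is not a number stated in the abstract.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer versus baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def implied_baseline_wer(new_wer: float, relative_reduction_pct: float) -> float:
    """Return the baseline WER implied by a new WER and a relative reduction in percent."""
    if not 0.0 <= relative_reduction_pct < 100.0:
        raise ValueError("relative_reduction_pct must be in [0, 100)")
    return new_wer / (1.0 - relative_reduction_pct / 100.0)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the formula.
    print(relative_wer_reduction(10.0, 5.0))

    # Baseline implied by the abstract's 3.8% WER and 65.1% relative reduction.
    # This is derived arithmetic, not a figure reported in the abstract.
    print(round(implied_baseline_wer(3.8, 65.1), 2))
```

Working in relative terms makes results comparable across baselines of different strength, which is why the abstract reports both an absolute WER and relative reductions.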
Keywords
Multi-talker speech recognition, Speaker-dependent model, Single-channel speech separation, Deep neural networks, Joint training