Factored Spatial and Spectral Multichannel Raw Waveform CLDNNs

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016

Cited by 82 | Views 105
Abstract
Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. Recently, we explored performing multichannel enhancement jointly with acoustic modeling, where beamforming and frequency decomposition were folded into a single layer of the neural network [1, 2]. In this paper, we explore factoring these operations into separate layers of the network. Furthermore, we explore using multi-task learning (MTL) as a proxy for postfiltering, where we train the network to predict "clean" features as well as context-dependent states. We find that with the factored architecture, we can achieve a 10% relative improvement in WER over a single channel and a 5% relative improvement over the unfactored model from [1] on a 2,000-hour Voice Search task. In addition, by incorporating MTL, we can achieve 11% and 7% relative improvements over the single-channel and unfactored multichannel models, respectively.
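To make the factoring concrete, the sketch below separates the two operations the abstract describes: a spatial layer that does filter-and-sum beamforming over the input channels for several look directions, followed by a shared spectral layer (a longer filterbank with pooling and log compression) applied to each beamformed signal. This is a minimal NumPy illustration with random filters and invented shapes (channel count, filter lengths, number of look directions are assumptions), not the paper's trained architecture or its CLDNN back end.

```python
import numpy as np

rng = np.random.default_rng(0)

C, T = 2, 400        # input channels, raw waveform samples (assumed sizes)
P, Ls = 10, 5        # spatial look directions, short spatial filter taps
F, Lf = 32, 25       # spectral filterbank size, longer spectral filter taps

x = rng.standard_normal((C, T))                    # multichannel raw waveform
h_spatial = 0.1 * rng.standard_normal((P, C, Ls))  # per-direction, per-channel FIR
h_spectral = 0.1 * rng.standard_normal((F, Lf))    # spectral filters, shared across directions

# Spatial layer: filter-and-sum beamforming, one output signal per look direction
beams = np.stack([
    sum(np.convolve(x[c], h_spatial[p, c], mode="valid") for c in range(C))
    for p in range(P)
])  # shape (P, T - Ls + 1)

# Spectral layer: shared filterbank on each beam, then max-pooling over time
# and log compression, yielding one (P, F) feature map per analysis window
feats = np.empty((P, F))
for p in range(P):
    for f in range(F):
        y = np.convolve(beams[p], h_spectral[f], mode="valid")
        feats[p, f] = np.log(np.maximum(np.max(np.abs(y)), 1e-6))

print(feats.shape)
```

In the factored model these features would feed the CLDNN acoustic model; under MTL, two output heads would be trained jointly, one predicting context-dependent states and one predicting "clean" features.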
Keywords
spatial multichannel raw waveform, spectral multichannel raw waveform, convolutional long short-term memory deep neural networks, CLDNN, automatic speech recognition, multichannel ASR systems, speech enhancement, beamforming, postfiltering, acoustic modeling, frequency decomposition, multitask learning, MTL, 2000-hour Voice Search task, multichannel models