Extract, Adapt and Recognize: An End-to-End Neural Network for Corrupted Monaural Speech Recognition

INTERSPEECH (2019)

Cited by 7 | Viewed 65 times

Abstract
Automatic speech recognition (ASR) in challenging conditions, such as in the presence of interfering speakers or music, remains an unsolved problem. This paper presents Extract, Adapt, and Recognize (EAR), an end-to-end neural network with fully learnable separation and recognition components optimized toward the ASR criterion. Between a state-of-the-art speech separation module as the extractor and an acoustic modeling module as the recognizer, EAR introduces an adaptor, in which adapted acoustic features are learned from the separation outputs by a bi-directional long short-term memory network trained to minimize the recognition loss directly. Relative to a conventional joint training model, the EAR model achieves word error rate reductions (WERR) of 8.5% to 22.3% under various dBs of music corruption, and 1.2% to 26.9% under speaker interference. With speaker tracing, the WERR further improves to 12.4% to 29.0%.
Keywords
monaural source separation, robust speech recognition, joint training, filter bank learning
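The extractor-adaptor-recognizer chain described in the abstract can be sketched as a single trainable module. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the module sizes (`feat_dim`, `hidden`, `n_tokens`) are arbitrary, the extractor is a stand-in for the actual speech separation network, and the recognizer stands in for the acoustic model; only the BLSTM adaptor placement between them reflects the paper's stated design.

```python
import torch
import torch.nn as nn

class EAR(nn.Module):
    """Hypothetical sketch of the Extract-Adapt-Recognize pipeline."""

    def __init__(self, feat_dim=80, hidden=128, n_tokens=50):
        super().__init__()
        # Extractor: placeholder for the speech-separation front end
        # (the paper uses a state-of-the-art separation module here).
        self.extractor = nn.Linear(feat_dim, feat_dim)
        # Adaptor: BLSTM that learns adapted acoustic features from
        # the separation outputs, trained on the recognition loss.
        self.adaptor = nn.LSTM(feat_dim, hidden,
                               batch_first=True, bidirectional=True)
        # Recognizer: placeholder acoustic model producing token posteriors.
        self.recognizer = nn.Linear(2 * hidden, n_tokens)

    def forward(self, mixture):
        # mixture: (batch, frames, feat_dim) corrupted speech features
        separated = self.extractor(mixture)
        adapted, _ = self.adaptor(separated)
        # Logits feed a recognition loss, so gradients flow end-to-end
        # through the adaptor and extractor.
        return self.recognizer(adapted)

model = EAR()
x = torch.randn(2, 100, 80)  # 2 utterances, 100 frames, 80-dim features
logits = model(x)            # (2, 100, 50) frame-level token scores
```

Because the whole chain is differentiable, minimizing the recognition loss on `logits` updates the separation and adaptation parameters jointly, which is the key difference from a pipeline with a frozen separation front end.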