Robust Audio-Visual Speech Recognition Using Bimodal Dfsmn With Multi-Condition Training And Dropout Regularization

2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)(2019)

Abstract
Audio-visual speech recognition (AVSR) is considered one of the most promising approaches to robust speech recognition, especially in noisy environments. Compared to audio-only speech recognition, the major issues in AVSR are the lack of publicly available audio-visual corpora and the need for robust fusion of speech and vision information. In this work, based on the recently released NTCD-TIMIT audio-visual corpus, we address the challenges of AVSR from three aspects: 1) optimal integration of acoustic and visual information; 2) robust performance with multi-condition training; 3) robust modeling against missing visual information during decoding. We propose a bimodal DFSMN to jointly learn feature fusion and acoustic modeling, and use a per-frame dropout approach to make the AVSR system robust to a missing visual modality. In the experiments, we construct two setups based on the NTCD-TIMIT corpus, consisting of 5 hours of clean training data and 150 hours of multi-condition training data, respectively. As a result, we achieve a phone error rate of 12.6% on the clean test set and an average phone error rate of 26.2% over all test sets (clean, various SNRs, various noise types), both of which dramatically improve on the baseline performance for the NTCD-TIMIT task.
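The per-frame dropout idea mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes visual features arrive as a `(num_frames, feat_dim)` array and that robustness is trained by randomly zeroing whole visual frames, simulating a missing visual modality; the function name and `drop_prob` parameter are hypothetical.

```python
import numpy as np

def per_frame_visual_dropout(visual_feats, drop_prob=0.3, rng=None):
    """Randomly zero out entire visual frames during training.

    visual_feats: array of shape (num_frames, feat_dim)
    drop_prob: probability that a frame's visual features are dropped
               (illustrative value; the paper's exact setting is not given here)
    """
    rng = rng or np.random.default_rng()
    # One Bernoulli draw per frame: True = keep the frame, False = drop it
    keep = rng.random(visual_feats.shape[0]) >= drop_prob
    # Broadcasting the keep mask over the feature dimension zeroes
    # all features of a dropped frame at once, mimicking a lost video frame
    return visual_feats * keep[:, None]
```

Applied during training, this forces the acoustic-modeling branch to remain useful when visual input vanishes at decoding time, which is the robustness property the abstract claims.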
Keywords
Audio-visual speech recognition, bimodal DFSMN, robust speech recognition, dropout, multi-condition training