How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

Sterpu George
Saam Christian

IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1052-1064, 2020.

Abstract:

Audio-Visual Speech Recognition (AVSR) seeks to model, and thereby exploit, the dynamic relationship between a human voice and the corresponding mouth movements. A recently proposed multimodal fusion strategy, AV Align, based on state-of-the-art sequence-to-sequence neural networks, attempts to model this relationship by explicitly aligni...
