How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition
IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1052-1064, 2020.
Audio-Visual Speech Recognition (AVSR) seeks to model, and thereby exploit, the dynamic relationship between a human voice and the corresponding mouth movements. A recently proposed multimodal fusion strategy, AV Align, based on state-of-the-art sequence-to-sequence neural networks, attempts to model this relationship by explicitly aligni...