Look, listen, and decode: Multimodal speech recognition with images

2016 IEEE Spoken Language Technology Workshop (SLT), 2016

Abstract
In this paper, we introduce a multimodal speech recognition scenario, in which an image provides contextual information for a spoken caption to be decoded. We investigate a lattice rescoring algorithm that integrates information from the image at two different points: the image is used to augment the language model with the most likely words, and to rescore the top hypotheses using a word-level RNN. This rescoring mechanism decreases the word error rate by 3 absolute percentage points, compared to a baseline speech recognizer operating with only the speech recording.
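Below is a minimal sketch of the two-stage rescoring idea the abstract describes: an image-conditioned bonus on likely words (standing in for the augmented language model) combined with a word-level, image-conditioned RNN score over top hypotheses. The function names, interpolation weights, and toy scoring functions are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Hypothesis:
    words: List[str]      # decoded word sequence from the lattice
    lattice_score: float  # combined acoustic + baseline LM log-score


def rescore_with_image(
    hypotheses: Sequence[Hypothesis],
    image_word_logprob: Callable[[List[str]], float],
    unigram_boost: Callable[[str], float],
    alpha: float = 0.5,
    beta: float = 0.1,
) -> Hypothesis:
    """Pick the best hypothesis after adding image-conditioned scores.

    image_word_logprob: log-probability of the word sequence under a
        word-level RNN conditioned on the image (e.g. via CNN features).
    unigram_boost: per-word bonus for words the image makes likely,
        standing in for the augmented language model.
    alpha, beta: interpolation weights (hypothetical values).
    """
    def total(h: Hypothesis) -> float:
        lm_aug = sum(unigram_boost(w) for w in h.words)
        rnn = image_word_logprob(h.words)
        return h.lattice_score + beta * lm_aug + alpha * rnn

    return max(hypotheses, key=total)


if __name__ == "__main__":
    # Toy usage with stand-in scoring functions.
    hyps = [
        Hypothesis(["a", "dog", "runs", "on", "grass"], -12.0),
        Hypothesis(["a", "dock", "runs", "on", "grass"], -11.5),
    ]
    likely_words = {"dog", "grass"}  # words the image model favours
    best = rescore_with_image(
        hyps,
        image_word_logprob=lambda ws: -0.2 * len(ws),
        unigram_boost=lambda w: 1.0 if w in likely_words else 0.0,
    )
    print(" ".join(best.words))
```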
Keywords
Multimodal speech recognition, image captioning, CNN, lattices