Look, listen, and decode: Multimodal speech recognition with images

2016 IEEE Spoken Language Technology Workshop (SLT), 2016

Abstract
In this paper, we introduce a multimodal speech recognition scenario, in which an image provides contextual information for a spoken caption to be decoded. We investigate a lattice rescoring algorithm that integrates information from the image at two different points: the image is used to augment the language model with the most likely words, and to rescore the top hypotheses using a word-level RNN. This rescoring mechanism decreases the word error rate by 3 absolute percentage points, compared to a baseline speech recognizer operating with only the speech recording.
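Below is a minimal sketch of the two-stage rescoring idea the abstract describes: an image-conditioned bonus on likely words (standing in for the augmented language model) combined with a word-level, image-conditioned RNN score over top hypotheses. The function names, interpolation weights, and toy scoring functions are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Hypothesis:
    words: List[str]      # decoded word sequence from the lattice
    lattice_score: float  # combined acoustic + baseline LM log-score


def rescore_with_image(
    hypotheses: Sequence[Hypothesis],
    image_word_logprob: Callable[[List[str]], float],
    unigram_boost: Callable[[str], float],
    alpha: float = 0.5,
    beta: float = 0.1,
) -> Hypothesis:
    """Pick the best hypothesis after adding image-conditioned scores.

    image_word_logprob: log-probability of the word sequence under a
        word-level RNN conditioned on the image (e.g. via CNN features).
    unigram_boost: per-word bonus for words the image makes likely,
        standing in for the augmented language model.
    alpha, beta: interpolation weights (hypothetical values).
    """
    def total(h: Hypothesis) -> float:
        lm_aug = sum(unigram_boost(w) for w in h.words)
        rnn = image_word_logprob(h.words)
        return h.lattice_score + beta * lm_aug + alpha * rnn

    return max(hypotheses, key=total)


if __name__ == "__main__":
    # Toy usage with stand-in scoring functions.
    hyps = [
        Hypothesis(["a", "dog", "runs", "on", "grass"], -12.0),
        Hypothesis(["a", "dock", "runs", "on", "grass"], -11.5),
    ]
    likely_words = {"dog", "grass"}  # words the image model favours
    best = rescore_with_image(
        hyps,
        image_word_logprob=lambda ws: -0.2 * len(ws),
        unigram_boost=lambda w: 1.0 if w in likely_words else 0.0,
    )
    print(" ".join(best.words))
```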
Keywords
Multimodal speech recognition, image captioning, CNN, lattices