BPCN: A simple and efficient model for visual question answering.

HPCC/DSS/SmartCity/DependSys(2022)

Abstract
Visual question answering (VQA) is a cross-modal task that combines computer vision and natural language processing. Because traditional attention-mechanism models cannot effectively bridge the gap between the high-level semantics of words and the low-level abstract pixels of images, we use a question-guided multi-head concatenated attention mechanism to map question features and image features into a shared space, obtaining question-guided image attention and the image features relevant to the question. In addition, existing VQA models usually encode text with static word vectors; in this paper, encoding questions with BERT's dynamic word vectors further improves the model's accuracy. Our model is simple and easy to train, with low complexity and significantly improved performance. Without using the VG dataset for data augmentation, our model reaches 69.28% on the test-std set.
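The question-guided attention described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' exact BPCN architecture: the projection matrices are random stand-ins for learned weights, and the feature dimensions (a 768-d pooled BERT question vector, 36 image regions of 2048-d features, 4 heads projected into a 64-d shared space) are common VQA conventions assumed here for concreteness.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(q_feat, img_feat, n_heads=4, d_shared=64, seed=0):
    """Question-guided multi-head concatenated attention over image regions.

    q_feat:   (d_q,)   pooled question feature (e.g. from BERT)
    img_feat: (R, d_v) features for R image regions
    Returns the concatenated per-head attended image features,
    shape (n_heads * d_shared,). The W matrices below are random
    placeholders for weights that would be learned in training.
    """
    rng = np.random.default_rng(seed)
    d_q, (R, d_v) = q_feat.shape[0], img_feat.shape
    heads = []
    for _ in range(n_heads):
        # Project question and image features into the shared space.
        Wq = rng.standard_normal((d_q, d_shared)) / np.sqrt(d_q)
        Wk = rng.standard_normal((d_v, d_shared)) / np.sqrt(d_v)
        Wv = rng.standard_normal((d_v, d_shared)) / np.sqrt(d_v)
        q = q_feat @ Wq                      # (d_shared,)
        k = img_feat @ Wk                    # (R, d_shared)
        v = img_feat @ Wv                    # (R, d_shared)
        # Question-guided attention weights over the R regions.
        scores = k @ q / np.sqrt(d_shared)   # (R,)
        alpha = softmax(scores)
        heads.append(alpha @ v)              # question-relevant image feature
    return np.concatenate(heads)             # concatenate the heads
```

The attended output would then be fused with the question feature and passed to an answer classifier; that fusion step is omitted here.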
Keywords
attention mechanism,BERT,multi-modal,visual question answering