Question Splitting and Unbalanced Multi-Modal Pooling for VQA

2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW)

Abstract
Visual question answering (VQA) is a cross-modal learning task that requires understanding the question, interpreting the image, and associating the question with the image. Most existing models neither use different parts of the question in different modules nor account for the different roles that multi-modal features play in fusion. In this paper, we propose a question-splitting and unbalanced multi-modal pooling approach. The question is split into two parts: the question footer, which carries object information, and the question header, which carries question-type information. We then superimpose several feature-reinforcement linear layers on top of Multi-modal Factorized Bilinear (MFB) pooling to give the two parts different weights. To capture the interaction between multi-modal features, our model also introduces a co-attention mechanism. Experimental results demonstrate that our framework outperforms previous models such as Oracle (GVQA, SAN) and QRU, raising accuracy from 61.96% to 64.44% on the VQA 2.0 dataset and from 62.5% to 65.72% on the COCO-QA dataset.
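To make the fusion step concrete, below is a minimal PyTorch sketch of MFB pooling combined with an unbalanced weighting of the two question parts. The class names, the per-part MFB branches, and the learnable scalar weights are illustrative assumptions, not the paper's exact "feature reinforcement linear overlap" layers; the co-attention module is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    """Multi-modal Factorized Bilinear pooling (Yu et al., 2017):
    project both modalities into a shared factor space, fuse by
    element-wise product, then sum-pool over groups of k factors."""
    def __init__(self, x_dim, y_dim, out_dim, k=5):
        super().__init__()
        self.k = k
        self.proj_x = nn.Linear(x_dim, out_dim * k)
        self.proj_y = nn.Linear(y_dim, out_dim * k)

    def forward(self, x, y):
        # Element-wise product in the expanded factor space.
        z = self.proj_x(x) * self.proj_y(y)
        # Sum-pool every group of k factors down to one output unit.
        z = z.view(z.size(0), -1, self.k).sum(dim=2)
        # Power normalization followed by L2 normalization.
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)
        return F.normalize(z, dim=1)

class UnbalancedFusion(nn.Module):
    """Hypothetical fusion head: one MFB branch per question part,
    each scaled by a learnable weight with unequal initial values
    to reflect the 'unbalanced' pooling idea."""
    def __init__(self, img_dim, ques_dim, out_dim):
        super().__init__()
        self.mfb_header = MFB(img_dim, ques_dim, out_dim)
        self.mfb_footer = MFB(img_dim, ques_dim, out_dim)
        self.w_header = nn.Parameter(torch.tensor(0.3))
        self.w_footer = nn.Parameter(torch.tensor(0.7))

    def forward(self, img_feat, header_feat, footer_feat):
        zh = self.mfb_header(img_feat, header_feat)
        zf = self.mfb_footer(img_feat, footer_feat)
        return self.w_header * zh + self.w_footer * zf

if __name__ == "__main__":
    # Hypothetical feature sizes: pooled image features plus separate
    # encodings of the question header and footer.
    fusion = UnbalancedFusion(img_dim=2048, ques_dim=1024, out_dim=1000)
    img = torch.randn(8, 2048)
    header = torch.randn(8, 1024)
    footer = torch.randn(8, 1024)
    print(fusion(img, header, footer).shape)  # torch.Size([8, 1000])
```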
Keywords
VQA, question splitting, unbalanced multi-modal pooling, 1D_GCNN, co-attention