Knowing What it is: Semantic-Enhanced Dual Attention Transformer

IEEE TRANSACTIONS ON MULTIMEDIA (2023)

Abstract
Attention has become an indispensable component of models for various multimedia tasks such as Image Captioning (IC) and Visual Question Answering (VQA). However, most existing attention modules are designed to capture spatial dependency and remain insufficient for semantic understanding, e.g., the categories of objects and their attributes, which is also critical for image captioning. To compensate for this defect, we propose a novel attention module termed the Channel-wise Attention Block (CAB) to model channel-wise dependency for both the visual and linguistic modalities, thereby improving semantic learning and multi-modal reasoning simultaneously. Specifically, CAB has two novel designs to tackle the high overhead of channel-wise attention: a reduction-reconstruction block structure and gating-based attention prediction. Based on CAB, we further propose a novel Semantic-enhanced Dual Attention Transformer (termed SDATR), which combines the merits of spatial and channel-wise attention. To validate SDATR, we conduct extensive experiments on the MS COCO dataset and achieve new state-of-the-art performance of 134.5 CIDEr on the COCO Karpathy test split and 136.0 CIDEr on the official online testing server. To examine the generalization of SDATR, we also apply it to the task of visual question answering, where consistent performance gains are also observed. The code and models are publicly available at https://github.com/xmu-xiaoma666/SDATR.
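The abstract does not give the internals of CAB, but the two overhead-reducing designs it names (a reduction-reconstruction bottleneck and a gating-based prediction of channel weights) resemble a squeeze-and-excitation-style channel gate. As a rough, hypothetical illustration only (the function and weight names below are not from the paper), such a gate might look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_gate(X, W_reduce, W_reconstruct):
    """Hypothetical sketch of a channel-wise attention gate.

    X: (n_tokens, d) feature matrix for one modality.
    W_reduce: (d, d // r) projection shrinking the channel dimension
              (the 'reduction' step, cutting the cost of channel attention).
    W_reconstruct: (d // r, d) projection back to d channels
                   (the 'reconstruction' step).
    Returns features of the same shape as X, reweighted per channel.
    """
    pooled = X.mean(axis=0)                    # (d,) pool over tokens
    hidden = np.maximum(pooled @ W_reduce, 0)  # reduce + ReLU
    gate = sigmoid(hidden @ W_reconstruct)     # per-channel gate in (0, 1)
    return X * gate                            # emphasize/suppress channels

rng = np.random.default_rng(0)
d, r, n = 8, 2, 5
X = rng.normal(size=(n, d))
out = channel_gate(X, rng.normal(size=(d, d // r)),
                   rng.normal(size=(d // r, d)))
print(out.shape)  # (5, 8)
```

Because the sigmoid gate lies in (0, 1), each channel is attenuated rather than amplified, which is one common way such gates inject semantic (category/attribute-level) reweighting at low cost.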
Keywords
Transformers, Task analysis, Semantics, Visualization, Integrated circuit modeling, Standards, Head, Image captioning, Visual question answering, Attention mechanism, Transformer