Multi-Branch Distance-Sensitive Self-Attention Network for Image Captioning

IEEE Transactions on Multimedia (2023)

Self-attention (SA) based networks have achieved great success in image captioning, consistently dominating the leaderboards of online benchmarks. However, existing SA networks still suffer from distance insensitivity and the low-rank bottleneck. In this paper, we optimize SA in two respects to address these issues. First, we introduce Distance-sensitive Self-Attention (DSA), which incorporates the raw geometric distances between query-key pairs in the 2D image into SA modeling. Second, we present a simple yet effective approach, named Multi-branch Self-Attention (MSA), to alleviate the low-rank bottleneck. MSA treats a multi-head self-attention layer as a branch and duplicates it multiple times to increase the expressive power of SA. To validate the effectiveness of the two designs, we apply them to the standard self-attention network and conduct extensive experiments on the highly competitive MS-COCO dataset. We achieve new state-of-the-art performance on both the local and online test sets, i.e., 135.1% CIDEr on the Karpathy split and 135.4% CIDEr on the official online split.
Keywords: image captioning, multi-branch techniques, distance-sensitive positional embedding
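The two ideas in the abstract can be sketched concretely: DSA biases the attention logits with the pairwise geometric distance between image regions, and MSA runs several such attention branches and combines their outputs to raise the rank of the overall mapping. The following is a minimal NumPy illustration of that idea, not the paper's exact formulation; the `alpha` distance-weighting parameter and the branch-averaging scheme are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distance_sensitive_attention(Q, K, V, coords, alpha=1.0):
    """One attention branch with a geometric distance bias (DSA sketch).

    Q, K, V: (n, d) arrays for n image regions; coords: (n, 2) region
    positions on the 2D grid. Subtracting alpha * Euclidean distance from
    the logits makes nearby regions attend to each other more strongly.
    (alpha is an illustrative parameter, not from the paper.)"""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return softmax(logits - alpha * dist) @ V

def multi_branch_attention(Q, K, V, coords, alphas=(0.5, 1.0)):
    """MSA sketch: duplicate the attention layer into several branches and
    average their outputs; how branches are combined is assumed here."""
    outs = [distance_sensitive_attention(Q, K, V, coords, a) for a in alphas]
    return np.mean(outs, axis=0)
```

With `alpha=0` the distance bias vanishes and a branch reduces to standard scaled dot-product attention, which makes the role of the bias easy to isolate in experiments.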