Image captioning: Semantic selection unit with stacked residual attention

Image and Vision Computing (2024)

Abstract
Semantic information and attention mechanisms play important roles in image captioning. Semantic information strengthens the relationship between images and language, while attention steers the model toward the relevant spatial regions of the image. However, most current works learn semantic attributes only from pairs of images and sentences, which fails to exploit a wider set of semantic attributes and the structural information of sentences, and thus limits the variety of the generated sentences. Meanwhile, current attention models usually cannot learn positional information explicitly during attention generation, and they suffer from vanishing gradients during training. This paper proposes a Semantic Selection Unit (SSU) and a Stacked Residual Attention (SRA) to remedy these drawbacks. Specifically, the SSU is designed to selectively capture semantic information from expanded attributes or guidance sentences. With the help of the expanded vocabulary and the structural information in sentences, the SSU can improve the quality of the generated sentences. The SRA is constructed to address both the missing positional information and the vanishing gradient problem during attention generation. Architecturally, the SSU and SRA work together in a joint framework trained end to end for image captioning. Extensive experiments on the public MS COCO dataset achieve a CIDEr score of 139.7 on the test set.
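The abstract gives no architectural details for the SRA, so the following is only a minimal sketch of the general idea it names: attention layers stacked with identity (residual) shortcuts, applied to region features that carry an explicit position signal. Everything in it is an assumption made for illustration, not the authors' design: the function names, the single-head dot-product attention, the sinusoidal position encoding, and the toy dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(n, d):
    # Standard sinusoidal position encoding of shape (n, d); d must be even.
    # Assumption: the paper may encode position differently.
    pos = np.arange(n)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, d, 2) / d)
    enc = np.zeros((n, d))
    enc[:, 0::2] = np.sin(pos * freq)
    enc[:, 1::2] = np.cos(pos * freq)
    return enc

def attention(x, wq, wk, wv):
    # Single-head scaled dot-product self-attention over the rows of x.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def stacked_residual_attention(regions, layers):
    # Hypothetical helper: inject positions once, then stack attention
    # layers, each wrapped in an identity shortcut so that gradients have
    # a direct path through the stack.
    x = regions + sinusoidal_positions(*regions.shape)
    for wq, wk, wv in layers:
        x = x + attention(x, wq, wk, wv)
    return x

# Toy usage: 5 image regions with 8-dimensional features, 3 stacked layers.
rng = np.random.default_rng(0)
n_regions, d_model, n_layers = 5, 8, 3
regions = rng.normal(size=(n_regions, d_model))
layers = [
    tuple(rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
    for _ in range(n_layers)
]
out = stacked_residual_attention(regions, layers)
```

With all projection weights set to zero, every attention term vanishes and the stack reduces to the position-augmented input. This makes the role of the shortcut concrete: the identity path is preserved regardless of depth.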
Keywords
Image captioning, Semantic attributes, Semantic selection unit, Transformer, Stacked residual attention