SPT: Spatial Pyramid Transformer for Image Captioning

IEEE Transactions on Circuits and Systems for Video Technology (2023)

Abstract
Existing approaches to image captioning tend to adopt Transformer-based architectures with grid features, which represent the state of the art. However, these approaches typically process the grid features at a single fixed resolution, which often hampers the perception of entities at different scales. In addition, applying them directly may lose spatial and fine-grained semantic information. To this end, we propose a simple yet effective method, named Spatial Pyramid Transformer (SPT). Specifically, it adopts several parameter-shared pyramid structures to perform semantic interactions across different grid resolutions. In each layer, we design a Spatial-aware Pseudo-supervised (SP) module, which aims to adaptively recover the spatial information that is disrupted when grid features are flattened. Moreover, to maintain the model size and enhance semantics, we build a simple weighted residual connection, termed the Scale-wise Reinforcement (SR) module, to simultaneously explore both low- and high-level encoded features. Extensive experiments on the MS-COCO benchmark demonstrate that our method achieves new state-of-the-art performance without adding excessive parameters compared with the vanilla Transformer. In addition, our method is extended to the video captioning task, which further demonstrates the practicability of the proposed method. Code is available at https://github.com/zchoi/SPT.
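To make the two structural ideas in the abstract more concrete, the following is a minimal sketch, not the authors' implementation (see the linked repository for that). It assumes the pyramid levels are obtained by average-pooling the grid with hypothetical factors 1, 2 and 4, and that the weighted residual is a single scalar weight `alpha`; the abstract specifies neither choice. All function names are illustrative.

```python
from typing import List, Sequence

Grid = List[List[List[float]]]   # H x W x C grid of feature vectors
Tokens = List[List[float]]       # N x C flattened token sequence


def avg_pool_grid(grid: Grid, factor: int) -> Grid:
    """Downsample by averaging non-overlapping factor x factor blocks.

    Assumption: average pooling is only one plausible way to obtain a
    coarser grid resolution; the abstract does not name the operator.
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    if h % factor or w % factor:
        raise ValueError("grid height and width must be divisible by factor")
    pooled: Grid = []
    for i in range(0, h, factor):
        row = []
        for j in range(0, w, factor):
            cell = [0.0] * c
            for di in range(factor):
                for dj in range(factor):
                    for k in range(c):
                        cell[k] += grid[i + di][j + dj][k]
            row.append([v / (factor * factor) for v in cell])
        pooled.append(row)
    return pooled


def build_pyramid(grid: Grid, factors: Sequence[int] = (1, 2, 4)) -> List[Grid]:
    """Return one grid per pooling factor, from fine to coarse."""
    return [avg_pool_grid(grid, f) for f in factors]


def flatten(grid: Grid) -> Tokens:
    """Flatten an H x W x C grid into H*W tokens (row-major order)."""
    return [cell for row in grid for cell in row]


def weighted_residual(low: Tokens, high: Tokens, alpha: float) -> Tokens:
    """Blend low- and high-level features token-wise: high + alpha * low.

    Assumption: a scalar weight stands in for whatever weighting the
    SR module actually uses.
    """
    return [[h + alpha * l for l, h in zip(lt, ht)] for lt, ht in zip(low, high)]
```

In this sketch, flattening is the step that discards the 2D layout of the grid, which is the loss the SP module is described as compensating for; the sketch does not attempt to model that module.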
Keywords
Image Captioning, Video Captioning, Transformer, Pyramidal structure, Clustering