A Novel Cross-Fusion Method of Different Types of Features for Image Captioning
2023 International Joint Conference on Neural Networks (IJCNN 2023)
Abstract
Multi-modal tasks, including image captioning, are receiving increasing attention. Building on X-Linear attention, we simultaneously introduce grid features and region features extracted by Faster R-CNN. We obtain a global feature vector for each type of original feature through mean pooling. The two types of features are encoded by two parallel encoders. Each encoder takes two inputs: a set of feature vectors (region/grid) and the corresponding global feature vector. Each encoding layer outputs an encoded global feature vector and a set of encoded feature vectors. We cross-fuse the global feature vector output by each encoding layer for region features with the set of encoded feature vectors for grid features and, in the same way, cross-fuse the other pair: the global feature vector (grid) with the set of encoded feature vectors (region). Finally, we fuse the two global feature vectors output by the two encoders as the final global features, and the two sets of encoded feature vectors output by the two encoders as the final visual features. Experimental results on the COCO dataset show that our model achieves a new SOTA performance of BLEU-1 81.5%, BLEU-4 40.5%, METEOR 29.6%, and ROUGE 59.5% on the Karpathy test split.
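The data flow described in the abstract (mean-pooled global vectors, two parallel streams, and symmetric cross-fusion) can be sketched in a few lines of numpy. This is a minimal illustration of the tensor shapes only: the encoders are omitted, and the fusion operators (here broadcast addition and concatenation) are assumptions, since the abstract does not specify the paper's actual fusion mechanism.

```python
import numpy as np

def mean_pool(feats):
    # Global feature vector: mean over the feature set, as the abstract describes.
    return feats.mean(axis=0)

def cross_fuse(global_vec, feat_set):
    # Placeholder fusion: broadcast-add the other modality's global vector
    # to every vector in the encoded feature set. The paper's actual fusion
    # operator is not given in the abstract; addition is only an assumption.
    return feat_set + global_vec

# Toy features: 4 region vectors and 6 grid vectors of dimension 8
# (the counts and dimension are illustrative, not the paper's settings).
rng = np.random.default_rng(0)
region = rng.normal(size=(4, 8))
grid = rng.normal(size=(6, 8))

g_region = mean_pool(region)  # global vector of the region stream
g_grid = mean_pool(grid)      # global vector of the grid stream

# Symmetric cross-fusion: region global <-> grid set, grid global <-> region set.
fused_grid = cross_fuse(g_region, grid)
fused_region = cross_fuse(g_grid, region)

# Final fusion of the two streams (again, the exact operators are assumed):
# averaged global features and concatenated visual features.
final_global = 0.5 * (g_region + g_grid)
final_visual = np.concatenate([fused_region, fused_grid], axis=0)

print(final_global.shape, final_visual.shape)  # (8,) (10, 8)
```

The shapes make the design visible: each cross-fused set keeps its own cardinality, while the global vectors from both streams collapse into a single vector of the shared feature dimension.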
Keywords
Image Captioning, Region Features, Grid Features, Transformer, Cross-Fusion