A Novel Cross-Fusion Method of Different Types of Features for Image Captioning

2023 International Joint Conference on Neural Networks (IJCNN 2023)

Abstract
Multi-modal tasks, including image captioning, are receiving increasing attention. Building on X-Linear attention, we simultaneously introduce grid features and region features extracted by Faster R-CNN. We obtain a global feature vector for each type of original feature through mean pooling. The two types of features are encoded by two parallel encoders, each taking two inputs: a set of feature vectors (region/grid) and the corresponding global feature vector. Each encoding layer outputs an encoded global feature vector and a set of encoded feature vectors. We cross-fuse the global feature vector output by each encoding layer of the region branch with the set of encoded feature vectors of the grid branch, and likewise cross-fuse the grid global feature vector with the encoded region feature vectors. Finally, we fuse the two global feature vectors output by the two encoders as the final global features, and the two sets of encoded feature vectors output by the two encoders as the final visual features. Experimental results on the COCO dataset show that our model achieves a new state-of-the-art performance of BLEU-1 81.5%, BLEU-4 40.5%, METEOR 29.6%, and ROUGE 59.5% on the Karpathy test split.
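The abstract describes the encoder architecture only at a high level; the sketch below illustrates the cross-fusion flow it outlines in minimal PyTorch. The class name CrossFusionEncoders, the use of standard nn.TransformerEncoderLayer blocks in place of X-Linear attention, the additive cross-fusion, and the final concatenation/linear fusion steps are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class CrossFusionEncoders(nn.Module):
    """Two parallel encoders (region/grid) with cross-fusion of global and set features.

    Assumptions (not the paper's exact design): standard TransformerEncoderLayer
    blocks stand in for X-Linear attention, cross-fusion is an additive broadcast,
    and the final global fusion is a linear projection of concatenated vectors.
    """

    def __init__(self, d_model: int = 512, num_layers: int = 3):
        super().__init__()

        def make_layer():
            return nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

        self.region_layers = nn.ModuleList(make_layer() for _ in range(num_layers))
        self.grid_layers = nn.ModuleList(make_layer() for _ in range(num_layers))
        self.fuse_global = nn.Linear(2 * d_model, d_model)  # hypothetical fusion head

    def forward(self, region_feats: torch.Tensor, grid_feats: torch.Tensor):
        # Global feature vector of each feature type via mean pooling.
        g_region = region_feats.mean(dim=1)  # (B, D)
        g_grid = grid_feats.mean(dim=1)      # (B, D)
        r, g = region_feats, grid_feats
        for r_layer, g_layer in zip(self.region_layers, self.grid_layers):
            # Each encoder sees its own feature set plus its global vector,
            # prepended as an extra token.
            r_out = r_layer(torch.cat([g_region.unsqueeze(1), r], dim=1))
            g_out = g_layer(torch.cat([g_grid.unsqueeze(1), g], dim=1))
            g_region_new, r = r_out[:, 0], r_out[:, 1:]
            g_grid_new, g = g_out[:, 0], g_out[:, 1:]
            # Cross-fusion: the region global vector is injected into the grid
            # feature set, and the grid global vector into the region set.
            g = g + g_region_new.unsqueeze(1)
            r = r + g_grid_new.unsqueeze(1)
            g_region, g_grid = g_region_new, g_grid_new
        # Final global features: fuse the two encoders' global vectors.
        global_feat = self.fuse_global(torch.cat([g_region, g_grid], dim=-1))
        # Final visual features: the two encoded sets, here simply concatenated
        # along the sequence dimension (the abstract does not specify this step).
        visual_feat = torch.cat([r, g], dim=1)
        return global_feat, visual_feat


# Example with 36 region features and 49 grid features per image (shapes assumed).
regions = torch.randn(2, 36, 512)
grids = torch.randn(2, 49, 512)
global_feat, visual_feat = CrossFusionEncoders()(regions, grids)
print(global_feat.shape, visual_feat.shape)  # (2, 512) and (2, 85, 512)
```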
Keywords
Image Captioning, Region Features, Grid Features, Transformer, Cross-Fusion