Towards local visual modeling for image captioning
Pattern Recognition (2023)
Abstract
• Local visual modeling with grid features for image captioning.
• Locality-Sensitive Attention (LSA) is deployed for the intra-layer interaction via local visual modeling.
• Locality-Sensitive Fusion (LSF) is used for inter-layer information fusion.
• Locality-Sensitive Transformer Network (LSTNet) outperforms SOTA captioning models on MS-COCO.
• The generalization of LSTNet is also verified on the Flickr8k and Flickr30k datasets.
Keywords
Image captioning, Attention mechanism, Local visual modeling