Without detection

IET Computer Vision (2022)

Abstract
Current image captioning methods usually integrate an object detection network to obtain image features at the level of objects and other salient regions. However, the detection network must be pre-trained independently on additional data, so its use imposes higher training costs on the overall captioning model, mainly owing to the demand for extra training data and computing resources. In this work, the authors propose a local–global attention model based on two-step clustering features for image captioning. The two-step clustering features can be obtained at relatively low cost and can still represent objects and other salient image regions. To help the model perceive the image better, the authors introduce a novel local–global attention mechanism: at each time step, the model analyses the clustering features from local perspectives through to global ones, improving its understanding of the image contents. The authors evaluate the proposed method on the MSCOCO test server, achieving BLEU-4/METEOR/ROUGE-L scores of 36.8, 27.4, and 57.2, respectively. While reducing training costs, the authors' model also achieves results close to those of models that use detection features.
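The abstract only sketches the two ideas, so the following is a minimal illustrative sketch, not the authors' code: (1) "two-step clustering features", interpreted here as clustering CNN grid features twice (assumed to be k-means) so the coarse centroids act as cheap stand-ins for detected-region features, and (2) a local–global read-out that attends over the local centroids and mixes in a global mean-pooled view at each decoding step. All shapes, the choice of k-means, the dot-product attention, and the gating form are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_step_clustering(grid_feats, k1=32, k2=8):
    """grid_feats: (N, D) CNN grid features for one image.
    Step 1 groups grid cells into k1 fine clusters; step 2 groups the
    fine centroids into k2 coarse, object-like pseudo-regions.
    k1/k2 are hypothetical choices, not values from the paper."""
    fine = KMeans(n_clusters=k1, n_init=4).fit(grid_feats).cluster_centers_
    coarse = KMeans(n_clusters=k2, n_init=4).fit(fine).cluster_centers_
    return coarse  # (k2, D) clustering features

def local_global_attention(regions, hidden):
    """regions: (K, D) clustering features; hidden: (D,) decoder state.
    Returns a context vector mixing a locally attended read with a
    global mean-pooled view of the image (gate form is assumed)."""
    scores = regions @ hidden                 # (K,) local relevance
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over regions
    local_ctx = weights @ regions             # (D,) local context
    global_ctx = regions.mean(axis=0)         # (D,) global context
    gate = 1.0 / (1.0 + np.exp(-(hidden @ (local_ctx - global_ctx))))
    return gate * local_ctx + (1.0 - gate) * global_ctx

# Toy usage with random features standing in for CNN activations.
rng = np.random.default_rng(0)
feats = rng.normal(size=(196, 512)).astype(np.float32)  # 14x14 grid, D=512
regions = two_step_clustering(feats)
h = rng.normal(size=512).astype(np.float32)
ctx = local_global_attention(regions, h)
print(regions.shape, ctx.shape)  # (8, 512) (512,)
```

The point of the sketch is the cost argument from the abstract: two rounds of k-means on grid features need no extra annotated data or pre-trained detector, yet yield a small set of region-like vectors that an attention decoder can consume in the same way as detection features.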