Transformer with a Parallel Decoder for Image Captioning

International Journal of Pattern Recognition and Artificial Intelligence (2024)

Abstract
In this paper, a parallel decoder and a word group prediction module are proposed to speed up decoding and improve caption quality. The image features extracted by the encoder are linearly projected to different word groups, and a relaxed mask matrix is designed so that decoding becomes faster without degrading the captions. First, since a caption is composed of many words, a sentence can be broken down into word groups or single words according to its syntactic structure; we obtain this decomposition through constituency parsing. Second, we make full use of the extracted features to predict the size of each word group. Third, a new embedding that represents the information of a word is proposed, built on top of the word embedding. Finally, with the help of the word groups, we design a mask matrix that modifies the decoding process so that each step of the model can produce one or more words in parallel. Experiments on public datasets demonstrate that our method reduces the time complexity of decoding while maintaining competitive performance.
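The abstract does not specify how the relaxed mask matrix is constructed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a word may attend to every word in earlier word groups and to every word in its own group, which relaxes the usual strictly causal mask. The function names `group_ids`, `relaxed_mask`, and `causal_mask`, and the example group sizes, are hypothetical.

```python
def group_ids(group_sizes):
    """Map each word position to the index of its word group."""
    ids = []
    for g, size in enumerate(group_sizes):
        if size < 1:
            raise ValueError("each word group must contain at least one word")
        ids.extend([g] * size)
    return ids


def relaxed_mask(group_sizes):
    """Assumed relaxed mask: position i may attend to position j
    when j's group is not later than i's group (True = visible)."""
    ids = group_ids(group_sizes)
    n = len(ids)
    return [[ids[j] <= ids[i] for j in range(n)] for i in range(n)]


def causal_mask(n):
    """Standard causal mask for word-by-word autoregressive decoding."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Hypothetical grouping of a six-word caption into groups of 2, 1, and 3 words,
# e.g. "a dog" / "chases" / "a red ball".
sizes = [2, 1, 3]
mask = relaxed_mask(sizes)

# One decoding step per group instead of one per word.
steps_parallel = len(sizes)
steps_sequential = sum(sizes)

for row in mask:
    print("".join("1" if visible else "0" for visible in row))
print(steps_parallel, "steps instead of", steps_sequential)
```

Under this assumption the number of decoding steps equals the number of word groups rather than the number of words, which is one way to read the reduction in time complexity claimed in the abstract.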
Keywords
Image captioning, constituency parsing, word groups, time complexity, transformer