GroupCap: Group-Based Image Captioning with Structured Relevance and Diversity Constraints

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018

Abstract
Most image captioning models focus on single-image captioning, where correlations such as relevance and diversity among group images (e.g., within the same album or event) are simply neglected, resulting in less accurate and less diverse captions. Recent works mainly impose diversity during online inference only, neglecting the correlation among visual structures during offline training. In this paper, we propose a novel group-based image captioning scheme (termed GroupCap), which jointly models the structured relevance and diversity among group images toward optimal collaborative captioning. In particular, we first propose a visual tree parser (VP-Tree) to construct the structured semantic correlations within individual images. Then, the relevance and diversity among images are modeled by exploiting the correlations among their tree structures. Finally, such correlations are formulated as constraints and fed into the LSTM-based captioning generator. We adopt an end-to-end formulation to jointly train the visual tree parser, the structured relevance and diversity constraints, and the LSTM-based captioning model. To facilitate quantitative evaluation, we further release two group captioning datasets derived from the MS-COCO benchmark, the first of their kind. Quantitative results show that the proposed GroupCap model outperforms state-of-the-art and alternative approaches.
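
To make the described pipeline concrete, below is a minimal PyTorch sketch of how the joint objective could look: a stand-in encoder plays the role of the VP-Tree parser, pairwise relevance and diversity terms are computed over the tree codes of a group, and an LSTM decoder supplies the captioning loss, all optimized end to end. Every name, dimension, and loss form here (GroupCapSketch, group_constraints, the cosine-similarity margin, the 0.1 weights) is an illustrative assumption, not the paper's actual formulation.

    # Illustrative sketch only: module names, dimensions, and loss forms are
    # assumptions, not taken from the GroupCap paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GroupCapSketch(nn.Module):
        def __init__(self, feat_dim=2048, tree_dim=512, vocab_size=10000, embed_dim=512):
            super().__init__()
            # Stand-in for the VP-Tree parser: maps CNN features to a structured code.
            self.tree_encoder = nn.Sequential(
                nn.Linear(feat_dim, tree_dim), nn.ReLU(), nn.Linear(tree_dim, tree_dim)
            )
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim + tree_dim, tree_dim, batch_first=True)
            self.out = nn.Linear(tree_dim, vocab_size)

        def forward(self, feats, captions):
            # feats: (G, feat_dim) CNN features for G images in one group.
            # captions: (G, T) token ids; captions[:, 0] is a <BOS> token.
            tree = self.tree_encoder(feats)                       # (G, tree_dim)
            words = self.embed(captions[:, :-1])                  # (G, T-1, embed_dim)
            ctx = tree.unsqueeze(1).expand(-1, words.size(1), -1)
            hidden, _ = self.lstm(torch.cat([words, ctx], dim=-1))
            return tree, self.out(hidden)                         # logits: (G, T-1, vocab)

    def group_constraints(tree, margin=0.2):
        # Assumed pairwise terms: pull tree codes of group images together
        # (relevance) while penalizing near-duplicates (diversity).
        sim = F.cosine_similarity(tree.unsqueeze(0), tree.unsqueeze(1), dim=-1)
        off_diag = sim[~torch.eye(tree.size(0), dtype=torch.bool)]
        relevance = (1.0 - off_diag).mean()                 # encourage shared semantics
        diversity = F.relu(off_diag - (1.0 - margin)).mean()  # keep captions distinct
        return relevance, diversity

    model = GroupCapSketch()
    feats = torch.randn(4, 2048)                  # a toy group of 4 images
    caps = torch.randint(0, 10000, (4, 12))       # toy token ids
    tree, logits = model(feats, caps)
    xent = F.cross_entropy(logits.reshape(-1, logits.size(-1)), caps[:, 1:].reshape(-1))
    rel, div = group_constraints(tree)
    loss = xent + 0.1 * rel + 0.1 * div           # joint end-to-end objective
    loss.backward()

The sketch only shows how relevance and diversity could enter a single differentiable objective alongside the captioning loss; the paper's VP-Tree produces an explicit tree structure per image, whereas this stand-in collapses it to a flat vector.
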
Keywords
structured semantic correlations,tree structures,visual tree parser,structured relevance,group captioning datasets,GroupCap model,image captioning models,one-line captioning,visual structures,optimal collaborative captioning,LSTM-based captioning generator,group-based image captioning scheme,diversity constraints