On Diversity in Image Captioning: Metrics and Methods

IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)

Abstract
Diversity is one of the most important properties in image captioning, as it reflects the various ways in which the important concepts in an image can be expressed. However, the most popular metrics cannot adequately evaluate the diversity of multiple captions. In this paper, we first propose a metric to measure the diversity of a set of captions, which is derived from latent semantic analysis (LSA), and then kernelize LSA using CIDEr (R. Vedantam et al., 2015) similarity. Compared with mBLEU (R. Shetty et al., 2017), our proposed diversity metrics correlate relatively strongly with human evaluation. We conduct extensive experiments and find a large gap between the performance of current state-of-the-art models and human annotations when both diversity and accuracy are considered; models that aim to generate captions with higher CIDEr scores normally obtain lower diversity scores and generally learn to describe images using common words. To bridge this "diversity" gap, we consider several methods for training caption models to generate diverse captions. First, we show that balancing the cross-entropy loss and the CIDEr reward in reinforcement learning during training can effectively control the tradeoff between diversity and accuracy of the generated captions. Second, we develop approaches that directly optimize our diversity metric and the CIDEr score using reinforcement learning. These reinforcement learning (RL) approaches can be unified into a self-critical (S. J. Rennie et al., 2017) framework with new RL baselines. Third, we combine accuracy and diversity into a single measure using an ensemble matrix, and then maximize the determinant of the ensemble matrix via reinforcement learning to boost diversity and accuracy, which outperforms its counterparts on the oracle test. Finally, inspired by determinantal point processes (DPP), we develop a DPP selection algorithm to select a subset of captions from a large number of candidate captions. The experimental results show that maximizing the determinant of the ensemble matrix outperforms the other methods, considerably improving diversity and accuracy.
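The abstract does not give formulas, so the following is only a minimal sketch of how an LSA-style diversity score and a DPP-inspired subset selection might look. Several things here are assumptions rather than the paper's definitions: the score `-log(s_max / sum(s)) / log(m)` over the singular values of a caption-by-word matrix, the use of plain word counts and cosine similarity in place of CIDEr similarity, and the greedy log-determinant search as a stand-in for the DPP selection algorithm. All function names are hypothetical.

```python
import math
import numpy as np


def bow_matrix(captions):
    """Caption-by-word count matrix (assumed stand-in for the paper's features)."""
    vocab = sorted({w for c in captions for w in c.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    mat = np.zeros((len(captions), len(vocab)))
    for row, cap in enumerate(captions):
        for w in cap.lower().split():
            mat[row, index[w]] += 1.0
    return mat


def lsa_diversity(captions):
    """Assumed score: -log(s_max / sum(s)) / log(m), s = singular values."""
    m = len(captions)
    if m < 2:
        return 0.0
    s = np.linalg.svd(bow_matrix(captions), compute_uv=False)
    ratio = s.max() / s.sum()
    return float(-math.log(ratio) / math.log(m))


def cosine_kernel(captions, ridge=1e-6):
    """Cosine-similarity kernel (assumption: used here instead of CIDEr)."""
    mat = bow_matrix(captions)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    # A small ridge keeps the kernel positive definite for duplicate captions.
    return mat @ mat.T + ridge * np.eye(len(captions))


def greedy_logdet_select(kernel, k):
    """Greedily grow a subset that maximizes log det of the kernel submatrix."""
    selected = []
    for _ in range(k):
        best, best_val = None, -math.inf
        for i in range(kernel.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            val = logdet if sign > 0 else -math.inf
            if val > best_val:
                best, best_val = i, val
        if best is None:
            break
        selected.append(best)
    return selected
```

Under these assumptions, a set of identical captions scores about 0 and a set of captions with no shared words scores about 1, while the greedy selection avoids picking two duplicate captions when a dissimilar one is available.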
Key words
Measurement, Semantics, Learning (artificial intelligence), Vegetation, Legged locomotion, Training, Computational modeling, Image captioning, diverse captions, reinforcement learning, policy gradient, adversarial training, diversity metric