Densifying Supervision for Fine-Grained Visual Comparisons

International Journal of Computer Vision(2020)

引用 0|浏览69
暂无评分
摘要
Detecting subtle differences in visual attributes requires inferring which of two images exhibits a property more, e.g., which face is smiling slightly more, or which shoe is slightly more sporty. While valuable for applications ranging from biometrics to online shopping, fine-grained attributes are challenging to learn. Unlike traditional recognition tasks, the supervision is inherently comparative. Thus, the space of all possible training comparisons is vast, and learning algorithms face a sparsity of supervision problem: it is difficult to curate adequate subtly different image pairs for each attribute of interest. We propose to overcome this problem by densifying the space of training images with attribute-conditioned image generation. The main idea is to create synthetic but realistic training images exhibiting slight modifications of attribute(s), obtain their comparative labels from human annotators, and use the labeled image pairs to augment real image pairs when training ranking functions for the attributes. We introduce two variants of our idea. The first passively synthesizes training images by “jittering” individual attributes in real training images. Building on this idea, our second model actively synthesizes training image pairs that would confuse the current attribute model, training both the attribute ranking functions and a generation controller simultaneously in an adversarial manner. For both models, we employ a conditional Variational Autoencoder (CVAE) to perform image synthesis. We demonstrate the effectiveness of bootstrapping imperfect image generators to counteract supervision sparsity in learning-to-rank models. Our approach yields state-of-the-art performance for challenging datasets from two distinct domains.
更多
查看译文
关键词
Fine-grained,Ranking,Image generation,Relative attributes
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要