Modeling Caption Diversity in Contrastive Vision-Language Pretraining
arXiv (2024)
Abstract
There are a thousand ways to caption an image. Contrastive Language-Image
Pretraining (CLIP), on the other hand, works by mapping an image and its caption
to a single vector, limiting how well CLIP-like models can represent the
diverse ways to describe an image. In this work, we introduce Llip, Latent
Language Image Pretraining, which models the diversity of captions that could
match an image. Llip's vision encoder outputs a set of visual features that are
mixed into a final representation by conditioning on information derived from
the text. We show that Llip outperforms non-contextualized baselines like CLIP
and SigLIP on a variety of tasks, even with large-scale encoders. Llip improves
zero-shot classification by an average of 2.9% across zero-shot classification
benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot
top-1 accuracy of 83.5% on ImageNet, outperforming a similarly sized CLIP by
1.4%. We also demonstrate an improvement of 6.0% on zero-shot retrieval on
MS-COCO. We provide a comprehensive analysis of the components introduced by the
method and demonstrate that Llip leads to richer visual representations.
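
The text-conditioned mixing described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the module name, the single-query cross-attention form, and the dimensions below are illustrative assumptions about how a caption embedding could weight a set of visual features into one contextualized vector.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedPooling(nn.Module):
    # Hypothetical module: mixes a set of visual features into one vector,
    # with mixing weights derived from the caption's text embedding.
    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # text feature -> attention query
        self.key_proj = nn.Linear(dim, dim)    # visual features -> attention keys
        self.scale = dim ** -0.5

    def forward(self, visual_tokens: torch.Tensor, text_feature: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, K, dim) set of features from the vision encoder
        # text_feature:  (batch, dim)    pooled feature from the text encoder
        q = self.query_proj(text_feature).unsqueeze(1)                        # (batch, 1, dim)
        k = self.key_proj(visual_tokens)                                      # (batch, K, dim)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)      # (batch, 1, K)
        mixed = (attn @ visual_tokens).squeeze(1)                             # (batch, dim)
        return F.normalize(mixed, dim=-1)  # unit-norm embedding for a contrastive objective

# Usage with made-up shapes: 2 images, 64 visual features each, 512-dim embeddings.
pool = TextConditionedPooling(dim=512)
visual_tokens = torch.randn(2, 64, 512)
caption_features = torch.randn(2, 512)
image_embedding = pool(visual_tokens, caption_features)
print(image_embedding.shape)  # torch.Size([2, 512])

Because the mixing weights depend on the caption embedding, the same image yields a different contextualized representation for each caption it is paired with, in contrast to CLIP's single fixed image vector.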