PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts
arXiv (2023)
Abstract
Vision-language models like CLIP are widely used in zero-shot image
classification due to their ability to understand various visual concepts and
natural language descriptions. However, how to fully leverage CLIP's
unprecedented human-like understanding capabilities to achieve better
performance is still an open question. This paper draws inspiration from the
human visual perception process: when classifying an object, humans first infer
contextual attributes (e.g., background and orientation), which help separate
the foreground object from the background, and then classify the object based
on this information. Inspired by this process, we observe that providing CLIP with
contextual attributes improves zero-shot image classification and mitigates
reliance on spurious features. We also observe that CLIP itself can reasonably
infer the attributes from an image. With these observations, we propose a
training-free, two-step zero-shot classification method, PerceptionCLIP. Given
an image, it first infers contextual attributes (e.g., background) and then
performs object classification conditioning on them. Our experiments show that
PerceptionCLIP achieves better generalization, group robustness, and
interpretability. Our code is available at
https://github.com/umd-huang-lab/perceptionCLIP
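
To make the two-step procedure concrete, the following is a minimal sketch of infer-then-condition zero-shot classification using Hugging Face's transformers CLIP interface. The prompt templates, class names, background attribute values, and image path are illustrative placeholders, and marginalizing over the inferred attribute distribution is one plausible reading of "conditioning"; see the linked repository for the authors' actual implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "cat"]                              # hypothetical label set
backgrounds = ["on grass", "indoors", "on a street"]  # hypothetical contextual attribute values

image = Image.open("example.jpg")  # hypothetical input image

# Step 1: use CLIP itself to infer the contextual attribute (here: background)
# by scoring the image against attribute-only prompts.
attr_prompts = [f"a photo of an object {b}" for b in backgrounds]
inputs = processor(text=attr_prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    attr_logits = model(**inputs).logits_per_image  # shape: (1, num_backgrounds)
attr_probs = attr_logits.softmax(dim=-1)

# Step 2: classify conditioned on the contextual attribute, marginalizing over
# the inferred attribute distribution: p(y|x) ~= sum_z p(z|x) * p(y|x, z).
class_probs = torch.zeros(len(classes))
for j, b in enumerate(backgrounds):
    class_prompts = [f"a photo of a {c} {b}" for c in classes]
    inputs = processor(text=class_prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        class_logits = model(**inputs).logits_per_image  # shape: (1, num_classes)
    class_probs += attr_probs[0, j] * class_logits.softmax(dim=-1)[0]

print(f"predicted class: {classes[class_probs.argmax().item()]}")
```

A single hard assignment (taking the argmax background and classifying under that one attribute) is a simpler variant of the same idea; the soft marginalization above keeps the prediction robust when the attribute inference is uncertain.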