LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions
CoRR (2023)
Abstract
Vision-language models (VLMs) offer a promising paradigm for image
classification by comparing the similarity between images and class embeddings.
A critical challenge lies in crafting precise textual representations for class
names. While previous studies have leveraged recent advancements in large
language models (LLMs) to enhance these descriptors, their outputs often suffer
from ambiguity and inaccuracy. We attribute this to two primary factors: 1) the
reliance on single-turn textual interactions with LLMs, leading to a mismatch
between generated text and visual concepts for VLMs; 2) the neglect of
inter-class relationships, resulting in descriptors that fail to differentiate
similar classes effectively. In this paper, we propose a novel framework that
integrates LLMs and VLMs to find the optimal class descriptors. Our
training-free approach develops an LLM-based agent with an evolutionary
optimization strategy to iteratively refine class descriptors. We demonstrate
that our optimized descriptors are of high quality and effectively improve
classification accuracy on a wide range of benchmarks. Additionally, these
descriptors offer explainable and robust features, boosting performance across
various backbone models and complementing fine-tuning-based methods.
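
To make the two ideas in the abstract concrete (classification by image-descriptor similarity, and iterative refinement of descriptors), the following is a minimal sketch, not the authors' implementation. It assumes that images and descriptors are already embedded as plain vectors; in practice a VLM's image and text encoders would produce them. The names `classify`, `accuracy`, `evolve`, and `jitter_propose` are hypothetical, and `jitter_propose` is only a random stand-in for the LLM-based agent that would propose new textual descriptors.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(image_emb, class_descriptors):
    """Pick the class whose descriptor embeddings are most similar on average."""
    scores = {
        name: sum(cosine(image_emb, d) for d in descs) / len(descs)
        for name, descs in class_descriptors.items()
    }
    return max(scores, key=scores.get)


def accuracy(images, labels, class_descriptors):
    """Fraction of images assigned to their labeled class."""
    hits = sum(classify(x, class_descriptors) == y for x, y in zip(images, labels))
    return hits / len(labels)


def jitter_propose(descriptors, rng):
    """Hypothetical stand-in for the LLM agent: perturb one descriptor vector."""
    new = {name: [list(d) for d in descs] for name, descs in descriptors.items()}
    name = rng.choice(sorted(new))
    vec = rng.choice(new[name])
    vec[rng.randrange(len(vec))] += rng.uniform(-0.5, 0.5)
    return new


def evolve(images, labels, class_descriptors, propose, rounds=20, seed=0):
    """Keep a candidate descriptor set only if it does not lower accuracy."""
    rng = random.Random(seed)
    best = class_descriptors
    best_acc = accuracy(images, labels, best)
    for _ in range(rounds):
        candidate = propose(best, rng)
        acc = accuracy(images, labels, candidate)
        if acc >= best_acc:
            best, best_acc = candidate, acc
    return best, best_acc
```

Because a candidate is accepted only when its accuracy is at least as high as the current best, the returned accuracy never falls below that of the initial descriptors. Scoring all classes jointly is one simple way to reflect inter-class relationships: a descriptor change is kept only if it helps separate classes on the labeled images as a whole.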