SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
CVPR 2024
Abstract
Recent trends in Large Vision Language Models (LVLMs) research have been
increasingly focusing on advancing beyond general image understanding towards
more nuanced, object-level referential comprehension. In this paper, we present
and delve into the self-consistency capability of LVLMs, a crucial aspect that
reflects the models' ability to both generate informative captions for specific
objects and subsequently utilize these captions to accurately re-identify the
objects in a closed-loop process. This capability directly reflects the
precision and reliability of fine-grained vision-language understanding. Our
findings reveal that the self-consistency level of existing LVLMs falls short
of expectations, posing limitations on their practical applicability and
potential. To address this gap, we introduce a novel fine-tuning paradigm named
Self-Consistency Tuning (SC-Tune). It features the synergistic learning of a
cyclic describer-locator system. This paradigm is not only data-efficient but
also exhibits generalizability across multiple LVLMs. Through extensive
experiments, we demonstrate that SC-Tune significantly elevates performance
across a spectrum of object-level vision-language benchmarks and maintains
competitive or improved performance on image-level vision-language benchmarks.
Both our model and code will be publicly available at
https://github.com/ivattyue/SC-Tune.
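
The closed-loop notion of self-consistency described above (caption an object, then use that caption to re-identify it) can be illustrated with a small sketch. This is not the authors' evaluation code: the `describe` and `locate` callables, the `(x1, y1, x2, y2)` box format, and the IoU threshold of 0.5 are all assumptions made for illustration. An object is counted as consistent if the box recovered from its own caption overlaps the original box sufficiently.

```python
from typing import Any, Callable, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def self_consistency_rate(
    image: Any,
    boxes: Sequence[Box],
    describe: Callable[[Any, Box], str],   # hypothetical describer: box -> caption
    locate: Callable[[Any, str], Box],     # hypothetical locator: caption -> box
    iou_threshold: float = 0.5,            # assumed threshold, not from the paper
) -> float:
    """Fraction of objects re-identified from their own generated caption."""
    if not boxes:
        return 0.0
    hits = 0
    for box in boxes:
        caption = describe(image, box)      # step 1: describe the object
        predicted = locate(image, caption)  # step 2: locate it from the caption
        if iou(box, predicted) >= iou_threshold:
            hits += 1
    return hits / len(boxes)
```

In practice both callables would wrap the same LVLM prompted in two different modes; the sketch only fixes the shape of the loop, not how either role is implemented or trained.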