Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers
arxiv(2024)
Abstract
An effective method for combining frozen large language models (LLMs) and
visual encoders involves a resampler module that creates a 'visual prompt'
that is provided to the LLM along with the textual prompt. While this
approach has enabled impressive performance across many coarse-grained tasks
like image captioning and visual question answering, more fine-grained tasks
that require spatial understanding have not been thoroughly examined. In this
paper, we use diagnostic classifiers to measure the extent to which
the visual prompt produced by the resampler encodes spatial information. Our
results show that this information is largely absent from the resampler output
when the resampler is kept frozen while the classifiers are trained. However, when the
resampler and classifier are trained jointly, we observe a significant
performance boost. This shows that the compression achieved by the resamplers
can in principle encode the requisite spatial information, but that more
object-aware objectives are needed at the pretraining stage to facilitate this
capability.
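
As a rough illustration of the diagnostic-classifier idea, the sketch below trains a small logistic-regression probe on fixed (frozen) feature vectors and checks whether a spatial label can be read out of them. It is a toy under stated assumptions, not the paper's setup: the features are synthetic stand-ins for a resampler's visual prompt, the task (is object A to the left of object B?) is invented for the example, and all names (`train_probe`, `probe_accuracy`, `feats_spatial`, `feats_noise`) are hypothetical.

```python
import numpy as np


def train_probe(features, labels, lr=1.0, epochs=500):
    """Fit a logistic-regression probe with plain gradient descent.

    Only the probe weights are updated; the features stay fixed,
    mimicking a frozen upstream module.
    """
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels  # gradient of the mean log-loss w.r.t. logits
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(features, labels, w, b):
    """Fraction of examples whose label the probe predicts correctly."""
    preds = (features @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())


rng = np.random.default_rng(0)

# Toy spatial task (assumption): x-coordinates of two objects in [0, 1];
# the label is 1 when object A lies to the left of object B.
positions = rng.uniform(0.0, 1.0, size=(1500, 2))
labels = (positions[:, 0] < positions[:, 1]).astype(float)

# Two hypothetical frozen representations of the same scenes:
feats_spatial = positions                       # keeps the positions
feats_noise = rng.normal(size=(1500, 10))       # carries no positional signal

train, test = slice(0, 1000), slice(1000, 1500)

w_s, b_s = train_probe(feats_spatial[train], labels[train])
acc_spatial = probe_accuracy(feats_spatial[test], labels[test], w_s, b_s)

w_n, b_n = train_probe(feats_noise[train], labels[train])
acc_noise = probe_accuracy(feats_noise[test], labels[test], w_n, b_n)

print(f"probe on position-preserving features: {acc_spatial:.2f}")
print(f"probe on position-free features:       {acc_noise:.2f}")
```

In this toy, a probe that scores near chance suggests the frozen representation does not expose the spatial label in a linearly readable form. The jointly trained condition described in the abstract would correspond to also updating the module that produces the features, which this sketch does not do.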