Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers

NAACL (Short Papers) (2024)

Abstract
An effective method for combining frozen large language models (LLMs) and visual encoders involves a resampler module that creates a "visual prompt", which is provided to the LLM along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks such as image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use diagnostic classifiers to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when the resampler is kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.
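
The frozen-probe setting described in the abstract can be illustrated with a minimal sketch. It is not the paper's setup: the features are synthetic stand-ins for a resampler's visual prompt, the binary label is a hypothetical spatial relation (for example "left of" versus "right of"), and the names `train_linear_probe` and `probe_accuracy` are made up for this example. The features are never updated, so the probe's held-out accuracy reflects only what is linearly recoverable from them.

```python
import numpy as np


def train_linear_probe(features, labels, lr=0.5, epochs=300):
    """Fit a logistic-regression probe by gradient descent.

    The features are treated as frozen: only the probe's weights change.
    """
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels              # gradient of the mean log-loss w.r.t. logits
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples the probe classifies correctly."""
    preds = (features @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())


# Synthetic data (an assumption for illustration, not results from the paper).
rng = np.random.default_rng(0)
n, d = 600, 16
labels = rng.integers(0, 2, size=n).astype(float)

# Case 1: the "visual prompt" encodes the spatial label along one dimension.
informative = rng.normal(size=(n, d))
informative[:, 0] += 4.0 * (labels - 0.5)

# Case 2: the "visual prompt" carries no information about the label.
uninformative = rng.normal(size=(n, d))

train, test = slice(0, 400), slice(400, n)

w, b = train_linear_probe(informative[train], labels[train])
acc_informative = probe_accuracy(w, b, informative[test], labels[test])

w, b = train_linear_probe(uninformative[train], labels[train])
acc_uninformative = probe_accuracy(w, b, uninformative[test], labels[test])

print(f"probe accuracy, label encoded in features: {acc_informative:.2f}")
print(f"probe accuracy, label absent from features: {acc_uninformative:.2f}")
```

A probe that stays near chance on held-out data suggests the information is not readily decodable from the frozen representation. The jointly trained setting mentioned in the abstract would additionally update the module that produces the features, which this sketch does not model.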