Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception
CVPR 2024
Abstract
Multimodal Large Language Models (MLLMs) leverage Large Language Models as a
cognitive framework for diverse visual-language tasks. Recent efforts have been
made to equip MLLMs with visual perceiving and grounding capabilities. However,
there still remains a gap in providing fine-grained pixel-level perceptions and
extending interactions beyond text-specific inputs. In this work, we propose
AnyRef, a general MLLM that can generate pixel-wise object
perceptions and natural language descriptions from multi-modality references,
such as texts, boxes, images, or audio. This innovation empowers users with
greater flexibility to engage with the model beyond textual and regional
prompts, without modality-specific designs. Through our proposed refocusing
mechanism, the generated grounding output is guided to better focus on the
referenced object, implicitly incorporating additional pixel-level supervision.
This simple modification utilizes attention scores generated during the
inference of the LLM, eliminating the need for extra computations while exhibiting
performance enhancements in both grounding masks and referring expressions.
With only publicly available training data, our model achieves state-of-the-art
results across multiple benchmarks, including diverse modality referring
segmentation and region-level referring expression generation.
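
The abstract describes the refocusing mechanism only at a high level: attention scores already produced during LLM inference are reused to steer the grounding output toward the referenced object. The sketch below is one plausible reading of that idea, not the authors' implementation. It assumes a single grounding-token embedding that is nudged toward an attention-weighted average of the other token embeddings. The function name `refocus`, the `scale` parameter, and the additive update rule are all hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def refocus(
    obj_embedding: Sequence[float],
    token_embeddings: Sequence[Sequence[float]],
    attention_scores: Sequence[float],
    scale: float = 0.1,
) -> List[float]:
    """Illustrative refocusing step (an assumption, not the paper's code).

    obj_embedding:    embedding of the hypothetical grounding token.
    token_embeddings: embeddings of the tokens it attended to.
    attention_scores: non-negative attention scores over those tokens,
                      reused from the LLM forward pass (no extra compute).
    scale:            hypothetical strength of the refocusing term.
    """
    if len(token_embeddings) != len(attention_scores):
        raise ValueError("need exactly one attention score per token embedding")

    total = sum(attention_scores)
    if total <= 0:
        # Nothing to focus on: return the embedding unchanged.
        return list(obj_embedding)

    # Normalize the scores so they form a weighting over the tokens.
    weights = [score / total for score in attention_scores]

    # Attention-weighted average of the token embeddings, per dimension.
    dim = len(obj_embedding)
    focus = [
        sum(w * emb[d] for w, emb in zip(weights, token_embeddings))
        for d in range(dim)
    ]

    # Shift the grounding embedding toward the highly attended tokens.
    return [o + scale * f for o, f in zip(obj_embedding, focus)]


if __name__ == "__main__":
    obj = [1.0, 0.0]
    tokens = [[1.0, 0.0], [0.0, 2.0]]
    scores = [0.25, 0.75]
    print(refocus(obj, tokens, scores, scale=0.5))
```

Because the weights come from scores the model has already computed, the only added work in this sketch is one weighted sum per grounding token, which matches the abstract's claim that no extra computation is required.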