GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
CVPR 2024
Abstract
Most multimodal large language models (MLLMs) learn language-to-object
grounding through causal language modeling where grounded objects are captured
by bounding boxes as sequences of location tokens. This paradigm lacks
pixel-level representations that are important for fine-grained visual
understanding and diagnosis. In this work, we introduce GROUNDHOG, an MLLM
developed by grounding Large Language Models to holistic segmentation.
GROUNDHOG incorporates a masked feature extractor and converts extracted
features into visual entity tokens for the MLLM backbone, which then connects
groundable phrases to unified grounding masks by retrieving and merging the
entity masks. To train GROUNDHOG, we carefully curated M3G2, a grounded visual
instruction tuning dataset with Multi-Modal Multi-Grained Grounding, by
harvesting a collection of segmentation-grounded datasets with rich
annotations. Our experimental results show that GROUNDHOG achieves superior
performance on various language grounding tasks without task-specific
fine-tuning, and significantly reduces object hallucination. GROUNDHOG also
demonstrates better grounding towards complex forms of visual input and
provides easy-to-understand diagnosis in failure cases.
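The abstract describes a retrieve-and-merge mechanism: the MLLM scores each visual entity token against a groundable phrase, then unions the masks of the retrieved entities into one grounding mask. The following is a minimal sketch of that step, not the authors' code; the function name, tensor shapes, and the 0.5 retrieval threshold are illustrative assumptions.

```python
# Hypothetical sketch of the retrieve-and-merge step from the abstract:
# given per-entity masks (e.g., from a masked feature extractor) and the
# MLLM's phrase-to-entity affinity scores, select matching entities and
# union their masks into a single grounding mask.
import torch

def merge_entity_masks(
    entity_masks: torch.Tensor,   # (N, H, W) binary masks, one per entity
    phrase_scores: torch.Tensor,  # (N,) phrase-to-entity affinity in [0, 1]
    threshold: float = 0.5,       # assumed retrieval cutoff
) -> torch.Tensor:
    """Return a single (H, W) grounding mask for one groundable phrase."""
    selected = phrase_scores > threshold            # retrieve matching entities
    if not selected.any():                          # fall back to best entity
        selected = phrase_scores == phrase_scores.max()
    return entity_masks[selected].any(dim=0)        # union of selected masks

# Usage with dummy data: three candidate entities on a 4x4 grid.
masks = torch.zeros(3, 4, 4, dtype=torch.bool)
masks[0, :2, :2] = True
masks[1, 2:, 2:] = True
masks[2, 0, 3] = True
scores = torch.tensor([0.9, 0.7, 0.1])
print(merge_entity_masks(masks, scores).int())      # entities 0 and 1 merged
```

The union (logical OR) reflects the "unified grounding masks" wording: a phrase such as "the two dogs" can ground to several entity masks at once, so the merged mask covers all retrieved entities rather than a single best match.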