Empowering Segmentation Ability to Multi-modal Large Language Models
CoRR (2024)
Abstract
Multi-modal large language models (MLLMs) can understand image-language
prompts and demonstrate impressive reasoning ability. In this paper, we extend
MLLMs' output by empowering MLLMs with the segmentation ability. The extended
MLLMs can both output language responses to the image-language prompts and
segment the regions that the complex question or query in the language prompts
focuses on. To this end, the existing work, LISA, enlarges the original word
embeddings with an additional segment token and fine-tunes dialogue generation
and query-focused segmentation together, where the feature of the segment token
is used to prompt the segment-anything model. Although this approach achieves
superior segmentation performance, we observe that the dialogue ability decreases by a
large margin compared to the original MLLMs. To maintain the original MLLMs'
dialogue ability, we propose a novel MLLMs framework, coined as LLaVASeg, which
leverages a chain-of-thought prompting strategy to instruct the MLLMs to
segment the target region queried by the user. The MLLMs are first prompted to
reason out a simple description of the target region from the complicated
user query, and then to extract the visual attributes of the target region
according to the MLLMs' understanding of the image. These visual attributes, such as
color and relative locations, are utilized to prompt the downstream
segmentation model. Experiments show that the proposed method preserves the
original dialogue ability and equips the MLLMs with a strong reasoning
segmentation ability. The code is available at
https://github.com/YuqiYang213/LLaVASeg.
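The staged chain-of-thought prompting described above can be sketched as a small pipeline. This is a minimal, hypothetical illustration: the function names, prompt templates, and stub responses are assumptions for readability, not the paper's actual implementation; a real system would call an MLLM (e.g. LLaVA) and a text-promptable segmentation model in place of the stubs.

```python
def mllm_answer(image, prompt):
    # Stub standing in for an MLLM call; returns canned answers so the
    # pipeline is runnable end to end. A real system would query the model.
    if "name the target" in prompt:
        return "the red mug to the left of the laptop"
    if "visual attributes" in prompt:
        return "color: red; location: left of the laptop; shape: cylindrical"
    return "unknown"

def segment_by_text(image, attribute_prompt):
    # Stub for a downstream text-promptable segmentation model;
    # returns a placeholder mask instead of a real prediction.
    return {"prompt": attribute_prompt, "mask": [[0, 1], [1, 0]]}

def reasoning_segmentation(image, user_query):
    # Stage 1: reduce the complex user query to a simple target description.
    target = mllm_answer(
        image, f"Given the question '{user_query}', name the target region.")
    # Stage 2: extract visual attributes (color, relative location, ...)
    # grounded in the MLLM's understanding of the image.
    attributes = mllm_answer(
        image, f"List the visual attributes of {target} in the image.")
    # Stage 3: prompt the segmentation model with the extracted attributes,
    # leaving the MLLM's dialogue weights untouched.
    return segment_by_text(image, attributes)

result = reasoning_segmentation(
    "image.jpg", "Which object would I drink coffee from?")
print(result["prompt"])
```

Because segmentation is driven purely by prompting rather than by fine-tuning a new segment token, the MLLM's original dialogue behavior is left unchanged, which is the key contrast with LISA drawn in the abstract.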