PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
CoRR(2024)
Abstract
PSALM extends the Large Multi-modal Model (LMM) to address the challenges of
segmentation tasks. Because an LMM on its own produces only textual output,
PSALM incorporates a mask decoder and a well-designed input schema to handle a
variety of segmentation tasks. This
schema includes images, task instructions, conditional prompts, and mask
tokens, which enable the model to generate and classify segmentation masks
effectively. The flexible design of PSALM supports joint training across
multiple datasets and tasks, leading to improved performance and task
generalization. PSALM achieves superior results on several benchmarks, such as
RefCOCO/RefCOCO+/RefCOCOg, COCO Panoptic Segmentation, and COCO-Interactive,
and further exhibits zero-shot capabilities on unseen tasks, such as
open-vocabulary segmentation, generalized referring expression segmentation and
video object segmentation, making a significant step towards a GPT moment in
computer vision. Through extensive experiments, PSALM demonstrates its
potential to transform image segmentation by leveraging the robust visual
understanding capabilities of LMMs, much as large models have done for natural
language processing. Code and models are available at https://github.com/zamling/PSALM.
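
The four-part input schema named in the abstract (images, task instructions, conditional prompts, mask tokens) can be pictured with a minimal sketch. Everything below is hypothetical and is not taken from the PSALM code base: the class name `SegmentationInput`, the field names, the placeholder token strings, and the ordering of the parts are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentationInput:
    """Hypothetical container for the four input parts named in the abstract."""
    image_tokens: List[str]        # stand-ins for visual features of the image
    instruction: str               # task instruction text
    condition_prompts: List[str]   # e.g. category names or a referring sentence
    num_mask_tokens: int = 4       # number of mask-token slots (assumed value)

    def to_sequence(self) -> List[str]:
        """Concatenate the four parts into one flat sequence (assumed order)."""
        mask_tokens = [f"<mask_{i}>" for i in range(self.num_mask_tokens)]
        return (
            self.image_tokens
            + self.instruction.split()
            + self.condition_prompts
            + mask_tokens
        )


# Illustrative referring-segmentation input with made-up placeholder tokens.
example = SegmentationInput(
    image_tokens=["<img_0>", "<img_1>"],
    instruction="segment the referred object",
    condition_prompts=["the dog on the left"],
    num_mask_tokens=2,
)
sequence = example.to_sequence()
```

In this sketch only the conditional prompt would change between tasks (a list of category names, a referring sentence, or an interactive cue), which is one way to read the abstract's claim that a single schema supports joint training across tasks.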