Deep Instruction Tuning for Segment Anything Model
CoRR (2024)
Abstract
The Segment Anything Model (SAM) has recently shown powerful and versatile capabilities on
(un)conditional image segmentation tasks. Although SAM can support
various segmentation prompts, we note that, compared to point- and box-guided
segmentation, it performs much worse on text-instructed tasks. We argue that
deep text instruction tuning is key to mitigating this shortcoming, which is caused by the
shallow fusion scheme in its default light-weight mask decoder. In this paper,
we propose two deep instruction tuning (DIT) methods: one
end-to-end and the other layer-wise. With these tuning methods, we can
regard the image encoder of SAM as a stand-alone vision-language learner,
rather than building another deep fusion branch. Extensive experiments on three
highly competitive benchmark datasets of referring image segmentation show that
a simple end-to-end DIT improves SAM by a large margin, while layer-wise DIT
further boosts the performance to state-of-the-art. Our code is anonymously
released at: https://github.com/wysnzzzz/DIT.