Improving Referring Image Segmentation using Vision-Aware Text Features
CoRR (2024)
Abstract
Referring image segmentation is a challenging task that involves generating
pixel-wise segmentation masks based on natural language descriptions. Existing
methods have relied mostly on visual features to generate the segmentation
masks while treating text features as supporting components. This over-reliance
on visual features can lead to suboptimal results, especially in complex
scenarios where text prompts are ambiguous or context-dependent. To overcome
these challenges, we present VATEX, a novel framework that improves referring
image segmentation by enhancing object and context understanding with Vision-Aware
Text Features. Our method uses CLIP to derive a CLIP Prior that
integrates an object-centric visual heatmap with the text description, which
serves as the initial query in a DETR-based architecture for the segmentation task.
Furthermore, observing that there are multiple ways to describe an instance
in an image, we enforce feature similarity between text variations referring to
the same visual input through two components: a novel Contextual Multimodal Decoder
that turns text embeddings into vision-aware text features, and a Meaning
Consistency Constraint that further ensures a coherent and consistent
interpretation of language expressions, grounded in the context understanding
obtained from the image. Our method achieves significant performance improvements on
three benchmark datasets: RefCOCO, RefCOCO+, and G-Ref. Code is available at:
https://nero1342.github.io/VATEX_RIS.
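The CLIP Prior described above can be illustrated with a minimal sketch: score each image patch embedding against the sentence embedding, softmax the scores into a heatmap, and pool the patches with those weights to form an initial query. This is an assumption-laden toy using random NumPy vectors in place of real CLIP features; the paper's actual construction may differ in detail.

```python
import numpy as np

def clip_prior_heatmap(patch_feats, text_feat):
    """Object-centric heatmap: softmaxed cosine similarity between each
    image patch embedding and the sentence embedding (a sketch of the
    CLIP Prior idea; real CLIP encoders are assumed, not used here)."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sim = p @ t                                # (num_patches,) similarities
    heat = np.exp(sim) / np.exp(sim).sum()     # softmax -> heatmap weights
    return heat

# Toy example: 16 patches, 8-dim embeddings (stand-ins for CLIP's 512-d).
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
text = rng.normal(size=8)
heat = clip_prior_heatmap(patches, text)
# Heatmap-weighted patch features as a candidate initial query vector.
query = heat @ patches
```

In a DETR-style decoder, such a pooled vector would seed the object query instead of a learned embedding, which is how the abstract describes the CLIP Prior being used.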
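The Meaning Consistency Constraint enforces similar features for different expressions that refer to the same instance. A minimal sketch of one plausible form, a cosine-distance loss between the two expressions' features, is below; the paper's exact loss and feature spaces are assumptions here.

```python
import numpy as np

def meaning_consistency_loss(feat_a, feat_b):
    """Pull together features of two text variations referring to the
    same visual input: 1 - cosine similarity, in [0, 2].
    (A hypothetical formulation; the paper's loss may differ.)"""
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    return 1.0 - float(a @ b)

# Identical features incur zero loss; opposed features incur the maximum.
same = meaning_consistency_loss(np.ones(4), np.ones(4))
opposed = meaning_consistency_loss(np.array([1.0, 0.0]),
                                   np.array([-1.0, 0.0]))
```

Minimizing this term over paraphrase pairs pushes the vision-aware text features produced by the Contextual Multimodal Decoder toward a shared representation per instance.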