Area-keywords cross-modal alignment for referring image segmentation

Huiyong Zhang, Lichun Wang, Shuang Li, Kai Xu, Baocai Yin

Neurocomputing (2024)

Abstract
Referring image segmentation aims to segment the instance corresponding to a given language description, which requires aligning information from two modalities. Existing approaches usually align the cross-modal information based on different forms of feature units, such as pixel-sentence, pixel-word, and patch-word. The problem is that the semantic information embodied by these feature units may be mismatched; for example, the semantics carried by a single pixel is only part of the semantics of a sentence. When such inconsistent information is used to model the relationship between feature units from the two modalities, the obtained cross-modal relationships are imprecise, resulting in inaccurate cross-modal features. In this paper, we propose to generate scalable area and keyword features to ensure that the feature units from the two modalities have comparable semantic granularity. Meanwhile, the scalable features provide a sparse representation of image and text, which reduces the computational complexity of computing cross-modal features. In addition, we design a novel multi-source-driven dynamic convolution that maps the area-keywords cross-modal features back to the image to predict the mask. Experimental results on three benchmark datasets demonstrate that our framework achieves advanced performance while greatly reducing computational cost.
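The abstract does not give implementation details, but the following minimal PyTorch sketch illustrates one plausible reading of the pipeline: dense pixel and word features are pooled into small sets of "area" and "keyword" units of comparable granularity, aligned by cross-attention over the sparse sets, and a per-sample dynamic convolution kernel generated from the fused features is mapped back onto the image to predict the mask. All names (AreaKeywordAlign, num_areas, num_keywords, kernel_head) are hypothetical and not from the paper.

```python
# Minimal sketch (not the authors' code) of area-keywords cross-modal
# alignment with a dynamic-convolution mask head.
import torch
import torch.nn as nn

class AreaKeywordAlign(nn.Module):
    def __init__(self, dim=256, num_areas=16, num_keywords=8):
        super().__init__()
        # Learnable queries that pool dense inputs into small sets of
        # "area" (visual) and "keyword" (textual) units.
        self.area_queries = nn.Parameter(torch.randn(num_areas, dim))
        self.keyword_queries = nn.Parameter(torch.randn(num_keywords, dim))
        self.pool_img = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.pool_txt = nn.MultiheadAttention(dim, 8, batch_first=True)
        # Cross-modal attention between the sparse area and keyword sets:
        # num_areas x num_keywords pairs instead of HW x L, hence cheaper.
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        # Generator for a per-sample 1x1 dynamic-convolution kernel.
        self.kernel_head = nn.Linear(dim, dim)

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, C, H, W) pixel features; txt_feat: (B, L, C) word features
        B, C, H, W = img_feat.shape
        pixels = img_feat.flatten(2).transpose(1, 2)            # (B, HW, C)
        areas, _ = self.pool_img(self.area_queries.expand(B, -1, -1),
                                 pixels, pixels)                # (B, num_areas, C)
        keywords, _ = self.pool_txt(self.keyword_queries.expand(B, -1, -1),
                                    txt_feat, txt_feat)         # (B, num_keywords, C)
        # Area-keywords alignment: areas attend to keywords.
        fused, _ = self.cross(areas, keywords, keywords)        # (B, num_areas, C)
        # Dynamic convolution: pool the fused features into a 1x1 kernel
        # and apply it to the image features to predict mask logits.
        kernel = self.kernel_head(fused.mean(dim=1))            # (B, C)
        mask = torch.einsum('bc,bchw->bhw', kernel, img_feat)   # (B, H, W)
        return mask.unsqueeze(1)                                # (B, 1, H, W)

# Usage: 256-channel features from a vision backbone and a text encoder.
model = AreaKeywordAlign()
mask_logits = model(torch.randn(2, 256, 30, 30), torch.randn(2, 20, 256))
print(mask_logits.shape)  # torch.Size([2, 1, 30, 30])
```

The point of the sparse representation is visible in the attention costs: pixel-word attention scales with HW x L, while the area-keywords step above scales with num_areas x num_keywords, a constant independent of image and sentence length.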
Keywords
Referring image segmentation, Cross-modal alignment, Dynamic convolution