Improving Visual Grounding with Multi-Scale Discrepancy Information and Centralized-Transformer

Jie Wu, Chunlei Wu, Fuyan Wang, Leiquan Wang, Yiwei Wei

Expert Systems with Applications (2024)


Abstract

Visual grounding associates linguistic expressions with the corresponding objects or regions in an image. Current methods extract multi-scale features from the image and establish cross-modal relationships through transformers. However, directly combining multi-scale features often introduces redundant information, which weakens the complementarity between scales. Furthermore, using transformers to obtain compact multi-modal representations may overlook essential corner regions. In this paper, we propose a centralized-transformer network with multi-scale discrepancy information (CTMDI), which explores multi-scale difference features and performs centralized cross-modal reasoning for precise visual grounding. The multi-scale discrepancy information module computes the differences between features at different scales to capture fine-grained details while preserving an overall understanding of the visual content. To enhance cross-modal interaction, a centralized transformer simultaneously aggregates the essential local information and the global distance correlations of the multi-modal fusion features. Comprehensive experiments on three typical datasets demonstrate the superiority of CTMDI over existing approaches.
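
The abstract does not specify how the differences between scales are computed, so the following is only an illustrative sketch, not the CTMDI module itself. It assumes one plausible reading: a coarse feature map is upsampled to the resolution of a finer one, and the element-wise difference is taken, so that only the detail missing from the coarse scale remains. The names `upsample_nearest` and `scale_discrepancy`, the nearest-neighbour upsampling, and the toy values are all hypothetical.

```python
from typing import List

Grid = List[List[float]]  # a single-channel feature map, as nested lists


def upsample_nearest(grid: Grid, factor: int) -> Grid:
    """Repeat every cell `factor` times along both axes (nearest-neighbour)."""
    out: Grid = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def scale_discrepancy(fine: Grid, coarse: Grid) -> Grid:
    """Element-wise difference between a fine map and the upsampled coarse map.

    Assumption: the fine map's side length is an integer multiple of the
    coarse map's side length.
    """
    factor = len(fine) // len(coarse)
    upsampled = upsample_nearest(coarse, factor)
    return [
        [f - c for f, c in zip(fine_row, coarse_row)]
        for fine_row, coarse_row in zip(fine, upsampled)
    ]


# Toy data: the fine map agrees with the coarse map except at two cells.
fine = [
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 5.0, 2.0, 2.0],
    [3.0, 3.0, 4.0, 4.0],
    [3.0, 3.0, 4.0, 8.0],
]
coarse = [
    [1.0, 2.0],
    [3.0, 4.0],
]

diff = scale_discrepancy(fine, coarse)
for row in diff:
    print(row)
```

In this toy case the difference map is zero wherever the two scales agree and non-zero only at the two cells carrying extra fine-grained detail, which is the kind of redundancy reduction the abstract motivates. A real model would operate on multi-channel tensors and would likely use learned projections rather than a plain subtraction.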


Keywords

Visual grounding, Discrepancy information, Transformer
