Improving Visual Grounding with Multi-Scale Discrepancy Information and Centralized-Transformer
Expert Systems with Applications (2024)
Abstract
Visual grounding associates linguistic expressions with the corresponding objects or regions in an image. Current methods extract multi-scale features from the image and establish cross-modal relationships through transformers. However, directly combining multi-scale features often introduces redundant information, which diminishes the complementarity across scales. Furthermore, using transformers to acquire compact multi-modal representations may overlook essential corner regions. In this paper, we propose a centralized-transformer network with multi-scale discrepancy information (CTMDI), which explores multi-scale difference features and performs centralized cross-modal reasoning for precise visual grounding. The multi-scale discrepancy information module calculates the variations of features at different scales to capture fine-grained details while maintaining the overall understanding of the visual content. To enhance cross-modal interactions, a centralized transformer is proposed to simultaneously aggregate the local essential information and global distance correlations of multi-modal fusion features. Comprehensive experiments on three typical datasets demonstrate the superiority of CTMDI over existing approaches.
Keywords
Visual grounding, Discrepancy information, Transformer