Improving Visual Grounding with Multi-Scale Discrepancy Information and Centralized-Transformer

Expert Systems with Applications (2024)

Abstract
Visual grounding associates linguistic expressions with the corresponding objects or regions in an image. Current methods extract multi-scale features from the image and establish cross-modal relationships through transformers. However, directly combining multi-scale features often introduces redundant information, which diminishes the complementarity across different scales. Furthermore, using transformers to acquire compact multi-modal representations may overlook essential corner regions. In this paper, we propose a centralized-transformer network with multi-scale discrepancy information (CTMDI), which explores multi-scale difference features and performs centralized cross-modal reasoning for precise visual grounding. The multi-scale discrepancy information module calculates the variations of features at different scales to capture fine-grained details while maintaining the overall understanding of the visual content. To enhance cross-modal interactions, a centralized transformer is proposed to simultaneously aggregate the local essential information and global distance correlations of multi-modal fusion features. Comprehensive experiments on three typical datasets demonstrate the superiority of CTMDI over existing approaches.
Keywords
Visual grounding, Discrepancy information, Transformer