Commonsense-Guided Semantic and Relational Consistencies for Image-Text Retrieval

IEEE TRANSACTIONS ON MULTIMEDIA (2024)

Abstract
Image-text retrieval, a fundamental task in the cross-modal field, aims to explore the relationship between the visual and textual modalities. Recent methods address this task only by learning conceptual and syntactic correspondences between cross-modal fragments, but without external knowledge these correspondences inevitably contain noise. To solve this issue, we propose a novel Commonsense-Guided Semantic and Relational Consistencies (CSRC) method for image-text retrieval that simultaneously expands semantics and relations to reduce cross-modal differences, under the assumption that the semantics and relations of a true image-text pair should be consistent across the two modalities. Specifically, we first exploit commonsense knowledge to expand the specific concepts in the visual and textual graphs and optimize semantic consistency by minimizing the differences in cross-modal semantic importance. Then, we extend the same relations to cross-modal concept pairs that exhibit semantic consistency, thereby enforcing relational consistency. After that, we combine external commonsense knowledge with internal correlation to enhance the concept representations and further optimize relational consistency by regularizing the importance differences between association-enhanced concepts. Extensive experiments on two popular image-text retrieval datasets demonstrate the effectiveness of the proposed method.
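The abstract does not spell out the loss formulas, so the sketch below only illustrates the semantic-consistency idea of minimizing differences in cross-modal semantic importance. It is a minimal PyTorch sketch under assumed interfaces: the function name semantic_consistency_loss, the per-concept importance scores, and the softmax-plus-L1 formulation are illustrative stand-ins, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def semantic_consistency_loss(visual_importance, textual_importance):
    # Hypothetical semantic-consistency term: for each matched image-text
    # pair, compare the importance assigned to K aligned concepts in the
    # visual graph against the importance of the matching textual concepts.
    # visual_importance:  (B, K) scores for K visual concepts per pair
    # textual_importance: (B, K) scores for the corresponding textual concepts
    v = F.softmax(visual_importance, dim=-1)  # normalize to a distribution over concepts
    t = F.softmax(textual_importance, dim=-1)
    # L1 distance between the two importance distributions, averaged over the batch
    return (v - t).abs().sum(dim=-1).mean()

# Example: random scores for a batch of 4 pairs, each with 10 aligned concepts
loss = semantic_consistency_loss(torch.randn(4, 10), torch.randn(4, 10))

A true pair drives this term toward zero, which matches the paper's consistency assumption; a relational-consistency term would apply the same idea to the importance of concept pairs (relations) rather than individual concepts.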
Keywords
Semantics, Visualization, Correlation, Task analysis, Commonsense reasoning, Oceans, Sea measurements, Commonsense knowledge, semantic consistency, relational consistency, image-text retrieval