Multi-Modality Cross Attention Network for Image and Sentence Matching

CVPR 2020

Cited by 287 | Viewed 970

Abstract
The key to image and sentence matching is accurately measuring the visual-semantic similarity between an image and a sentence. However, most existing methods exploit only the intra-modality relationship within each modality, or only the inter-modality relationship between image regions and sentence words, for the cross-modal matching task. In contrast, in this work we propose a novel Multi-Modality Cross Attention (MMCA) Network for image and sentence matching that jointly models the intra-modality and inter-modality relationships of image regions and sentence words in a unified deep model. Within the proposed MMCA, we design a novel cross-attention mechanism that exploits not only the intra-modality relationship within each modality but also the inter-modality relationship between image regions and sentence words, allowing the two to complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks, Flickr30K and MS-COCO, demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.
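The abstract only sketches the mechanism, so below is a minimal, hypothetical PyTorch illustration of the general idea: self-attention captures intra-modality relationships among regions (or among words), while cross-attention lets each modality attend to the other. The module names, feature dimensions, residual fusion, and pooling are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of joint intra-/inter-modality attention for
# image-sentence matching. All design details here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Intra-modality: self-attention within each modality.
        self.self_attn_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Inter-modality: each modality attends to the other one.
        self.cross_attn_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # regions: (B, R, dim) image region features (e.g. from a detector)
        # words:   (B, W, dim) sentence word features (e.g. from a text encoder)
        r_intra, _ = self.self_attn_img(regions, regions, regions)
        w_intra, _ = self.self_attn_txt(words, words, words)
        # Regions query words, and words query regions.
        r_inter, _ = self.cross_attn_img(regions, words, words)
        w_inter, _ = self.cross_attn_txt(words, regions, regions)
        # Combine intra- and inter-modality views (simple residual sum here).
        img_out = regions + r_intra + r_inter
        txt_out = words + w_intra + w_inter
        # Mean-pool to global embeddings; cosine similarity scores the pair.
        img_emb = F.normalize(img_out.mean(dim=1), dim=-1)
        txt_emb = F.normalize(txt_out.mean(dim=1), dim=-1)
        return (img_emb * txt_emb).sum(dim=-1)  # (B,) similarity scores

# Usage: score a batch of 4 image-sentence pairs with toy features.
block = CrossModalAttentionBlock()
sims = block(torch.randn(4, 36, 512), torch.randn(4, 20, 512))
print(sims.shape)  # torch.Size([4])
```

In a matching setup, such pairwise similarity scores would typically be trained with a ranking (triplet) loss so that matched image-sentence pairs score higher than mismatched ones.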
Keywords
intra-modality relationship, inter-modality relationship, image regions, sentence words, cross-modal matching task, image matching methods, sentence matching methods, multi-modality cross attention network, image matching, cross-attention mechanism, Flickr30K, MS-COCO