Assorted Attention Network for Cross-Lingual Language-to-Vision Retrieval

Conference on Information and Knowledge Management (2021)

Abstract
In this paper, we tackle the cross-lingual language-to-vision (CLLV) retrieval task. Given a text query in one language, CLLV retrieval seeks the relevant images/videos from a database based on their visual content and on their captions written in another language. Because CLLV retrieval bridges both the modal gap and the language gap, it makes many international cross-modal applications feasible. To tackle CLLV retrieval, we propose an assorted attention network (A2N) that simultaneously overcomes the language gap, bridges the modal gap, and fuses features of the two modalities in an elegant and effective manner. A2N represents each text query as a set of word features, and each image/video as a set of word features from its caption in the other language together with a set of its local visual features. The relevance between a text query and an image/video is then computed by matching the query's word-feature set against the two image/video feature sets. To strengthen this matching, A2N merges the query's word features and the image/video's visual and caption word features into a single assorted set and applies self-attention over the items of that set. On one hand, the attention between the query's word features and the image/video's visual features emphasizes the important word or visual features. On the other hand, the attention between the image/video's visual features and its caption word features fuses the visual content with the textual information more effectively. Systematic experiments on four datasets demonstrate the effectiveness of the proposed A2N on the CLLV retrieval task.
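To make the assorted self-attention idea concrete, the following is a minimal PyTorch-style sketch of merging the three feature sets (query words, local visual features, caption words) into one assorted set and applying self-attention over it. The feature dimension, number of heads, mean pooling, and cosine-similarity scoring are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class AssortedAttention(nn.Module):
    """Sketch of the assorted self-attention described in the abstract:
    query word features, the image/video's local visual features, and its
    caption word features are merged into one "assorted" set so that
    cross-modal and cross-lingual interactions are modeled jointly.
    Dimensions and the scoring head are assumptions for illustration."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_words, visual_feats, caption_words):
        # Each input: (batch, num_items, dim), already projected to a shared dim.
        assorted = torch.cat([query_words, visual_feats, caption_words], dim=1)
        attended, _ = self.attn(assorted, assorted, assorted)
        # Split the attended set back into query and image/video parts,
        # pool each part, and score their relevance by cosine similarity.
        nq = query_words.size(1)
        q = attended[:, :nq].mean(dim=1)
        iv = attended[:, nq:].mean(dim=1)
        return torch.cosine_similarity(q, iv, dim=-1)

# Toy usage: batch of 2 query/image pairs with 512-d features.
model = AssortedAttention(dim=512)
score = model(torch.randn(2, 12, 512),   # query word features
              torch.randn(2, 36, 512),   # local visual features
              torch.randn(2, 15, 512))   # caption word features
print(score.shape)  # torch.Size([2])
```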