Feature Fusion Based on Transformer for Cross-modal Retrieval

Guihao Zhang, Jiangzhong Cao

Journal of Physics (2023)

Abstract
With the popularity of the Internet and the rapid growth of multimodal data, multimodal retrieval has become an active research area. As an important branch of multimodal retrieval, image-text retrieval aims to design a model that learns and aligns the two modalities, image and text, building a bridge of semantic association between these heterogeneous data so as to achieve unified alignment and retrieval. Mainstream image-text cross-modal retrieval approaches have made good progress by designing deep learning models that uncover latent associations between the two modalities. In this paper, we design a transformer-based feature fusion network that fuses the information of the two modalities during feature extraction, which enriches the semantic connections between them. We conduct experiments on the benchmark dataset Flickr30k and obtain competitive results, with Recall@10 reaching 96.2% on image-to-text retrieval.
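The abstract does not detail the fusion architecture, but a common way to fuse two modalities with a transformer during feature extraction is bidirectional multi-head cross-attention, where each modality attends to the other. The PyTorch sketch below illustrates that general pattern only; the module name, dimensions, and residual/LayerNorm layout are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of transformer-based cross-modal feature fusion.
# This assumes standard multi-head cross-attention between image-region
# and text-token features; all names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Two cross-attention directions: image queries over text keys/values,
        # and text queries over image keys/values.
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_norm = nn.LayerNorm(dim)
        self.txt_norm = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor):
        # img_feats: (batch, n_regions, dim); txt_feats: (batch, n_tokens, dim)
        # Each modality is enriched with context from the other, which is
        # one plausible reading of "fusing information during feature extraction".
        img_ctx, _ = self.img_to_txt(img_feats, txt_feats, txt_feats)
        txt_ctx, _ = self.txt_to_img(txt_feats, img_feats, img_feats)
        fused_img = self.img_norm(img_feats + img_ctx)   # residual + LayerNorm
        fused_txt = self.txt_norm(txt_feats + txt_ctx)
        return fused_img, fused_txt


if __name__ == "__main__":
    fusion = CrossModalFusion()
    img = torch.randn(2, 36, 512)   # e.g. 36 detected regions per image
    txt = torch.randn(2, 20, 512)   # e.g. 20 word tokens per caption
    fi, ft = fusion(img, txt)
    print(fi.shape, ft.shape)       # (2, 36, 512) and (2, 20, 512)
```

The fused features could then feed a contrastive image-text matching loss for retrieval, but that training objective is likewise not specified in the abstract.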
Keywords
transformer, feature, retrieval, cross-modal