Online Asymmetric Metric Learning with Multi-Layer Similarity Aggregation for Cross-Modal Retrieval

IEEE Transactions on Image Processing (2019)

Abstract
Cross-modal retrieval has attracted intensive attention in recent years, and a substantial yet challenging problem is how to measure the similarity between heterogeneous data modalities. Despite using modality-specific representation learning techniques, most existing shallow or deep models treat different modalities equally and neglect the intrinsic modality heterogeneity and information imbalance between modalities such as images and text. In this paper, we propose an online similarity function learning framework that learns a metric which well reflects the cross-modal semantic relation. Since multiple CNN feature layers naturally represent visual information from low-level visual patterns to high-level semantic abstraction, we propose a new asymmetric image-text similarity formulation that aggregates layer-wise visual-textual similarities, each parameterized by its own bilinear parameter matrix. To learn the aggregated similarity function effectively, we develop three similarity combination strategies: averaging kernel, multiple kernel learning, and layer gating. The two kernel-based strategies assign uniform layer weights to all data pairs, whereas layer gating operates on the original feature representations and assigns instance-aware layer weights to each data pair. All three strategies are learned by preserving the bi-directional relative similarity expressed by a large number of cross-modal training triplets. Experiments on three public datasets demonstrate the effectiveness of our method.
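To make the formulation concrete, below is a minimal NumPy sketch of the aggregated bilinear similarity and a triplet hinge objective of the kind the abstract describes. This is not the authors' implementation: the feature dimensions (feat_dims, text_dim), the random initialization of the bilinear matrices, and the fixed uniform layer_weights (standing in for the averaging-kernel strategy) are all illustrative assumptions.

```python
# Sketch of multi-layer aggregated bilinear similarity: S(x, y) = sum_l a_l * x_l^T W_l y,
# trained with a relative-similarity (triplet) hinge loss. Assumed, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

feat_dims = [256, 512, 1024]   # per-layer image feature sizes (assumed)
text_dim = 300                 # text feature size (assumed)

# One bilinear parameter matrix W_l per visual layer. Learned in the paper;
# randomly initialized here for the sketch.
W = [rng.normal(scale=0.01, size=(d, text_dim)) for d in feat_dims]

# Layer combination weights a_l. The averaging-kernel strategy fixes them to
# uniform; multiple kernel learning would learn one global set; layer gating
# would predict instance-aware weights per data pair.
layer_weights = np.full(len(feat_dims), 1.0 / len(feat_dims))

def similarity(img_feats, txt):
    """Aggregate layer-wise bilinear similarities: sum_l a_l * (x_l @ W_l @ y)."""
    return sum(a * x @ Wl @ txt
               for a, x, Wl in zip(layer_weights, img_feats, W))

def triplet_hinge(img_feats, txt_pos, txt_neg, margin=1.0):
    """Hinge loss for one image-anchored triplet: matched text should score
    higher than mismatched text by at least the margin."""
    return max(0.0, margin
               - similarity(img_feats, txt_pos)
               + similarity(img_feats, txt_neg))

# Toy usage with random vectors standing in for real CNN / text encodings.
img = [rng.normal(size=d) for d in feat_dims]
pos, neg = rng.normal(size=text_dim), rng.normal(size=text_dim)
print(triplet_hinge(img, pos, neg))
```

In a full system the same hinge would also be applied to text-anchored triplets to preserve the bi-directional relative similarity, and W (plus the layer weights, under the learned strategies) would be updated online from a stream of such triplets.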
Keywords
Kernel, Visualization, Measurement, Semantics, Training, Feature extraction, Correlation