Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval

Zhong Ji, Zhigang Lin, Haoran Wang, Yanwei Pang, Xuelong Li

Pattern Recognition (2024)

Abstract

Bridging visual and textual representations plays a central role in multimedia data understanding. The main challenge is that images and texts reside in heterogeneous spaces, which makes it difficult to preserve semantic consistency between the two modalities. To narrow the modality gap, most recent methods resort to extra object detectors or parsers to obtain hierarchical representations. In this work, we address this problem by introducing our Multi-Task Hierarchical Convolutional Neural Network (MT-HCN), which mines hierarchical semantic information without any extra supervision. First, from the perspective of representation architecture, we leverage the intrinsic hierarchical structure of Convolutional Neural Networks (CNNs) to decompose the representations of both modalities into two semantically complementary levels, i.e., exterior representations and concept representations. The former focus on discovering fine-grained low-level associations between the two modalities, while the latter emphasize capturing more abstract high-level semantics. Specifically, we present a Self-Supervised Clustering (SSC) loss to preserve more fine-grained semantic clues in the exterior representations. It is built by treating multiple image/text pairs with similar exteriors as one category. In addition, we propose a novel harmonious bidirectional triplet ranking (HBTR) loss, which mitigates the adverse effects of biased and noisy negative samples. Besides the hardest negatives, it also imposes constraints on the distance between the positive pairs and the centroid of the negative pairs. Extensive experimental results on two popular cross-modal retrieval benchmarks demonstrate that the proposed MT-HCN achieves competitive results compared with state-of-the-art methods.
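The abstract describes the HBTR loss only in words, so the following is a rough illustrative sketch rather than the authors' formulation. It assumes a square similarity matrix with matched image/text pairs on the diagonal, a hinge-style margin, and the mean similarity to the negatives as a stand-in for the "centroid of negative pairs". The function name, the `margin` and `centroid_weight` parameters, and the way the two terms are combined are all hypothetical.

```python
def hinge(x):
    """Standard hinge: max(0, x)."""
    return max(0.0, x)


def bidirectional_triplet_with_centroid(sim, margin=0.2, centroid_weight=1.0):
    """Illustrative bidirectional triplet ranking loss (NOT the paper's exact HBTR).

    sim[i][j] is the similarity between image i and text j; matched pairs
    are assumed to lie on the diagonal. For each matched pair and each
    retrieval direction, two hinge terms are added:
      * one against the hardest (most similar) negative, and
      * one against the mean negative similarity, used here as an assumed
        proxy for the centroid of the negatives.
    """
    n = len(sim)
    if n < 2 or any(len(row) != n for row in sim):
        raise ValueError("sim must be a square matrix with at least 2 rows")

    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Image-to-text negatives: the other texts in row i.
        neg_i2t = [sim[i][j] for j in range(n) if j != i]
        # Text-to-image negatives: the other images in column i.
        neg_t2i = [sim[j][i] for j in range(n) if j != i]
        for negs in (neg_i2t, neg_t2i):
            hardest = max(negs)
            centroid = sum(negs) / len(negs)
            total += hinge(margin + hardest - pos)
            total += centroid_weight * hinge(margin + centroid - pos)
    return total / n


if __name__ == "__main__":
    # Toy similarity matrix: 3 images x 3 texts, matched pairs on the diagonal.
    toy = [
        [0.9, 0.1, 0.2],
        [0.3, 0.8, 0.1],
        [0.2, 0.4, 0.7],
    ]
    print(bidirectional_triplet_with_centroid(toy))
```

Under these assumptions, the hardest-negative term reacts to a single, possibly noisy, sample, while the centroid term averages over all negatives in the batch, which is one plausible reading of how the loss reduces sensitivity to biased negatives.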

Keywords

Vision and language, Cross-modal retrieval, Multi-task learning, Metric learning
