Joint Attribute Manipulation and Modality Alignment Learning for Composing Text and Image to Image Retrieval

MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020

Cited 34 | Viewed 150
Abstract
Cross-modal retrieval has attracted much attention in recent years due to its wide applications. Conventional approaches usually take one modality as the query to retrieve relevant data of another modality. In this paper, we address an emerging task in cross-modal retrieval, Composing Text and Image to Image Retrieval (CTI-IR), which aims at retrieving images relevant to a query image together with text describing desired modifications to that image. Compared with conventional cross-modal retrieval, the new task is particularly useful when the query image alone does not perfectly match the user's expectations. Generally, CTI-IR involves two underlying problems: how to manipulate the visual features of the query image as specified by the text, and how to model the modality gap between the query and the target. Most previous methods focus on solving the second problem. In this paper, we aim to deal with both problems simultaneously in a unified model. Specifically, the proposed method is based on a graph attention network and an adversarial learning network, which enjoys several merits. First, the query image and the modification text are jointly organized in a relation graph for learning text-adaptive representations. Second, semantic content from the text is injected into the visual features through graph attention. Third, an adversarial loss is incorporated into the conventional cross-modal retrieval loss to learn more discriminative, modality-invariant representations for CTI-IR. Extensive experiments on three benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.
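The abstract does not give the authors' exact architecture, so the following is only a minimal PyTorch sketch of the two ingredients it names: a graph-attention step that injects text semantics into image region features, and an adversarial term alongside a standard cross-modal retrieval loss. All module names, dimensions, and the temperature value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToImageGraphAttention(nn.Module):
    """One cross-attention hop over a bipartite relation graph:
    image-region nodes attend to text-token nodes, so the textual
    modification semantics are injected into the visual features.
    (Hypothetical layer names and dimensions.)"""
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from image regions
        self.k = nn.Linear(dim, dim)   # keys from text tokens
        self.v = nn.Linear(dim, dim)   # values from text tokens
        self.out = nn.Linear(dim, dim)

    def forward(self, img_nodes, txt_nodes):
        # img_nodes: (B, R, D) region features; txt_nodes: (B, T, D) token features
        attn = torch.softmax(
            self.q(img_nodes) @ self.k(txt_nodes).transpose(1, 2)
            / img_nodes.size(-1) ** 0.5, dim=-1)               # (B, R, T)
        fused = img_nodes + self.out(attn @ self.v(txt_nodes)) # residual injection
        return fused.mean(dim=1)                               # pooled composed-query embedding

class ModalityDiscriminator(nn.Module):
    """Predicts whether an embedding came from the composed query or a
    target image; the retrieval model is trained to fool it, pushing the
    two distributions toward a shared, modality-invariant space."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)   # (B,) real/fake logits

def retrieval_loss(query_emb, target_emb):
    # Batch-wise softmax cross-entropy over cosine similarities,
    # matching each composed query to its own target image.
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.t() / 0.07            # temperature is an assumption
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

In such a setup the discriminator and the retrieval model would be updated alternately, with the total generator objective being the retrieval loss plus a weighted adversarial term; the weighting scheme and update schedule are design choices the abstract does not specify.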