Many Hands Make Light Work: Transferring Knowledge From Auxiliary Tasks for Video-Text Retrieval.

IEEE Transactions on Multimedia (2023)

Abstract
The problem of video-text retrieval, which searches for videos via natural language descriptions or vice versa, has attracted growing attention due to the explosive scale of videos produced every day. The dominant approaches to this problem follow a pipeline that first learns compact feature representations of videos and texts, and then jointly embeds them into a common feature space where matched video-text pairs are close and unmatched pairs are far apart. However, most of them neither consider the structural similarities among cross-modal samples from a global view, nor leverage useful information from other relevant retrieval processes. We argue that both kinds of information have great potential for video-text retrieval. In this paper, we treat the relevant retrieval processes as auxiliary tasks and extract useful knowledge from them by exploiting structural similarities via Graph Neural Networks (GNNs). We then progressively transfer the knowledge from the auxiliary tasks in a general-to-specific manner to assist the main task, the current retrieval process. Specifically, to retrieve results for a given query, we first construct a sequence of query-graphs whose central queries are chosen from distant to close relative to the given query. We then conduct knowledge-guided message passing in each query-graph to exploit regional structural similarities, and gather knowledge of different levels from the updated query-graphs with a knowledge-based attention mechanism. Finally, we transfer the extracted knowledge from general to specific to assist the current retrieval process. Extensive experimental results show that our model outperforms state-of-the-art methods on four benchmarks.
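The general-to-specific idea in the abstract can be illustrated with a toy sketch. This is not the authors' model: the function names (`rank_videos`, `distant_to_close`, `message_pass`, `refine_query`), the mean-aggregation step standing in for a GNN layer, the cosine-based gate standing in for knowledge-based attention, and the `k` and `weight` parameters are all assumptions made only for illustration. It assumes video and text embeddings already live in a common feature space.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_videos(query, videos):
    """Return video indices sorted from most to least similar to the query."""
    return sorted(range(len(videos)),
                  key=lambda i: cosine(query, videos[i]),
                  reverse=True)


def distant_to_close(query, aux_queries):
    """Order auxiliary (central) queries from least to most similar to the query."""
    return sorted(aux_queries, key=lambda q: cosine(query, q))


def message_pass(center, videos, k=2):
    """Toy stand-in for one GNN layer (an assumption, not the paper's layer):
    average the center's k nearest videos and blend the result with the center."""
    neighbours = rank_videos(center, videos)[:k]
    dim = len(center)
    agg = [sum(videos[i][d] for i in neighbours) / len(neighbours)
           for d in range(dim)]
    return [0.5 * center[d] + 0.5 * agg[d] for d in range(dim)]


def refine_query(query, aux_queries, videos, k=2, weight=0.2):
    """Fold knowledge from each auxiliary query-graph into the query,
    visiting the graphs from general (distant) to specific (close)."""
    refined = list(query)
    for aux in distant_to_close(query, aux_queries):
        knowledge = message_pass(aux, videos, k)
        # Hypothetical attention-like gate: knowledge that agrees with the
        # current query representation contributes more.
        gate = max(cosine(refined, knowledge), 0.0)
        refined = [r + weight * gate * kn for r, kn in zip(refined, knowledge)]
    return refined


videos = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]
aux_queries = [[1.0, 0.2], [0.0, 1.0]]
refined = refine_query(query, aux_queries, videos)
ranking = rank_videos(refined, videos)
```

In this sketch the auxiliary queries are visited starting with the one least similar to the given query, so broad context is folded in first and the most closely related query has the final say.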
Keywords
Auxiliary tasks, graph neural networks, knowledge transfer, video-text retrieval