Attention-Based Relation Reasoning Network for Video-Text Retrieval.

ICME (2021)

Abstract
In the field of video-text matching, there are several informative internal relations within single-modality data that existing approaches often ignore. In this paper, we propose a novel model named Attention-based Relation Reasoning Network (ARRN), which robustly learns and reasons about the relations between words in a sentence and the temporal relations between video frames. It jointly captures the local and global characteristics of video and text, thereby significantly improving performance on video-text retrieval. In ARRN, a global-to-local attention strategy attends to important relations at multiple scales and learns more reasonable local relation features. These features, generated at distinct levels, are powerful and complementary to each other, allowing us to obtain effective video and text representations through very simple fusion. Extensive experiments on two widely used video-text datasets, MSVD and TGIF, show that the proposed ARRN achieves substantial improvements.
Keywords
Multimedia, video-text retrieval, relation network
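
The abstract does not include implementation details, so the following is a rough, hypothetical sketch of how a multi-scale temporal relation module with global-to-local attention pooling over video frame features might look, assuming a PyTorch setting. The class name, feature dimension, scale choices, consecutive-group sampling, and the mean-pooled global context are illustrative assumptions for exposition, not the authors' method.

```python
# Hypothetical sketch (not the authors' code): multi-scale frame relations
# re-weighted by a global-to-local attention, loosely following the ideas
# described in the abstract. All names and design choices are assumptions.
import torch
import torch.nn as nn


class RelationAttentionSketch(nn.Module):
    """Aggregates frame features via multi-scale relations weighted by a global context."""

    def __init__(self, dim=512, scales=(2, 3, 4)):
        super().__init__()
        self.scales = scales
        # One small MLP per scale to reason over a group of k consecutive frame features.
        self.relation_mlps = nn.ModuleDict({
            str(k): nn.Sequential(nn.Linear(k * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for k in scales
        })

    def forward(self, frames):                       # frames: (B, T, dim)
        global_ctx = frames.mean(dim=1)               # simple global summary (assumption)
        relations = []
        for k in self.scales:
            # Reason over every group of k consecutive frames at this scale.
            for start in range(frames.size(1) - k + 1):
                group = frames[:, start:start + k].flatten(1)          # (B, k*dim)
                relations.append(self.relation_mlps[str(k)](group))    # (B, dim)
        rel = torch.stack(relations, dim=1)            # (B, R, dim) local relation features
        # Global-to-local attention: weight each local relation by its
        # similarity to the global context, then fuse by weighted sum.
        scores = torch.softmax((rel * global_ctx.unsqueeze(1)).sum(-1), dim=1)  # (B, R)
        return (scores.unsqueeze(-1) * rel).sum(dim=1)                          # (B, dim)


if __name__ == "__main__":
    # Toy usage: 2 videos, 8 sampled frames each, 512-d frame features.
    video_repr = RelationAttentionSketch()(torch.randn(2, 8, 512))
    print(video_repr.shape)  # torch.Size([2, 512])
```

A matching text branch would apply the same kind of relation reasoning over word features before the cross-modal similarity is computed; that side is omitted here for brevity.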