Attention-Based Relation Reasoning Network for Video-Text Retrieval.

ICME (2021)

Abstract
In the field of video-text matching, there are several informative internal relations within single-modality data that existing approaches often ignore. In this paper, we propose a novel model named Attention-based Relation Reasoning Network (ARRN), which robustly learns and reasons about the relations between words in a sentence and the temporal relations between video frames. It jointly captures the local and global characteristics of video and text, thereby significantly improving performance on video-text retrieval. In ARRN, a global-to-local attention strategy attends to important relations at multiple scales and learns more reasonable local relation features. These features, generated at distinct levels, are powerful and complementary to each other, allowing us to obtain effective video and text representations through very simple fusion. Extensive experiments on two widely used video-text datasets, MSVD and TGIF, show that the proposed ARRN achieves substantial improvements.
Keywords
Multimedia, video-text retrieval, relation network
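
The abstract does not include implementation details, so the following is a rough, hypothetical sketch of how a multi-scale temporal relation module with global-to-local attention pooling over video frame features might look, assuming a PyTorch setting. The class name, feature dimension, scale choices, consecutive-group sampling, and the mean-pooled global context are illustrative assumptions for exposition, not the authors' method.

```python
# Hypothetical sketch (not the authors' code): multi-scale frame relations
# re-weighted by a global-to-local attention, loosely following the ideas
# described in the abstract. All names and design choices are assumptions.
import torch
import torch.nn as nn


class RelationAttentionSketch(nn.Module):
    """Aggregates frame features via multi-scale relations weighted by a global context."""

    def __init__(self, dim=512, scales=(2, 3, 4)):
        super().__init__()
        self.scales = scales
        # One small MLP per scale to reason over a group of k consecutive frame features.
        self.relation_mlps = nn.ModuleDict({
            str(k): nn.Sequential(nn.Linear(k * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for k in scales
        })

    def forward(self, frames):                       # frames: (B, T, dim)
        global_ctx = frames.mean(dim=1)               # simple global summary (assumption)
        relations = []
        for k in self.scales:
            # Reason over every group of k consecutive frames at this scale.
            for start in range(frames.size(1) - k + 1):
                group = frames[:, start:start + k].flatten(1)          # (B, k*dim)
                relations.append(self.relation_mlps[str(k)](group))    # (B, dim)
        rel = torch.stack(relations, dim=1)            # (B, R, dim) local relation features
        # Global-to-local attention: weight each local relation by its
        # similarity to the global context, then fuse by weighted sum.
        scores = torch.softmax((rel * global_ctx.unsqueeze(1)).sum(-1), dim=1)  # (B, R)
        return (scores.unsqueeze(-1) * rel).sum(dim=1)                          # (B, dim)


if __name__ == "__main__":
    # Toy usage: 2 videos, 8 sampled frames each, 512-d frame features.
    video_repr = RelationAttentionSketch()(torch.randn(2, 8, 512))
    print(video_repr.shape)  # torch.Size([2, 512])
```

A matching text branch would apply the same kind of relation reasoning over word features before the cross-modal similarity is computed; that side is omitted here for brevity.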