Attentive Snippet Prompting for Video Retrieval

IEEE Transactions on Multimedia (2024)

Abstract
Recent advances in video retrieval have been driven by large-scale vision-language pretraining models. In particular, state-of-the-art approaches are mainly based on temporal extensions of the well-known CLIP model. However, they ignore a critical problem in video retrieval: the text often refers to only a small snippet of the corresponding video. Blindly aggregating all the frames inevitably reduces the discriminative capacity of the final video token when it is matched against the text token. Hence, these approaches are limited in their ability to retrieve complex videos with diverse content. To tackle this problem, we propose a concise and novel Attentive Snippet Prompting (ASP) framework, which dynamically exploits the text-relevant video snippet to boost retrieval. Specifically, ASP consists of two simple but effective modules: snippet prompting and video aggregating. Given a text-video pair, snippet prompting uses cross-modal attention to construct a text-driven visual prompt, namely the attentive snippet token, which adaptively describes the video snippet relevant to the text query. In parallel, video aggregating summarizes all the frame tokens into a video token that provides the global context. Through the cooperation of the attentive snippet token and the global video token, ASP can effectively learn a robust, text-relevant visual representation for video retrieval. Finally, we evaluate the ASP framework on widely used benchmarks, where it outperforms a number of recent approaches by a large margin.
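The two modules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the attention form, the aggregation operator, or how the two tokens are combined, so the scaled dot-product attention, the mean pooling, the weighted-sum fusion, and every function name and the `alpha` parameter below are assumptions chosen for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def attentive_snippet_token(text_token, frame_tokens):
    # Assumed form of cross-modal attention: the text token is the query,
    # the frame tokens act as both keys and values.
    scale = math.sqrt(len(text_token))
    weights = softmax([dot(f, text_token) / scale for f in frame_tokens])
    return weighted_sum(weights, frame_tokens), weights

def global_video_token(frame_tokens):
    # Assumed aggregation: plain mean pooling over all frame tokens.
    uniform = [1.0 / len(frame_tokens)] * len(frame_tokens)
    return weighted_sum(uniform, frame_tokens)

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)

def asp_similarity(text_token, frame_tokens, alpha=0.5):
    # Assumed fusion: weighted sum of the snippet token and the global token.
    # alpha=0.0 reduces to matching against the mean-pooled video token only.
    snippet, _ = attentive_snippet_token(text_token, frame_tokens)
    video = global_video_token(frame_tokens)
    fused = [alpha * s + (1.0 - alpha) * v for s, v in zip(snippet, video)]
    return cosine(fused, text_token)

# Toy example: only the third frame matches the text query.
text = [3.0, 0.0, 0.0, 0.0]
frames = [
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
]
_, weights = attentive_snippet_token(text, frames)
print(weights)
print(asp_similarity(text, frames, alpha=0.0))
print(asp_similarity(text, frames, alpha=0.5))
```

In this toy setting the attention weights concentrate on the single matching frame, and adding the snippet token to the fused representation yields a higher text-video similarity than mean pooling alone, which mirrors the motivation stated in the abstract.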
Keywords
Video retrieval, prompting, attention, multi-modal learning