Attentive Snippet Prompting for Video Retrieval

Siran Chen, Qinglin Xu, Yue Ma, Yu Qiao, Yali Wang


Recent advances in video retrieval have been driven by large-scale visual-language pretraining models. In particular, the state-of-the-art approaches are mainly based on temporal extensions of the well-known CLIP model. However, they ignore a critical problem in video retrieval, i.e., the text often refers to only a small snippet of the corresponding video. Blindly aggregating all the frames inevitably reduces the discriminative capacity of the final video token when matching the text token. Hence, these approaches struggle to retrieve complex videos with diverse content. To tackle this problem, we propose a concise and novel Attentive Snippet Prompting (ASP) framework, which can dynamically exploit the text-relevant video snippet to boost retrieval. Specifically, our ASP consists of two simple but effective modules, i.e., snippet prompting and video aggregating. Given a pair of text and video, snippet prompting uses cross-modal attention to construct a text-driven visual prompt, namely the attentive snippet token, which adaptively describes the video snippet relevant to the text query. In parallel, video aggregating summarizes all the frame tokens into a single video token, providing global context. By combining the attentive snippet token with the global video token, our ASP can effectively learn a robust and text-relevant visual representation for video retrieval. Finally, we evaluate our ASP framework on widely-used benchmarks, where it outperforms a number of recent approaches by a large margin.
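The core mechanism described above admits a compact sketch: a text token attends over per-frame tokens via scaled dot-product attention to form an attentive snippet token, while a global video token is obtained by mean-pooling the frames. The sketch below is illustrative only, assuming generic CLIP-style embeddings; the function names and the mean-pooling choice are our assumptions, not details from the paper.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_snippet_token(text_tok, frame_toks):
    # cross-modal attention: the text token is the query,
    # frame tokens serve as keys/values (illustrative sketch)
    d = text_tok.shape[-1]
    scores = frame_toks @ text_tok / np.sqrt(d)   # (T,) similarity per frame
    weights = softmax(scores)                     # (T,) attention weights
    return weights @ frame_toks                   # (d,) text-driven snippet token

def global_video_token(frame_toks):
    # global context via mean pooling (an assumed aggregation choice)
    return frame_toks.mean(axis=0)

rng = np.random.default_rng(0)
text = rng.normal(size=64)          # hypothetical text token
frames = rng.normal(size=(12, 64))  # hypothetical tokens for 12 frames
snippet = attentive_snippet_token(text, frames)
video = global_video_token(frames)
```

Frames most similar to the text query receive larger attention weights, so the snippet token emphasizes the text-relevant portion of the video, while the pooled video token retains global context.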
Video retrieval, prompting, attention, multi-modal learning