Video Corpus Moment Retrieval via Deformable Multigranularity Feature Fusion and Adversarial Training

Xuemei Zhang, Peng Zhao, Jinsheng Ji, Xiankai Lu, Yilong Yin

IEEE Transactions on Circuits and Systems for Video Technology (2023)

Abstract
As an emerging task, video corpus moment retrieval (VCMR) aims to find the video segments relevant to a given natural language query from a large collection of untrimmed videos. It comprises two subtasks: retrieving the video most relevant to the query text (video retrieval) and locating the segment within a video most relevant to the query (moment localization). Since videos contain rich multi-modal information such as audio, text, and images, aligning this information with the textual query and modeling their cross-modal interaction is the core challenge of the task. This article proposes a Deformable Multigranularity Feature Fusion with Adversarial Training Network (DMFAT). DMFAT first feeds the subtitle and frame modalities of the video into a Multi-Scale Deformable Attention module, which performs multi-granularity feature fusion on each modality via deformable attention. Then, guided by the query, adaptive weights are generated to fuse the two multi-granularity modality features of the video. Finally, a bidirectional attention module produces the cross-modal representation of the query and video features, and an adversarial contrastive learning objective is introduced to enable more precise moment localization. Our model is evaluated on two representative video corpus moment retrieval benchmarks, TVR and DiDeMo, and extensive experiments demonstrate that it outperforms existing work.
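The abstract gives no implementation details, but the query-guided adaptive weighting step can be illustrated with a minimal sketch. The code below assumes a pooled query embedding and per-modality feature sequences of a shared dimension; the module name QueryGuidedFusion, the single-linear-layer softmax gate, and all tensor shapes are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryGuidedFusion(nn.Module):
    """Fuse frame and subtitle feature sequences with adaptive weights
    generated from the query embedding (hypothetical sketch; not the
    paper's actual implementation)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 2)  # one logit per modality

    def forward(self, query: torch.Tensor,
                frame_feat: torch.Tensor,
                sub_feat: torch.Tensor) -> torch.Tensor:
        # query:      (B, D)    pooled sentence embedding
        # frame_feat: (B, T, D) multi-granularity frame features
        # sub_feat:   (B, T, D) multi-granularity subtitle features
        w = F.softmax(self.gate(query), dim=-1)          # (B, 2)
        w_frame = w[:, 0].view(-1, 1, 1)                 # (B, 1, 1)
        w_sub = w[:, 1].view(-1, 1, 1)
        return w_frame * frame_feat + w_sub * sub_feat   # (B, T, D)

# Usage with random tensors:
fusion = QueryGuidedFusion(dim=512)
q, v, s = torch.randn(4, 512), torch.randn(4, 64, 512), torch.randn(4, 64, 512)
fused = fusion(q, v, s)  # shape: (4, 64, 512)
```

The softmax gate keeps the two weights non-negative and summing to one, so for each sample the query smoothly arbitrates between visual and subtitle evidence rather than letting one modality dominate unconditionally.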
Keywords
Video Corpus Moment Retrieval, Deformable Attention, Multi-granularity, Adversarial Training