Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval

International Multimedia Conference (2021)

Cited 41 | Views 48
Abstract
Video moment retrieval aims to localize the video moment most relevant to a given text query. Weakly supervised approaches use only video-text pairs for training, without temporal annotations. Most current methods align proposed video moments and the text in a joint embedding space. However, in the absence of temporal annotations, the semantic gap between the two modalities leads most methods to focus predominantly on learning joint feature representations, with less emphasis on learning visual feature representations. This paper aims to improve the visual feature representation with supervision in the visual domain, yielding discriminative visual features for cross-modal learning. We observe that relevant video moments (i.e., those depicting similar activities) from different videos are commonly described by similar sentences; hence, the visual features of these relevant moments should also be similar, even though they come from different videos. Therefore, to obtain more discriminative and robust visual features for video moment retrieval, we propose to align the visual features of relevant video moments from different videos that co-occur in the same training batch. In addition, a contrastive learning approach is introduced to learn moment-level alignment across these videos. Through extensive experiments, we demonstrate that the proposed visual co-occurrence alignment learning method outperforms its cross-modal alignment counterpart and achieves promising results for video moment retrieval.
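The abstract does not give the exact objective, but the described idea, treating cross-video moments whose query sentences are similar as positives in a contrastive loss, can be sketched in PyTorch as below. This is a minimal illustration under stated assumptions: the function name `cooccurrence_alignment_loss`, the sentence-similarity threshold used to define "relevant" pairs, and the InfoNCE-style form of the loss are hypothetical choices for exposition, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def cooccurrence_alignment_loss(moment_feats, text_feats,
                                temperature=0.1, sim_threshold=0.8):
    """Hypothetical moment-level contrastive alignment loss.

    moment_feats: (B, D) visual feature of the selected moment per video.
    text_feats:   (B, D) sentence feature of the corresponding query.
    Cross-video moments whose query sentences are similar are treated as
    positives; all other cross-video moments serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    v = F.normalize(moment_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)

    # Sentence similarity decides which cross-video moments count as
    # "relevant" (co-occurring); the threshold is an assumed heuristic.
    pos_mask = (t @ t.t() > sim_threshold).float()
    pos_mask.fill_diagonal_(0)  # a moment is not its own positive

    # Visual similarity logits over all moment pairs in the batch.
    logits = (v @ v.t()) / temperature
    # Use a large negative value (not -inf) on the diagonal so the
    # masked product below stays finite.
    logits.fill_diagonal_(-1e9)

    log_prob = F.log_softmax(logits, dim=-1)

    # InfoNCE-style loss: average log-likelihood of the positive set
    # per anchor; anchors with no in-batch positive are skipped.
    num_pos = pos_mask.sum(dim=-1)
    has_pos = num_pos > 0
    if not has_pos.any():
        return v.sum() * 0.0  # no relevant pairs in this batch
    loss = -(pos_mask * log_prob).sum(dim=-1)[has_pos] / num_pos[has_pos]
    return loss.mean()

# Example usage (random features, B=4 videos, D=256 dimensions):
# loss = cooccurrence_alignment_loss(torch.randn(4, 256), torch.randn(4, 256))
```

Note that the positive set is derived from sentence similarity rather than temporal labels, since query text is the only pairing signal available under weak supervision; in a full model the moment features would come from each video's proposal branch.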