Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts

MM '20: Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020

Abstract
Grounding objects in visual contexts from natural language queries is a crucial yet challenging vision-and-language task that has gained increasing attention in recent years. Existing work has primarily investigated this task in the context of still images. Despite their effectiveness, these methods cannot be directly transferred to the video domain, mainly due to 1) the complex spatio-temporal structure of videos and 2) the scarcity of fine-grained video annotations. Grounding objects in videos effectively is thus considerably more challenging and remains under-explored. To fill this research gap, this paper presents a weakly-supervised framework for linking objects mentioned in a sentence with the corresponding regions in videos. It considers two key characteristics of videos: 1) objects are dynamically distributed across multiple frames and have diverse temporal durations, and 2) object regions in videos are spatially correlated with each other. Specifically, we propose a weakly-supervised video object grounding approach consisting of three modules: 1) a temporal localization module that models the latent relation between queried objects and frames with a temporal attention network, 2) a spatial interaction module that captures feature correlations among object regions to learn context-aware region representations, and 3) a hierarchical video multiple instance learning algorithm that estimates the sentence-segment grounding score for discriminative training. Extensive experiments demonstrate that our method achieves consistent improvements over state-of-the-art approaches.
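To make the abstract's pipeline concrete, below is a minimal sketch (not the authors' released code) of the two core scoring steps it describes: a temporal attention over frames conditioned on a queried object, and a multiple-instance-learning (MIL) max over regions followed by attention-weighted pooling over frames to obtain a sentence-segment grounding score. All module names (`TemporalAttention`, `grounding_score`), tensor shapes, and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of temporal attention + hierarchical MIL scoring.
# Shapes and names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Scores each frame's relevance to a queried object (temporal localization)."""
    def __init__(self, region_dim: int, word_dim: int, hidden: int = 256):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden)  # project visual features
        self.proj_q = nn.Linear(word_dim, hidden)    # project query embedding

    def forward(self, frame_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, D_v) pooled region features per frame
        # query:       (D_q,)   embedding of the queried object word
        scores = (self.proj_v(frame_feats) * self.proj_q(query)).sum(-1)  # (T,)
        return F.softmax(scores, dim=0)  # one attention weight per frame

def grounding_score(region_feats: torch.Tensor,
                    query: torch.Tensor,
                    attn: torch.Tensor) -> torch.Tensor:
    # region_feats: (T, R, H) context-aware region features
    #               (i.e., output of a spatial interaction module)
    # query:        (H,)      projected query embedding
    # attn:         (T,)      temporal attention weights
    sim = region_feats @ query              # (T, R) region-query similarity
    frame_score = sim.max(dim=1).values     # MIL: best-matching region per frame
    return (attn * frame_score).sum()       # attention-weighted segment score
```

Under the weakly-supervised setting the abstract describes, a score like this would typically be trained discriminatively, e.g., by a ranking or contrastive loss that pushes scores of matched sentence-video pairs above those of mismatched pairs, since no region-level labels are available.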
Keywords
Cross-modal retrieval, weakly-supervised learning, video object grounding, vision and language