Semantic Collaborative Learning for Cross-Modal Moment Localization

ACM TRANSACTIONS ON INFORMATION SYSTEMS (2024)

Abstract
Localizing a desired moment within an untrimmed video via a given natural language query, i.e., cross-modal moment localization, has recently attracted widespread research attention. It is a challenging task because it requires not only accurately understanding intra-modal semantic information, but also explicitly capturing inter-modal semantic correlations (consistency and complementarity). Existing efforts focus mainly on intra-modal semantic understanding and inter-modal semantic alignment, while neglecting the necessary semantic supplement. To this end, we present a cross-modal semantic perception network for more effective intra-modal semantic understanding and inter-modal semantic collaboration. Concretely, we design a dual-path representation network for intra-modal semantic modeling, and we develop a semantic collaborative network to achieve multi-granularity semantic alignment and hierarchical semantic supplement. Effective moment localization can thereby be achieved through sufficient semantic collaborative learning. Extensive comparison experiments demonstrate the promising performance of our model against existing state-of-the-art competitors.
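To make the two inter-modal operations the abstract names more concrete, below is a minimal PyTorch sketch of (a) multi-granularity alignment, computed here as a fine-grained clip-word similarity matrix plus a coarse global cosine score, and (b) a semantic supplement step, realized here as cross-attention from video clips to query words. This is an illustrative sketch only, not the authors' implementation; all names (`CrossModalAlign`, `d_model`, the attention configuration) are hypothetical assumptions.

```python
# Hypothetical sketch of multi-granularity alignment plus a cross-attention
# "supplement" step; not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAlign(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        # Cross-attention lets each video clip gather complementary
        # information from the query words (one plausible reading of
        # "semantic supplement").
        self.supplement = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)

    def forward(self, video: torch.Tensor, query: torch.Tensor):
        # video: (B, T, d) clip features; query: (B, L, d) word features.
        # Fine-grained alignment: clip-word cosine similarity matrix.
        sim = torch.einsum('btd,bld->btl',
                           F.normalize(video, dim=-1),
                           F.normalize(query, dim=-1))
        # Coarse-grained alignment: cosine similarity of pooled features.
        global_sim = F.cosine_similarity(video.mean(1), query.mean(1), dim=-1)
        # Supplement video features with query-attended context.
        supplemented, _ = self.supplement(video, query, query)
        return sim, global_sim, video + supplemented

# Toy usage with random features.
model = CrossModalAlign()
v, q = torch.randn(2, 32, 256), torch.randn(2, 12, 256)
sim, g, fused = model(v, q)
print(sim.shape, g.shape, fused.shape)  # (2, 32, 12) (2,) (2, 32, 256)
```

A residual connection is used after the attention step so the supplement enriches, rather than replaces, the intra-modal video representation; the alignment scores would typically feed a matching or localization loss.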
Keywords
Cross-modal moment localization, intra-modal semantic understanding, inter-modal semantic collaboration