CHAN: Cross-Modal Hybrid Attention Network for Temporal Language Grounding in Videos.
ICME(2023)
Abstract
The goal of the temporal language grounding (TLG) task is to temporally localize, in an untrimmed video, the video segment most semantically matched to a given sentence query. Effectively modeling the cross-modal interactions between video and language is the key to improving grounding performance. Previous approaches focus on learning correlations by computing an attention matrix between each frame-word pair, while ignoring the global semantics conditioned on one modality, which could better associate the complex video contents with the sentence query of the target modality. In this paper, we propose a novel Cross-modal Hybrid Attention Network (CHAN), which integrates two parallel attention fusion modules to exploit the semantics within each modality and the interactions across modalities. One is Intra-Modal Attention Fusion, which utilizes gated self-attention to capture frame-by-frame and word-by-word relations conditioned on the other modality. The other is Inter-Modal Attention Fusion, which utilizes query and key features derived from different modalities to calculate co-attention weights and further promote inter-modal fusion. Experimental results show that CHAN significantly outperforms several existing state-of-the-art methods on three challenging datasets (ActivityNet Captions, Charades-STA and TACoS), demonstrating the effectiveness of our proposed method.
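The abstract names two attention styles but gives no implementation details; the numpy sketch below only illustrates the general ideas under assumed shapes. The function names, the sigmoid gating form, and the mean-pooled conditioning vector are all illustrative assumptions, not the authors' actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(x, cond):
    """Intra-modal fusion sketch: self-attention over one modality,
    gated by a pooled feature of the other modality.
    x: (n, d) frame or word features; cond: (d,) pooled cross-modal feature."""
    attn = softmax(x @ x.T / np.sqrt(x.shape[-1]))      # frame-by-frame / word-by-word relations
    ctx = attn @ x                                      # self-attended context
    gate = 1.0 / (1.0 + np.exp(-(x * cond)))            # assumed sigmoid gate conditioned on cond
    return gate * ctx + (1.0 - gate) * x

def co_attention(video, text):
    """Inter-modal fusion sketch: queries from one modality, keys from the other.
    video: (n_v, d), text: (n_t, d)."""
    sim = video @ text.T / np.sqrt(video.shape[-1])     # (n_v, n_t) co-attention weights
    v2t = softmax(sim, axis=1) @ text                   # text-attended video features
    t2v = softmax(sim.T, axis=1) @ video                # video-attended text features
    return v2t, t2v
```

In this sketch the gate decides, per dimension, how much self-attended context to mix back into the original features, while the shared similarity matrix lets each modality attend to the other in both directions.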
Keywords
Temporal language grounding, cross-modal fusion, gated self-attention, co-attention