CHAN: Cross-Modal Hybrid Attention Network for Temporal Language Grounding in Videos.
ICME(2023)
Abstract
The goal of the temporal language grounding (TLG) task is to temporally localize, in an untrimmed video, the video segment most semantically matched to a given sentence query. Effectively modeling the cross-modal interactions between video and language is the key to improving grounding performance. Previous approaches focus on learning correlations by computing an attention matrix between each frame-word pair, while ignoring the global semantics conditioned on one modality, which could better associate the complex video contents with the sentence query of the target modality. In this paper, we propose a novel Cross-modal Hybrid Attention Network (CHAN), which integrates two parallel attention fusion modules to exploit the semantics within each modality and the interactions across modalities. One is Intra-Modal Attention Fusion, which utilizes gated self-attention to capture frame-by-frame and word-by-word relations conditioned on the other modality. The other is Inter-Modal Attention Fusion, which utilizes query and key features derived from different modalities to calculate co-attention weights and further promote inter-modal fusion. Experimental results show that CHAN significantly outperforms several existing state-of-the-art methods on three challenging datasets (ActivityNet Captions, Charades-STA and TACoS), demonstrating the effectiveness of our proposed method.
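The abstract names two attention styles but gives no implementation details; the numpy sketch below only illustrates the general ideas under assumed shapes. The function names, the sigmoid gating form, and the mean-pooled conditioning vector are all illustrative assumptions, not the authors' actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(x, cond):
    """Intra-modal fusion sketch: self-attention over one modality,
    gated by a pooled feature of the other modality.
    x: (n, d) frame or word features; cond: (d,) pooled cross-modal feature."""
    attn = softmax(x @ x.T / np.sqrt(x.shape[-1]))      # frame-by-frame / word-by-word relations
    ctx = attn @ x                                      # self-attended context
    gate = 1.0 / (1.0 + np.exp(-(x * cond)))            # assumed sigmoid gate conditioned on cond
    return gate * ctx + (1.0 - gate) * x

def co_attention(video, text):
    """Inter-modal fusion sketch: queries from one modality, keys from the other.
    video: (n_v, d), text: (n_t, d)."""
    sim = video @ text.T / np.sqrt(video.shape[-1])     # (n_v, n_t) co-attention weights
    v2t = softmax(sim, axis=1) @ text                   # text-attended video features
    t2v = softmax(sim.T, axis=1) @ video                # video-attended text features
    return v2t, t2v
```

In this sketch the gate decides, per dimension, how much self-attended context to mix back into the original features, while the shared similarity matrix lets each modality attend to the other in both directions.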
Keywords
Temporal language grounding, cross-modal fusion, gated self-attention, co-attention