Video question answering via grounded cross-attention network learning

Yunan Ye, Shifeng Zhang, Yimeng Li, Xufeng Qian, Siliang Tang, Shiliang Pu, Jun Xiao

Information Processing & Management (2020)

Abstract

• We study video question answering from the viewpoint of modeling both a rough video representation and a grounded video representation. A joint question-video representation, built from the rough and grounded representations of the video, is learned for answer prediction.

• We propose the grounded cross-attention network (GCANet) learning framework, a novel hierarchical cross-attention method with a Q-O cross-attention layer and a Q-V-H cross-attention layer. The proposed GCANet adopts a novel mutual attention learning mechanism.

• We construct two large-scale datasets for video question answering. Extensive experiments validate the effectiveness of our method.
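The abstract does not spell out how the cross-attention layers are computed, so the following is only a generic sketch of two-way (mutual) cross-attention between question features and object features, not the GCANet architecture itself. The function name `mutual_cross_attention`, the scaled dot-product scoring, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mutual_cross_attention(question, objects):
    """Hypothetical two-way cross-attention (assumed, not from the paper).

    question: n x d list of question-token features
    objects:  m x d list of object (region) features
    Returns the question tokens re-expressed over objects, the objects
    re-expressed over question tokens, and both attention maps.
    """
    d = len(question[0])
    # Scaled dot-product affinity between every question token and object.
    scores = [[v / math.sqrt(d) for v in row]
              for row in matmul(question, transpose(objects))]
    q_to_o = [softmax(row) for row in scores]             # n x m
    o_to_q = [softmax(row) for row in transpose(scores)]  # m x n
    question_grounded = matmul(q_to_o, objects)           # n x d
    objects_grounded = matmul(o_to_q, question)           # m x d
    return question_grounded, objects_grounded, q_to_o, o_to_q


# Toy inputs: 2 question tokens and 3 objects, each a 2-dimensional feature.
question = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_grounded, o_grounded, q_to_o, o_to_q = mutual_cross_attention(question, objects)
```

Each row of `q_to_o` is a distribution over objects for one question token, and each row of `o_to_q` is a distribution over question tokens for one object. Attending in both directions is one common reading of "mutual attention"; the paper's exact formulation may differ.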

Keywords

Visual information retrieval, Video question answering, Cross-attention
