Video question answering via grounded cross-attention network learning

Information Processing & Management (2020)

Abstract
• We study video question answering from the viewpoint of modeling both a rough video representation and a grounded video representation. A joint question-video representation, built on the rough and grounded representations of the video, is learned for answer prediction.
• We propose the grounded cross-attention network (GCANet) learning framework, a novel hierarchical cross-attention method with a Q-O cross-attention layer and a Q-V-H cross-attention layer. GCANet adopts a novel mutual attention learning mechanism.
• We construct two large-scale datasets for video question answering. Extensive experiments validate the effectiveness of our method.
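The abstract names cross-attention layers but does not spell out how they are computed. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention between two sets of feature vectors, applied in both directions. It is not the authors' GCANet: the function names, the toy feature values, and the reading of "mutual attention" as attention in both directions are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, contexts):
    """For each query vector, return a softmax-weighted sum of the context vectors.

    Weights come from scaled dot products between the query and each context
    vector (generic attention, assumed here; not taken from the paper).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in contexts]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[i] for w, c in zip(weights, contexts)) for i in range(dim)]
        )
    return attended

# Hypothetical toy features: 2 question-word vectors and 3 object-region vectors.
question = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Assumed reading of "mutual" attention: attend in both directions.
question_to_objects = cross_attend(question, objects)
objects_to_question = cross_attend(objects, question)
```

Each output vector is a convex combination of the context vectors, so a question word is summarized by the object regions it matches most strongly, and vice versa. A real model would add learned projections and stack such layers hierarchically.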
Keywords
Visual information retrieval, Video question answering, Cross-attention