
Hierarchical Semantic Enhanced Directional Graph Network for Visual Commonsense Reasoning.

Trustworthy AI @ ACM Multimedia (2021)

Cited: 1 | Views: 27
Abstract
The visual commonsense reasoning (VCR) task aims at advancing research on cognition-level correlation reasoning. It requires not only a thorough understanding of the correlated details of the scene but also the ability to infer correlations with related commonsense knowledge. Existing approaches use region-word affinity to perform semantic alignment between the vision and language domains, which neglects the implicit correspondences (e.g. word-scene, region-phrase, and phrase-scene) among visual concepts and linguistic words. Although previous work has delivered promising results, these methods still face challenges when it comes to making reasoning interpretable. Toward this end, we present a novel hierarchical semantic enhanced directional graph network. To be more specific, we design a Modality Interaction Unit (MIU) module, which captures high-order cross-modal alignment by aggregating the hierarchical vision-language relationships. Afterward, we propose a direction clue-aware graph reasoning (DCGR) module. In this module, valuable entities can be dynamically selected in each reasoning step according to their importance, which leads to a more interpretable reasoning procedure. Ultimately, heterogeneous graph attention is introduced to filter out the parts irrelevant to the final answers. Extensive experiments conducted on the VCR benchmark dataset demonstrate that our method achieves competitive results and better interpretability compared with several state-of-the-art baselines.
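The abstract does not specify how DCGR selects entities at each step. As a rough illustration only (not the paper's implementation), importance-based top-k selection of entity embeddings against a query vector summarizing the current reasoning state might be sketched as follows; all names and shapes here are assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def select_entities(query, entities, k):
    """Hypothetical one-step entity selection by importance.

    query:    (d,) vector summarizing the current reasoning state
    entities: (n, d) matrix of candidate entity embeddings
    Returns the indices of the k highest-scoring entities (descending)
    and their normalized importance weights.
    """
    scores = softmax(entities @ query)      # importance of each entity
    top = np.argsort(scores)[::-1][:k]      # indices of the k largest scores
    return top, scores[top]

# Toy example: 5 entities of dimension 8; keep the 2 most relevant.
rng = np.random.default_rng(0)
entities = rng.normal(size=(5, 8))
query = entities[3] + 0.1 * rng.normal(size=8)  # query resembles entity 3
idx, weights = select_entities(query, entities, k=2)
```

In the example, the query is built to resemble entity 3, so that entity receives the largest importance weight and survives the selection step.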
Keywords
visual commonsense reasoning, directional