Discovering Multimodal Hierarchical Structures with Graph Neural Networks for Multi-modal and Multi-hop Question Answering

Qing Zhang, Haocheng Lv, Jie Liu, Zhiyun Chen, Jianyong Duan, Mingying Xv, Hao Wang

PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT I (2024)

Abstract
Multimodal reasoning is a challenging task that requires understanding and integrating information from different modalities, such as text and images. Existing methods for multimodal reasoning often fail to capture the rich structural information among visual and textual semantics across modalities, which is crucial for generating accurate answers. In this paper, we propose a novel method that leverages graph neural networks to model this structural information and thereby enhance multimodal reasoning. Specifically, we first use a Multimodal and Multi-hop reader to attend to different chunks in the context based on the question, and then search for multi-hop candidate tokens within these chunks. Next, we construct a graph to represent the relations among the chunks and apply a Sparse Matrix-Tree algorithm to learn a hierarchical, informative structure over it. We then use a Hierarchy-aware Message Passing mechanism to perform multi-hop reasoning along the selected edges and update the node representations. Finally, a graph-selection decoder generates the answer from the structure-enriched chunk representations. We conduct experiments on WebQA, a large-scale multimodal question answering dataset [1]. The results show that our method outperforms baseline methods in both reasoning and overall answer accuracy. We also provide qualitative analysis illustrating how our method benefits from the structural information among different modalities.
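The abstract does not spell out the Sparse Matrix-Tree algorithm, but it builds on the classical Matrix-Tree theorem for computing marginals over latent tree structures. Below is a minimal sketch of the standard dense computation in PyTorch, following the structured-attention formulation of Koo et al. (2007) and Liu and Lapata (2018); all function names, tensor shapes, and the final message-passing step are illustrative assumptions, not the paper's implementation.

```python
# Sketch (assumption: the abstract gives no exact formulation) of the dense
# Matrix-Tree computation for a soft hierarchy over chunk nodes. The paper's
# Sparse Matrix-Tree algorithm presumably sparsifies this.
import torch

def matrix_tree_marginals(edge_scores: torch.Tensor, root_scores: torch.Tensor):
    """Edge and root marginals of the distribution over non-projective
    spanning trees defined by unnormalized log scores.

    edge_scores: (n, n); edge_scores[i, j] scores the edge i -> j (i is parent).
    root_scores: (n,); root_scores[i] scores node i being the root.
    """
    n = edge_scores.size(0)
    # Potentials; zero the diagonal (no self-loops). For numerical stability a
    # real implementation would shift the scores before exponentiating.
    A = torch.exp(edge_scores) * (1.0 - torch.eye(n))
    r = torch.exp(root_scores)
    # Graph Laplacian: L[j, j] = sum_i A[i, j]; off-diagonal L[i, j] = -A[i, j].
    L = torch.diag(A.sum(dim=0)) - A
    # Replace the first row with the root potentials (Koo et al.'s variant).
    L_hat = L.clone()
    L_hat[0, :] = r
    inv = torch.linalg.inv(L_hat)
    # Marginal of edge i -> j:
    #   P(i->j) = (1 - d(j,0)) A[i,j] inv[j,j] - (1 - d(i,0)) A[i,j] inv[j,i]
    not_first = torch.ones(n)
    not_first[0] = 0.0
    edge_p = A * (not_first * inv.diagonal()).unsqueeze(0) \
           - A * not_first.unsqueeze(1) * inv.t()
    root_p = r * inv[:, 0]  # P(node i is root) = r[i] * inv[i, 0]
    return edge_p, root_p

# Sanity check: each node has exactly one parent or is the root, so the
# marginals over its incoming edges plus its root marginal sum to 1.
n = 5  # e.g., five text/image chunks
edge_p, root_p = matrix_tree_marginals(torch.randn(n, n), torch.randn(n))
print(edge_p.sum(dim=0) + root_p)  # ~ tensor of ones

# One generic soft message-passing step weighting parent-to-child messages by
# the learned edge marginals (the Hierarchy-aware Message Passing mechanism is
# not specified in the abstract; this stands in for it only illustratively).
h = torch.randn(n, 16)                        # chunk node representations
W = torch.randn(16, 16) * 0.1
h_new = torch.relu(h + (edge_p.t() @ h) @ W)  # child j gets sum_i P(i->j) h_i
```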
Keywords
Multimodal, Multi-hop reasoning, Graph Neural Networks