MSGeN: Multimodal Selective Generation Network for Grounded Explanations

Dingbang Li, Wenzhou Chen, Xin Lin

Electronics (2024)

Abstract
Modern models have shown impressive capabilities in visual reasoning tasks, but the interpretability of their decision-making processes remains a challenge, leaving their reliability uncertain. In response, we present the Multimodal Selective Generation Network (MSGeN), a novel approach to enhancing interpretability and transparency in visual reasoning. MSGeN generates explanations that seamlessly integrate information from multiple modalities, providing a comprehensive and intuitive account of its decisions. The model consists of five collaborative components: (1) the Multimodal Encoder, which encodes and fuses the input data; (2) the Reasoner, which generates stepwise inference states; (3) the Selector, which chooses the modality used for each step's explanation; (4) the Speaker, which generates natural language descriptions; and (5) the Pointer, which produces visual cues. Together, these components produce explanations enriched with natural language context and grounded visual cues. Extensive experiments show that MSGeN surpasses existing multimodal explanation generation models across various metrics, including BLEU, METEOR, ROUGE, CIDEr, SPICE, and Grounding. We also present detailed visual examples that highlight MSGeN's ability to generate comprehensive and coherent explanations, demonstrating its effectiveness through practical case studies.
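To make the five-component pipeline concrete, the sketch below shows one plausible per-step selective generation loop: a fused multimodal state drives a Reasoner, whose stepwise state is scored by a Selector for routing either to a Speaker (language tokens) or a Pointer (image-region cues). All class names, dimensions, and module choices (GRU cell, linear heads) are illustrative assumptions based on the abstract, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a selective multimodal explanation loop in the spirit of MSGeN.
# Every module here (encoder fusion, GRU reasoner, linear selector/speaker/pointer)
# is an assumption for illustration, not the published architecture.

class MSGeNSketch(nn.Module):
    def __init__(self, hidden=512, vocab=10000, num_regions=36, max_steps=8):
        super().__init__()
        self.encoder = nn.Linear(2 * hidden, hidden)   # fuses pooled image + question features (assumed)
        self.reasoner = nn.GRUCell(hidden, hidden)     # produces stepwise inference states
        self.selector = nn.Linear(hidden, 2)           # scores text vs. visual modality per step
        self.speaker = nn.Linear(hidden, vocab)        # natural-language token logits
        self.pointer = nn.Linear(hidden, num_regions)  # logits over image regions (visual cue)
        self.max_steps = max_steps

    def forward(self, img_feat, txt_feat):
        # img_feat, txt_feat: (batch, hidden) pooled features from pretrained encoders (assumed)
        fused = torch.tanh(self.encoder(torch.cat([img_feat, txt_feat], dim=-1)))
        state = fused
        steps = []
        for _ in range(self.max_steps):
            state = self.reasoner(fused, state)          # stepwise inference state
            modality = self.selector(state).softmax(-1)  # P(text) vs. P(visual) for this step
            word_logits = self.speaker(state)            # language description for the step
            region_logits = self.pointer(state)          # grounded visual cue for the step
            steps.append((modality, word_logits, region_logits))
        return steps
```

At inference time, one would take the argmax of the per-step modality scores to decide whether the step is verbalized by the Speaker or grounded by the Pointer; during training both heads can be supervised jointly.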
Keywords
visual question answering,explanation generation,multimodal,vision and language