GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
CoRR (2024)
Abstract
Knowledge-based visual question answering (VQA) requires world knowledge
beyond the image for accurate answers. Recently, instead of extra knowledge
bases, a large language model (LLM) like GPT-3 is activated as an implicit
knowledge engine to jointly acquire and reason over the necessary knowledge for
answering by converting images into textual information (e.g., captions and
answer candidates). However, such conversion may introduce irrelevant
information, which causes the LLM to misinterpret images and ignore visual
details crucial for accurate knowledge. We argue that a multimodal large language
model (MLLM) is a better implicit knowledge engine than an LLM, owing to its
superior capability of visual understanding. Despite this, how to activate the
capacity of an MLLM as the implicit knowledge engine has not yet been explored.
Therefore, we propose GeReA, a generate-reason framework that prompts an MLLM
like InstructBLIP with question-relevant visual and linguistic information to
generate knowledge-relevant descriptions and reasons over those descriptions for
knowledge-based VQA. Specifically, the question-relevant image regions and
question-specific manual prompts are encoded by the MLLM to generate
knowledge-relevant descriptions, referred to as question-aware prompt captions.
After that, the question-aware prompt captions, the image-question pair, and
similar samples are fed into a multimodal reasoning model to learn a joint
knowledge-image-question representation for answer prediction. GeReA unlocks
the use of the MLLM as an implicit knowledge engine, surpassing all previous
state-of-the-art methods on the OK-VQA and A-OKVQA datasets, with test accuracies
of 66.5%. Our code is available at
https://github.com/Upper9527/GeReA.
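The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal structural illustration, not the authors' implementation: the MLLM (e.g., InstructBLIP) and the multimodal reasoner are replaced by stub callables, and the prompt template, function names, and `Sample` fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """A retrieved similar sample (fields are illustrative)."""
    image_region: str
    question: str
    answer: str = ""


def build_prompt(question: str) -> str:
    # Question-specific manual prompt steering the MLLM toward
    # knowledge-relevant details (template is hypothetical).
    return f"Describe the details in the image that help answer: {question}"


def generate_prompt_captions(mllm: Callable[[str, str], str],
                             regions: List[str], question: str) -> List[str]:
    # Stage 1 (generate): each question-relevant image region, paired with
    # the manual prompt, is encoded by the MLLM to produce a
    # question-aware prompt caption.
    prompt = build_prompt(question)
    return [mllm(region, prompt) for region in regions]


def reason_answer(reasoner: Callable[[str], str], captions: List[str],
                  image: str, question: str, similar: List[Sample]) -> str:
    # Stage 2 (reason): the prompt captions, the image-question pair, and
    # retrieved similar samples are combined into one joint input for
    # answer prediction.
    examples = "\n".join(f"Q: {s.question} A: {s.answer}" for s in similar)
    context = "\n".join(captions)
    joint_input = f"{examples}\n{context}\nImage: {image}\nQ: {question}"
    return reasoner(joint_input)


# Toy stubs standing in for InstructBLIP and the multimodal reasoner.
mllm_stub = lambda region, prompt: f"caption({region})"
reasoner_stub = lambda joint_input: "predicted answer"

question = "What sport is shown?"
captions = generate_prompt_captions(mllm_stub, ["region1", "region2"], question)
answer = reason_answer(reasoner_stub, captions, "full image", question,
                       [Sample("r", "What game is this?", "frisbee")])
```

In practice, stage 1 would call a real MLLM over detected question-relevant regions, and stage 2 would be a trained multimodal model rather than a text-prompted stub.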