How to Use Language Expert to Assist Inference for Visual Commonsense Reasoning

2023 23rd IEEE International Conference on Data Mining Workshops (ICDMW 2023)

Abstract
The Visual Commonsense Reasoning (VCR) task requires a Vision and Language Model (VLM) to capture cognitive-level clues from visual-language input and to give both the right answers to questions and the rationales behind them. Although Pretrained Language Models (PLMs) have recently been used as powerful in-domain knowledge bases for tasks such as image segmentation and visual question answering, their ability to generalize to unseen multi-modal data in an out-of-domain setting remains unexplored. In this paper, we explore how to use a PLM to assist a VLM on the challenging VCR task and propose a framework called Vision and Language Assisted with Expert Language Model (VLAELM). VLAELM employs a PLM with expert-level commonsense knowledge to assist reasoning that is difficult for a VLM to learn from scarce multi-modal data alone. Experiments show that VLAELM achieves significant improvements over strong baselines. Moreover, we validate the credibility of the language expert as a knowledge base and measure the application value of generalization versus specialization in PLMs.
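A minimal sketch of the general idea the abstract describes (not the paper's actual VLAELM implementation): a pre-trained language model scores the commonsense plausibility of each VCR answer candidate, and these scores are fused with a VLM's visually grounded scores. Here GPT-2 stands in for the "language expert", the VLM scores are placeholder values, and the fusion weight `alpha` is an assumption.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# GPT-2 as a stand-in for the expert PLM described in the paper.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
plm = GPT2LMHeadModel.from_pretrained("gpt2")
plm.eval()

def plm_log_likelihood(text: str) -> float:
    """Average token log-likelihood of `text` under the PLM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = plm(**enc, labels=enc["input_ids"])
    return -out.loss.item()  # loss is the mean negative log-likelihood

question = "Why is the man holding an umbrella?"
answers = [
    "Because it is raining outside.",
    "Because he is about to eat lunch.",
    "Because the umbrella is a gift for his dog.",
]

# PLM plausibility of each question-answer pair (commonsense signal only).
plm_scores = torch.tensor(
    [plm_log_likelihood(f"Question: {question} Answer: {a}") for a in answers]
)

# Hypothetical VLM scores grounded in the image (placeholder values).
vlm_scores = torch.tensor([2.1, 0.4, 0.7])

# Late fusion: weighted sum of normalized scores; `alpha` is an assumed weight.
alpha = 0.5
fused = alpha * plm_scores.softmax(dim=0) + (1 - alpha) * vlm_scores.softmax(dim=0)
print("Predicted answer:", answers[int(fused.argmax())])
```

This late-fusion scheme is only one plausible way a language expert could assist a VLM; the paper's own multi-modal fusion may differ.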
Key words
visual commonsense reasoning, pre-trained language model, multi-modal fusion, language expert