How to Use Language Expert to Assist Inference for Visual Commonsense Reasoning

2023 23rd IEEE International Conference on Data Mining Workshops (ICDMW 2023)

Abstract
The Visual Commonsense Reasoning (VCR) task requires a Vision and Language Model (VLM) to capture cognitive-level clues from visual-language input and to give both the right answers to questions and the rationales behind them. Although Pretrained Language Models (PLMs) have recently been used as powerful in-domain knowledge bases for tasks such as image segmentation and visual question answering, their ability to generalize to unseen multi-modal data in an out-of-domain setting remains unexplored. In this paper, we explore how to use a PLM to assist a VLM on the challenging VCR task and propose a framework called Vision and Language Assisted with Expert Language Model (VLAELM). VLAELM employs a PLM with expert-level commonsense knowledge to assist reasoning that is difficult for a VLM to learn from scarce multi-modal data alone. Experiments show that VLAELM achieves significant improvements over strong baselines. Moreover, we validate the credibility of the language expert as a knowledge base and measure the application value of generalization versus specialization in PLMs.
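A minimal sketch of the general idea the abstract describes (not the paper's actual VLAELM implementation): a pre-trained language model scores the commonsense plausibility of each VCR answer candidate, and these scores are fused with a VLM's visually grounded scores. Here GPT-2 stands in for the "language expert", the VLM scores are placeholder values, and the fusion weight `alpha` is an assumption.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# GPT-2 as a stand-in for the expert PLM described in the paper.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
plm = GPT2LMHeadModel.from_pretrained("gpt2")
plm.eval()

def plm_log_likelihood(text: str) -> float:
    """Average token log-likelihood of `text` under the PLM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = plm(**enc, labels=enc["input_ids"])
    return -out.loss.item()  # loss is the mean negative log-likelihood

question = "Why is the man holding an umbrella?"
answers = [
    "Because it is raining outside.",
    "Because he is about to eat lunch.",
    "Because the umbrella is a gift for his dog.",
]

# PLM plausibility of each question-answer pair (commonsense signal only).
plm_scores = torch.tensor(
    [plm_log_likelihood(f"Question: {question} Answer: {a}") for a in answers]
)

# Hypothetical VLM scores grounded in the image (placeholder values).
vlm_scores = torch.tensor([2.1, 0.4, 0.7])

# Late fusion: weighted sum of normalized scores; `alpha` is an assumed weight.
alpha = 0.5
fused = alpha * plm_scores.softmax(dim=0) + (1 - alpha) * vlm_scores.softmax(dim=0)
print("Predicted answer:", answers[int(fused.argmax())])
```

This late-fusion scheme is only one plausible way a language expert could assist a VLM; the paper's own multi-modal fusion may differ.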
Key words
visual commonsense reasoning, pre-trained language model, multi-modal fusion, language expert