How to Use Language Expert to Assist Inference for Visual Commonsense Reasoning

2023 23rd IEEE International Conference on Data Mining Workshops (ICDMW 2023)

Abstract
The Visual Commonsense Reasoning (VCR) task requires a Vision and Language Model (VLM) to capture cognition-level clues from visual-language input and to give the right answers to questions together with the rationales behind them. Although Pretrained Language Models (PLMs) have recently been used as powerful in-domain knowledge bases for various tasks such as image segmentation and visual question answering, their ability to generalize to unseen multi-modal data in an out-of-domain setting remains unexplored. In this paper, we explore how a PLM can assist a VLM on the challenging VCR task and propose a framework called Vision and Language Assisted with Expert Language Model (VLAELM). VLAELM employs a PLM with expert-level commonsense knowledge to assist reasoning, which is difficult for a VLM to learn from scarce multi-modal data alone. Experiments show that VLAELM achieves significant improvements over strong baselines. Moreover, we validate the credibility of the language expert as a knowledge base and measure the trade-off between generalization and specialization in the PLM.
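The abstract does not specify how the PLM's commonsense knowledge is combined with the VLM's predictions. As a minimal illustrative sketch only, not the paper's VLAELM, the code below late-fuses a hypothetical VLM's answer logits with a plausibility score from an off-the-shelf causal PLM (GPT-2); the function names, the fusion weight alpha, and the placeholder VLM logits are all assumptions for illustration.

```python
# Minimal late-fusion sketch (hypothetical, NOT the authors' VLAELM):
# a causal PLM scores each answer candidate's plausibility, and that
# score is blended with answer logits from a separately trained VLM.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
plm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def plm_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the PLM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = plm(ids, labels=ids)  # out.loss is the mean negative log-likelihood
    return -out.loss.item()

def fuse_scores(vlm_logits, answers, question, alpha=0.5):
    """Blend VLM logits with PLM plausibility (hypothetical fusion rule).

    Mixing raw logits with log-likelihoods ignores scale differences;
    a real system would calibrate or normalize the two score ranges.
    """
    plm_scores = torch.tensor(
        [plm_log_likelihood(f"{question} {a}") for a in answers]
    )
    return alpha * vlm_logits + (1 - alpha) * plm_scores

# Usage: vlm_logits would come from the VLM's answer-scoring head.
question = "Why is the man holding an umbrella?"
answers = [
    "Because it is raining.",
    "Because he is swimming.",
    "Because the umbrella is his pet.",
    "Because it might rain soon.",
]
vlm_logits = torch.tensor([2.1, -0.5, -1.2, 1.8])  # placeholder VLM output
fused = fuse_scores(vlm_logits, answers, question)
print("Predicted answer:", answers[fused.argmax().item()])
```

Here the PLM's score should penalize commonsense-violating candidates (e.g., "the umbrella is his pet") even when the VLM's visual evidence is ambiguous, which is the kind of assistance the abstract describes.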
Keywords
visual commonsense reasoning,pre-trained language model,multi-modal fusion,language expert