FERNIE-ViL: Facial Expression Enhanced Vision-and-Language Model

2021 IEEE 20th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)

Abstract
Visual cognition requires analyzing the actions, intentions, and emotions of persons in a given image. Visual Commonsense Reasoning (VCR) is a task in which a model selects answers to questions about a given image and rationales that justify those answers. In VCR, facial expressions are important nonverbal signals because they convey emotions and intentions in human interactions. However, ERNIE-ViL and UNITER, vision-and-language models that learn joint image and text representations, do not capture these signals. We find that ERNIE-ViL and UNITER struggle with questions that require identifying emotions. In this paper, we therefore propose FERNIE-ViL, which adapts a facial expression recognition module to an existing vision-and-language model. Experimental results (a 2.4 percentage-point improvement on VCR Q→A and a 0.3 percentage-point improvement on VCR QA→R) demonstrate that our method can enhance visual commonsense reasoning by better understanding human interactions.
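The abstract does not specify how the facial expression recognition module is attached to the vision-and-language backbone. As an illustration only, the following is a minimal sketch of one plausible fusion scheme, in which per-region expression features from an off-the-shelf FER model are projected and added to detector region features before they enter the vision-and-language encoder. The class and parameter names (FacialExpressionFusion, expr_dim, face_mask) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class FacialExpressionFusion(nn.Module):
    """Hypothetical sketch: project per-face expression features and add them
    to the corresponding region features before a vision-and-language encoder."""

    def __init__(self, expr_dim: int = 7, region_dim: int = 2048):
        super().__init__()
        # Map expression logits (e.g., 7 basic emotions) into the region feature space.
        self.expr_proj = nn.Linear(expr_dim, region_dim)

    def forward(self, region_feats, expr_logits, face_mask):
        # region_feats: (batch, num_regions, region_dim) from the object detector
        # expr_logits:  (batch, num_regions, expr_dim) from an FER model,
        #               zero-filled for regions that are not faces
        # face_mask:    (batch, num_regions, 1), 1.0 where the region is a face
        expr_emb = self.expr_proj(expr_logits)
        # Only face regions receive the expression signal.
        return region_feats + face_mask * expr_emb


# Usage sketch: enriched features replace the plain region features
# as visual input to the vision-and-language model.
fusion = FacialExpressionFusion()
regions = torch.randn(2, 36, 2048)           # detector region features
expr = torch.randn(2, 36, 7)                  # FER logits per region
mask = (torch.rand(2, 36, 1) > 0.8).float()   # which regions are faces
enriched = fusion(regions, expr, mask)
```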
Keywords
Artificial Intelligence, Machine Commonsense, Commonsense Reasoning, Multi-modal, Facial Expression, Natural Language Processing, Visual Recognition