CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
CoRR (2024)
Abstract
When exploring the development of Artificial General Intelligence (AGI), a
critical task for these models involves interpreting and processing information
from multiple image inputs. However, Large Multimodal Models (LMMs) encounter
two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a
tendency to blend information across multiple images. We first extensively
investigate the capability of LMMs to perceive fine-grained visual details when
dealing with multiple input images. The research focuses on two aspects: first,
image-to-image matching (to evaluate whether LMMs can effectively reason and
pair relevant images), and second, multi-image-to-text matching (to assess
whether LMMs can accurately capture and summarize detailed image information).
We conduct evaluations on a range of both open-source and closed-source large
models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model
performance, we further develop a Contrastive Chain-of-Thought (CoCoT)
prompting approach tailored to multi-input multimodal models. This method prompts
LMMs to compare the similarities and differences among multiple image inputs,
and then guides them to answer detailed questions about the multi-image inputs
based on the identified similarities and differences. Our experimental results
showcase CoCoT's proficiency in enhancing the multi-image comprehension
capabilities of large multimodal models.
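The two-step structure described above (first contrast the images, then answer conditioned on that contrast) can be sketched as a prompt template. The wording below is an illustrative assumption, not the paper's verbatim template:

```python
def build_cocot_prompt(question: str, num_images: int) -> str:
    """Assemble a Contrastive Chain-of-Thought (CoCoT) style prompt.

    Step 1 asks the model to compare the similarities and differences
    among the input images; Step 2 asks it to answer the question based
    on that comparison. The exact phrasing is a hypothetical sketch,
    not the authors' official template.
    """
    image_refs = ", ".join(f"Image {i + 1}" for i in range(num_images))
    return (
        f"You are given {num_images} images: {image_refs}.\n"
        "Step 1: Identify the similarities and differences among these images.\n"
        "Step 2: Based on the similarities and differences you identified, "
        f"answer the following question: {question}"
    )


prompt = build_cocot_prompt(
    "Which image depicts the same object as Image 1?", num_images=3
)
print(prompt)
```

In practice, the returned string would be sent alongside the image inputs to an LMM such as GPT-4V or Gemini; the template itself is model-agnostic.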