Can MLLMs Perform Text-to-Image In-Context Learning?
CoRR (2024)
Abstract
The evolution from Large Language Models (LLMs) to Multimodal Large Language
Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to
its multimodal counterpart. Existing studies have concentrated primarily
on image-to-text ICL. However, Text-to-Image ICL (T2I-ICL), with its unique
characteristics and potential applications, remains underexplored. To address
this gap, we formally define the task of T2I-ICL and present CoBSAT, the first
T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to
benchmark six state-of-the-art MLLMs, we uncover considerable difficulties
MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the
inherent complexity of multimodality and image generation. To overcome these
challenges, we explore strategies like fine-tuning and Chain-of-Thought
prompting, demonstrating notable improvements. Our code and dataset are
available at .