Few-shot Adaptation of Multi-modal Foundation Models: A Survey
CoRR (2024)
Abstract
Multi-modal (vision-language) models, such as CLIP, are replacing traditional
supervised pre-training models (e.g., ImageNet-based pre-training) as the new
generation of visual foundation models. These models learn robust and aligned
semantic representations from billions of internet image-text pairs and can be
applied to various downstream tasks in a zero-shot manner. However, in
some fine-grained domains like medical imaging and remote sensing, the
performance of multi-modal foundation models often leaves much to be desired.
Consequently, many researchers have begun to explore few-shot adaptation
methods for these models, gradually deriving three main technical approaches:
1) prompt-based methods, 2) adapter-based methods, and 3) external
knowledge-based methods. Nevertheless, this rapidly developing field has
produced numerous results, yet no comprehensive survey has systematically
organized this research progress. Therefore, in this survey, we introduce and
analyze the research advancements in few-shot adaptation methods for
multi-modal models, summarize commonly used datasets and experimental setups,
and compare the results of different methods. In addition, due to the lack of
reliable theoretical support for existing methods, we derive the few-shot
adaptation generalization error bound for multi-modal models. The theorem
reveals that the generalization error of multi-modal foundation models is
constrained by three factors: domain gap, model capacity, and sample size.
Based on this, we propose three possible solutions from the following aspects:
1) adaptive domain generalization, 2) adaptive model selection, and 3) adaptive
knowledge utilization.
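
The theorem itself is not reproduced on this page. As a rough illustration only,
generalization bounds of this kind typically take a form in which the three
factors named above enter as separate additive terms. The LaTeX sketch below
shows such a generic shape; the symbols d(.,.) for the domain gap, C(H) for
model capacity, and n for the few-shot sample size are illustrative
placeholders, not the notation or the exact statement of the paper's theorem:

    \epsilon_{\mathcal{T}}(h) \;\le\; \hat{\epsilon}_{\mathcal{S}}(h)
      \;+\; d(\mathcal{D}_{\mathcal{S}}, \mathcal{D}_{\mathcal{T}})
      \;+\; O\!\left(\sqrt{\frac{\mathcal{C}(\mathcal{H}) + \log(1/\delta)}{n}}\right)

Read this way, shrinking any of the three terms (closing the domain gap,
controlling effective model capacity, or increasing the number of adaptation
samples) tightens the bound, which is what motivates the three proposed
directions above.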
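
As a concrete illustration of the second family of methods listed above
(adapter-based), the following is a minimal sketch in the spirit of
CLIP-Adapter: a small residual MLP is trained on a few labeled examples per
class on top of frozen CLIP features, while both CLIP encoders stay untouched.
The names FeatureAdapter and adapt_few_shot, and all hyper-parameter values,
are illustrative assumptions rather than the implementation of any particular
surveyed method; the feature tensors are assumed to be pre-extracted with a
frozen CLIP image encoder and text encoder.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureAdapter(nn.Module):
        """Lightweight bottleneck MLP on top of frozen CLIP image features.

        The adapted features are blended back into the originals with a
        residual ratio alpha, so the zero-shot behavior is only gently shifted.
        """
        def __init__(self, dim: int, reduction: int = 4, alpha: float = 0.2):
            super().__init__()
            self.alpha = alpha
            self.mlp = nn.Sequential(
                nn.Linear(dim, dim // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(dim // reduction, dim),
                nn.ReLU(inplace=True),
            )

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.alpha * self.mlp(feats) + (1.0 - self.alpha) * feats

    def adapt_few_shot(image_feats, labels, text_feats, epochs=50, lr=1e-3):
        # image_feats: (N, D) L2-normalized features of the K-shot support images
        # labels:      (N,)   integer class indices (torch.long)
        # text_feats:  (C, D) L2-normalized class-name text features from the frozen text encoder
        adapter = FeatureAdapter(image_feats.shape[1])
        opt = torch.optim.Adam(adapter.parameters(), lr=lr)
        for _ in range(epochs):
            feats = F.normalize(adapter(image_feats), dim=-1)
            logits = 100.0 * feats @ text_feats.t()  # CLIP-style scaled cosine similarity
            loss = F.cross_entropy(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return adapter

At inference time, test-image features are passed through the trained adapter
and classified by cosine similarity against the same class-name text features,
which is how adapter-based methods preserve CLIP's zero-shot protocol while
injecting few-shot knowledge from the target domain.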