Prompt Decoupling for Text-to-Image Person Re-identification
CoRR (2024)
Abstract
Text-to-image person re-identification (TIReID) aims to retrieve the target
person from an image gallery via a textual description query. Recently,
pre-trained vision-language models like CLIP have attracted significant
attention and have been widely utilized for this task due to their robust
capacity for semantic concept learning and rich multi-modal knowledge. However,
recent CLIP-based TIReID methods commonly rely on direct fine-tuning of the
entire network to adapt CLIP to the TIReID task. Although these methods achieve
competitive performance, they are suboptimal because they require the model to
perform domain adaptation and task adaptation simultaneously. To address this
issue, we attempt to decouple these two processes during the training stage.
Specifically, we introduce the prompt tuning strategy to enable domain
adaptation and propose a two-stage training approach to disentangle domain
adaptation from task adaptation. In the first stage, we freeze the two CLIP
encoders and optimize only the prompts, alleviating the domain gap between
CLIP's original training data and the downstream task. In the second stage, we
keep the prompts fixed and fine-tune the CLIP model so that it prioritizes
capturing fine-grained information, which is better suited to the TIReID task.
Finally, we evaluate the effectiveness of our method on three widely used
datasets. Compared to direct fine-tuning, our method achieves significant
improvements.
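The two-stage schedule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tiny linear layers stand in for CLIP's pretrained image and text encoders, and the prompt design (learnable vectors toggled via `requires_grad`) is an assumption based on the abstract.

```python
import torch
import torch.nn as nn

class PromptedCLIPStub(nn.Module):
    """Minimal stand-in for a CLIP-style dual encoder with learnable prompts.
    The real method would wrap the pretrained CLIP towers; the linear layers
    here are placeholders so the two-stage schedule can be demonstrated."""
    def __init__(self, dim: int = 16, n_prompt_tokens: int = 4):
        super().__init__()
        self.image_encoder = nn.Linear(dim, dim)  # placeholder for CLIP ViT
        self.text_encoder = nn.Linear(dim, dim)   # placeholder for CLIP text tower
        # Learnable prompt vectors (assumed to be prepended to the text input).
        self.prompts = nn.Parameter(torch.randn(n_prompt_tokens, dim) * 0.02)

def set_stage(model: PromptedCLIPStub, stage: int) -> None:
    """Stage 1: freeze both encoders, train only the prompts (domain adaptation).
    Stage 2: freeze the prompts, fine-tune the encoders (task adaptation)."""
    train_prompts = (stage == 1)
    model.prompts.requires_grad_(train_prompts)
    for enc in (model.image_encoder, model.text_encoder):
        for p in enc.parameters():
            p.requires_grad_(not train_prompts)

model = PromptedCLIPStub()
set_stage(model, 1)
stage1_trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
set_stage(model, 2)
stage2_trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
```

In each stage an optimizer would be built over only the currently trainable parameters, e.g. `torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)`.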