Implicit and Explicit Language Guidance for Diffusion-based Visual Perception
arXiv (2024)
Abstract
Text-to-image diffusion models have shown powerful ability on conditional
image synthesis. With large-scale vision-language pre-training, diffusion
models are able to generate high-quality images with rich texture and
reasonable structure under different text prompts. However, it is an open
problem to adapt the pre-trained diffusion model for visual perception. In this
paper, we propose an implicit and explicit language guidance framework for
diffusion-based perception, named IEDP. Our IEDP comprises an implicit
language guidance branch and an explicit language guidance branch. The implicit
branch employs a frozen CLIP image encoder to directly generate implicit text
embeddings that are fed to the diffusion model, without using explicit text
prompts. The explicit branch uses the ground-truth labels of the corresponding
images as text prompts to condition feature extraction in the diffusion model.
During training, we jointly train the diffusion model by sharing the model
weights between these two branches, so that the implicit and explicit branches
jointly guide feature learning. During inference, we employ only the implicit
branch for the final prediction, which does not require any ground-truth
labels. Experiments
are performed on two typical perception tasks, including semantic segmentation
and depth estimation. Our IEDP achieves promising performance on both tasks.
For semantic segmentation, our IEDP achieves an mIoU score of 55.9% on the
validation set, outperforming the baseline method VPD by 2.2%. For depth
estimation, our IEDP outperforms the baseline method VPD with a relative gain
of 10.2%.
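The dual-branch scheme described above can be illustrated with a minimal sketch. This is not the paper's implementation: the CLIP encoder, the text-prompt encoder, and the diffusion backbone are all replaced by hypothetical stand-in matrices, and the dimensions are made up. The sketch only shows the structural idea: both branches condition the same shared weights during training, while inference runs the implicit branch alone, so no ground-truth labels are needed at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real model uses CLIP / Stable Diffusion dimensions.
IMG_DIM, EMB_DIM, FEAT_DIM = 16, 8, 8

# Implicit branch: a frozen CLIP image encoder (stand-in) maps the image
# directly to an "implicit text embedding", with no explicit prompt.
W_clip = rng.standard_normal((IMG_DIM, EMB_DIM))
def implicit_embedding(image):
    return np.tanh(image @ W_clip)

# Explicit branch: ground-truth class labels rendered as a text prompt,
# mocked here as a lookup table of label embeddings.
label_table = rng.standard_normal((10, EMB_DIM))
def explicit_embedding(label_ids):
    return label_table[label_ids].mean(axis=0)

# Shared diffusion backbone (stand-in): ONE weight matrix used by both
# branches, so updates from either branch change the same parameters.
W_shared = rng.standard_normal((EMB_DIM, FEAT_DIM))
def backbone_features(text_emb):
    return text_emb @ W_shared

def joint_training_loss(image, label_ids, target):
    # Both branches condition the same shared weights during training.
    f_implicit = backbone_features(implicit_embedding(image))
    f_explicit = backbone_features(explicit_embedding(label_ids))
    return (np.mean((f_implicit - target) ** 2)
            + np.mean((f_explicit - target) ** 2))

def inference(image):
    # Only the implicit branch runs at test time: no labels required.
    return backbone_features(implicit_embedding(image))
```

The key design point the sketch mirrors is weight sharing: because `W_shared` appears in both loss terms, the label-conditioned explicit branch shapes the features the implicit branch uses at inference, even though the explicit branch is discarded at test time.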