DF-CLIP: Towards Disentangled and Fine-grained Image Editing from Text

2023 IEEE International Conference on Multimedia and Expo (ICME 2023)

Abstract
Inspired by CLIP's excellent image/text representation capability and StyleGAN's disentangled latent space, text-guided image editing techniques have made significant progress. However, because CLIP cannot perform local, fine-grained image/text alignment, existing methods suffer from entanglement problems. Moreover, they lack deep interaction between textual tokens and visual features, which may lead to unfaithful editing results. In this paper, we propose DF-CLIP for Disentangled and Fine-grained text-guided image editing. Specifically, we design a novel dual-branch LatentMask module to generate more accurate editing directions in StyleGAN's latent space, which can avoid changes in text-unrelated areas. Furthermore, we present a Multimodal Interaction module that associates the text embedding with the image embedding and performs a deep interaction between them, which greatly enhances the guidance of text in the image editing process and accelerates training convergence. Extensive experiments show that our models produce more disentangled and natural editing results with a shorter training time.
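The abstract does not describe the internals of the LatentMask module, so the following is only a minimal sketch of the general idea of gating an editing direction in latent space so that unrelated parts of the latent code stay untouched. Everything here is an assumption made for illustration: the function name `masked_latent_edit`, the per-layer sigmoid mask, the single-branch design, and the 18 x 512 latent shape (a common StyleGAN2 W+ layout). It is not the authors' implementation.

```python
import numpy as np

def masked_latent_edit(w, delta, mask_logits, strength=1.0):
    """Hypothetical helper: shift a latent code along an editing direction,
    gated by a soft mask so that masked-out layers are left unchanged."""
    # Sigmoid maps the logits to gate values in (0, 1).
    mask = 1.0 / (1.0 + np.exp(-mask_logits))
    # The (num_layers, 1) mask broadcasts across the feature dimension.
    return w + strength * mask * delta

rng = np.random.default_rng(0)
num_layers, dim = 18, 512                        # assumed W+ shape
w = rng.standard_normal((num_layers, dim))       # stand-in for an inverted latent code
delta = rng.standard_normal((num_layers, dim))   # stand-in for a text-driven direction

# Assumed mask: only layers 4..7 are considered relevant to the text prompt.
mask_logits = np.full((num_layers, 1), -50.0)
mask_logits[4:8] = 50.0

w_edited = masked_latent_edit(w, delta, mask_logits)

# Per-layer flag: did this layer of the latent code actually move?
changed = np.abs(w_edited - w).max(axis=1) > 1e-6
print(np.flatnonzero(changed))
```

In this toy setup only the layers selected by the mask move, which mirrors the stated goal of avoiding changes in text-unrelated areas. How the paper's dual-branch module actually predicts its mask and direction is not specified in the abstract.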
Keywords
CLIP, StyleGAN, Text-guide Image Editing, LatentMask, Multi-modal Interaction