Learning by Imagination: A Joint Framework for Text-Based Image Manipulation and Change Captioning.

IEEE transactions on multimedia（2023）

引用 4|浏览63

暂无评分

摘要

Image and text are dual modalities of our semantic interpretation. Changing images based on text descriptions allows us to imagine and visualize the world (a.k.a. text-based image manipulation (TIM)). In this paper, we introduce a framework that combines TIM with change captioning (CC) and utilizes the benefits of co-training. CC aims to describe what has changed in a scene and can be regarded as the inverse version of TIM where both tasks rely on generative networks. These generative networks can be regarded as data producers of each other and unlike previous methods, we discover that integrating their learning procedures can benefit both. Since the CC module describes differences between two images as text, the CC module can be used as evaluation criteria and provide feedback. Furthermore, we utilize a shared attention mechanism in TIM and CC modules to localize towards prominent regions as well as enabling a change-aware discriminator. In the opposite direction, the output image synthesized by the TIM module can be assessed with the CC module, by checking whether the ground truth text description can be redescribed. Following this insight, not only do we boost the training of the TIM module, but we also utilize the TIM module as additional supervision for the CC training. Experimental results show that our framework outperforms existing TIM methods on several datasets substantially and we achieve marginal improvements in the CC module. To our best knowledge, this is the first study dedicated to the joint training of TIM and CC tasks.

查看译文

关键词

Text-based image manipulation,change captioning,generative networks,GANs,reinforcement learning

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要