LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions
arXiv (2024)
Abstract
Controllable text-to-image generation synthesizes visual text and objects in
images under specified conditions, and is frequently applied to emoji and
poster generation. Visual text rendering and layout-to-image generation tasks
have been popular in controllable text-to-image generation. However, each of
these tasks typically focuses on generating or rendering a single modality,
leaving gaps yet to be bridged between the approaches designed for each task.
In this paper, we combine text rendering and
layout-to-image generation tasks into a single task, layout-controllable
text-object synthesis (LTOS), which aims to synthesize images containing
objects and visual text from a predefined object layout and text contents. As
suitable datasets are not readily available for the LTOS task, we construct a
layout-aware text-object synthesis dataset with detailed, well-aligned
labels of visual text and object information. Based on the dataset, we propose
a layout-controllable text-object adaptive fusion (TOF) framework, which
generates images with clear, legible visual text and plausible objects. We
construct a visual-text rendering module to synthesize text and employ an
object-layout control module to generate objects, and we integrate the two
modules so that text content and objects are generated harmoniously in the
same image. To improve image-text integration, we propose a self-adaptive
cross-attention fusion module that helps image generation attend more to
important text information. Within this fusion module, a self-adaptive
learnable factor flexibly controls the influence of cross-attention
outputs on image generation. Experimental results show that our method
outperforms the state-of-the-art in LTOS, text rendering, and layout-to-image
tasks, enabling harmonious visual text rendering and object generation.
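
The abstract does not give the exact form of the self-adaptive cross-attention
fusion, so the following is only a minimal sketch of one plausible reading: a
gated residual in which image features attend to text features and a scalar
factor scales how much of the attention output is added back. The names
adaptive_fusion and alpha, the tanh gate, and the omission of learned
query/key/value projections are all assumptions made for illustration, not
details taken from the paper. In training, alpha would be a learnable
parameter; here it is passed in as a plain number.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value rows, one output column at a time.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def adaptive_fusion(image_feats, text_feats, alpha):
    """Hypothetical gated fusion: h + tanh(alpha) * CrossAttn(h, text).

    alpha stands in for the learnable factor (an assumption); tanh keeps the
    gate in (-1, 1), and alpha = 0 leaves the image features unchanged.
    """
    attn = cross_attention(image_feats, text_feats, text_feats)
    gate = math.tanh(alpha)
    return [[h + gate * a for h, a in zip(h_row, a_row)]
            for h_row, a_row in zip(image_feats, attn)]


if __name__ == "__main__":
    image_feats = [[1.0, 0.0], [0.0, 1.0]]             # two image tokens
    text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three text tokens
    print(adaptive_fusion(image_feats, text_feats, alpha=0.0))
    print(adaptive_fusion(image_feats, text_feats, alpha=1.0))
```

With alpha at zero the gate is closed and the image features pass through
untouched; as alpha grows, the text-conditioned attention output contributes
more, which is the kind of flexible control the abstract describes.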