Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
arxiv(2024)
摘要
Embodied AI agents require a fine-grained understanding of the physical world
mediated through visual and language inputs. Such capabilities are difficult to
learn solely from task-specific data. This has led to the emergence of
pre-trained vision-language models as a tool for transferring representations
learned from internet-scale data to downstream tasks and new domains. However,
commonly used contrastively trained representations such as in CLIP have been
shown to fail at enabling embodied agents to gain a sufficiently fine-grained
scene understanding – a capability vital for control. To address this
shortcoming, we consider representations from pre-trained text-to-image
diffusion models, which are explicitly optimized to generate images from text
prompts and as such, contain text-conditioned representations that reflect
highly fine-grained visuo-spatial information. Using pre-trained text-to-image
diffusion models, we construct Stable Control Representations which allow
learning downstream control policies that generalize to complex, open-ended
environments. We show that policies learned using Stable Control
Representations are competitive with state-of-the-art representation learning
approaches across a broad range of simulated control settings, encompassing
challenging manipulation and navigation tasks. Most notably, we show that
Stable Control Representations enable learning policies that exhibit
state-of-the-art performance on OVMM, a difficult open-vocabulary navigation
benchmark.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要