Large-Scale Actionless Video Pre-Training via Discrete Diffusion for Efficient Policy Learning
CoRR (2024)
Abstract
Learning a generalist embodied agent capable of completing multiple tasks
poses challenges, primarily stemming from the scarcity of action-labeled
robotic datasets. In contrast, vast amounts of human video exist, capturing
intricate tasks and interactions with the physical world. This raises the
promising prospect of pre-training on actionless human videos and transferring
that knowledge to facilitate robot policy learning from limited robot
demonstrations. In this paper, we introduce a novel framework that leverages a
unified discrete diffusion to combine generative pre-training on human videos
and policy fine-tuning on a small number of action-labeled robot videos. We
start by compressing both human and robot videos into unified video tokens. In
the pre-training stage, we employ a discrete diffusion model with a
mask-and-replace diffusion strategy to predict future video tokens in the
latent space. In the fine-tuning stage, we harness the imagined future videos
to guide low-level action learning on a limited set of action-labeled robot data.
Experiments demonstrate that our method generates high-fidelity future videos
for planning and yields fine-tuned policies that outperform previous
state-of-the-art approaches, with superior generalization ability. Our project
website is available at https://video-diff.github.io/.
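
As a rough illustration of the pre-training stage described above, here is a minimal sketch of a VQ-Diffusion-style mask-and-replace corruption over discrete video tokens (ids from a learned codebook), paired with a simplified clean-token prediction loss. All names (`VOCAB_SIZE`, `MASK_ID`, the `gamma_bar`/`beta_bar` schedules, the `model` interface) are illustrative assumptions, not the authors' implementation, and the plain cross-entropy stands in for the full variational objective of the discrete diffusion.

```python
# Minimal sketch (assumptions, not the authors' code): mask-and-replace
# corruption of video token ids for discrete diffusion in the latent space.
import torch
import torch.nn.functional as F

VOCAB_SIZE = 1024            # size of the video-token codebook (assumed)
MASK_ID = VOCAB_SIZE         # extra absorbing [MASK] token id
T = 100                      # number of diffusion steps (assumed)

# Cumulative schedules: by step t a token has been masked with probability
# gamma_bar[t], uniformly replaced with probability beta_bar[t], else kept.
gamma_bar = torch.linspace(0.0, 0.9, T + 1)
beta_bar = torch.linspace(0.0, 0.1, T + 1)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """One-shot corruption of clean token ids x0 into x_t.

    Each token independently: replaced by a uniform random token with
    probability beta_bar[t], turned into [MASK] with probability
    gamma_bar[t], and kept otherwise ([MASK] is absorbing).
    """
    u = torch.rand(x0.shape)
    xt = x0.clone()
    replace = u < beta_bar[t]
    mask = (u >= beta_bar[t]) & (u < beta_bar[t] + gamma_bar[t])
    xt[replace] = torch.randint(0, VOCAB_SIZE, (int(replace.sum()),))
    xt[mask] = MASK_ID
    return xt

def loss_fn(model, x0, cond):
    """Train a denoising model to recover the clean future tokens x0 from
    (x_t, t), conditioned on past-frame tokens (and optionally language).
    """
    t = torch.randint(1, T + 1, (1,)).item()
    xt = q_sample(x0, t)
    logits = model(xt, t, cond)                  # (..., VOCAB_SIZE)
    return F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), x0.reshape(-1))
```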
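
In the same hedged spirit, a sketch of the fine-tuning stage: the pre-trained diffusion model imagines future video tokens, and a small action head, conditioned on the current observation tokens and the imagined future, is regressed on the limited action-labeled robot data. `ActionHead`, `diffusion_model.sample`, and all shapes are hypothetical stand-ins, not the paper's actual architecture.

```python
# Sketch only: an action head guided by imagined future video tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionHead(nn.Module):
    """Maps observation tokens plus imagined future tokens to an action."""

    def __init__(self, vocab_size: int, dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, dim)   # +1 for [MASK]
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim))

    def forward(self, obs_tokens, future_tokens):
        ctx = torch.cat([obs_tokens, future_tokens], dim=1)  # (B, L)
        h = self.embed(ctx).mean(dim=1)                      # pooled context
        return self.net(h)                                   # (B, action_dim)

def finetune_step(diffusion_model, head, obs_tokens, gt_actions):
    """One fine-tuning step on action-labeled robot data (assumed API)."""
    with torch.no_grad():
        # Imagine future video tokens with the pre-trained diffusion model.
        future_tokens = diffusion_model.sample(obs_tokens)
    pred = head(obs_tokens, future_tokens)
    return F.mse_loss(pred, gt_actions)
```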