DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models
CoRR (2024)
Abstract
We propose DINOBot, a novel imitation learning framework for robot
manipulation, which leverages the image-level and pixel-level capabilities of
features extracted from Vision Transformers trained with DINO. When interacting
with a novel object, DINOBot first uses these features to retrieve the most
visually similar object experienced during human demonstrations, and then uses
this object to align its end-effector with the novel object to enable effective
interaction. Through a series of real-world experiments on everyday tasks, we
show that exploiting both the image-level and pixel-level properties of vision
foundation models enables unprecedented learning efficiency and generalisation.
Videos and code are available at https://www.robot-learning.uk/dinobot.
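
The retrieve-then-align idea described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that image-level features are already available as plain vectors (the values below are placeholders, not real DINO features), that matched pixel-level keypoints between the retrieved and the novel object are already given, and that alignment is simplified to a 2D rigid transform. All function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_most_similar(query_feature, demo_features):
    """Return the name of the demonstration object whose image-level
    feature has the highest cosine similarity to the query feature."""
    return max(
        demo_features,
        key=lambda name: cosine_similarity(query_feature, demo_features[name]),
    )


def estimate_rigid_2d(src_points, dst_points):
    """Least-squares 2D rotation angle and translation mapping matched
    src keypoints onto dst keypoints (a simplification for illustration)."""
    n = len(src_points)
    cx_s = sum(p[0] for p in src_points) / n
    cy_s = sum(p[1] for p in src_points) / n
    cx_d = sum(p[0] for p in dst_points) / n
    cy_d = sum(p[1] for p in dst_points) / n

    sum_dot = 0.0    # accumulates the cosine component of the rotation
    sum_cross = 0.0  # accumulates the sine component of the rotation
    for (xs, ys), (xd, yd) in zip(src_points, dst_points):
        ax, ay = xs - cx_s, ys - cy_s
        bx, by = xd - cx_d, yd - cy_d
        sum_dot += ax * bx + ay * by
        sum_cross += ax * by - ay * bx

    theta = math.atan2(sum_cross, sum_dot)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Translation moves the rotated source centroid onto the target centroid.
    tx = cx_d - (cos_t * cx_s - sin_t * cy_s)
    ty = cy_d - (sin_t * cx_s + cos_t * cy_s)
    return theta, (tx, ty)


# Placeholder features for objects seen in demonstrations (hypothetical values).
demo_features = {"mug": [1.0, 0.0, 0.0], "bottle": [0.0, 1.0, 0.0]}
best_match = retrieve_most_similar([0.9, 0.1, 0.0], demo_features)

# Matched keypoints: the second set is the first rotated by 90 degrees
# and shifted by (2, 3).
demo_keypoints = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
live_keypoints = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]
theta, translation = estimate_rigid_2d(demo_keypoints, live_keypoints)
```

In this sketch the retrieval step selects which demonstration to reuse, and the estimated transform indicates how the viewpoint would need to change so that the novel object appears as the retrieved object did during the demonstration. A real system would work with camera images, depth, and 3D end-effector poses rather than 2D points.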