DROID: Learning from Offline Heterogeneous Demonstrations via Reward-Policy Distillation

Sravan Jayanthi, Letian Chen, Nadya Balabanska, Van Duong, Erik Scarlatescu, Ezra Ameperosa, Zulfiqar Zaidi, Daniel Martin, Taylor Del Matto, Masahiro Ono, Matthew Gombolay

Conference on Robot Learning, Vol. 229 (2023)

Abstract

Offline Learning from Demonstrations (OLfD) is valuable in domains where trial-and-error learning is infeasible or specifying a cost function is difficult, such as robotic surgery, autonomous driving, and path-finding for NASA's Mars rovers. However, two key challenges remain in OLfD: 1) heterogeneity: demonstration data can be generated with diverse preferences and strategies, and 2) generalizability: the learned policy and reward must perform well in unseen test settings beyond the limited training regime. To overcome these challenges, we propose Dual Reward and policy Offline Inverse Distillation (DROID), which leverages diversity to improve generalization performance by decomposing common-task and individual-specific strategies and distilling knowledge in both the reward and policy spaces. We ground DROID in a novel and uniquely challenging path-planning problem for NASA's Mars Curiosity Rover. We curate a novel dataset spanning 154 Sols (Martian days) and conduct a novel empirical investigation to characterize heterogeneity in the dataset. We find that DROID outperforms prior SOTA OLfD techniques, yielding a 21% improvement in modeling expert behaviors and coming 90% closer to the task objective of reaching the final destination. We also benchmark DROID on the OpenAI Gym Cartpole and Lunar Lander environments and find that DROID achieves 23% (significantly) better performance in modeling unseen holdout heterogeneous demonstrations.

Key words

Learning from Heterogeneous Demonstration, Network Distillation, Offline Imitation Learning
