Imitation Learning via Off-Policy Distribution Matching

ICLR, 2020.


Abstract:

When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one alternates between estimating distribution ratios and then using these ratios as rewards in a standard reinforcement learning (RL) algorithm. Traditionally, estimation of the distribution ratio requires on-policy data, ...

Highlights
  • Reinforcement learning (RL) is typically framed as learning a behavior policy based on reward feedback from trial-and-error experience
  • We introduce an algorithm for imitation learning that, on the one hand, performs divergence minimization as in the original Adversarial Imitation Learning (AIL) methods and, on the other hand, is completely off-policy
  • In addition to being simpler than standard imitation learning methods, we show that our proposed algorithm is able to achieve state-of-the-art performance on a suite of imitation learning benchmarks
  • In Figure 2 we present the results of the imitation learning algorithms given only a single expert trajectory
  • We introduced ValueDICE, an algorithm for imitation learning that outperforms the state-of-the-art on standard MuJoCo tasks
  • We demonstrate the robustness of ValueDICE in a challenging synthetic tabular Markov Decision Process environment, as well as on standard MuJoCo continuous control benchmark environments, and we show increased performance over baselines in both the low and high data regimes
Summary
  • Reinforcement learning (RL) is typically framed as learning a behavior policy based on reward feedback from trial-and-error experience.
  • This realization is at the heart of imitation learning (Ho & Ermon, 2016; Ng et al.; Pomerleau, 1989), in which one aims to learn a behavior policy from a set of expert demonstrations – logged experience data of a near-optimal policy interacting with the environment – without explicit knowledge of rewards.
  • In the distribution matching approach, one estimates the density ratio of states and actions between the target distribution and the behavior policy.
  • This way, an imitating behavior policy may be learned to minimize the divergence without the use of explicit rewards.
  • If one were to take a GAIL-like approach, one could use this form of the KL to estimate distribution-matching rewards given by r(s, a) = −x∗(s, a), which could then be maximized by any standard RL algorithm.
  • A change of variables allows us to express the KL-divergence between dπ and dexp as an objective over a ‘value function’ ν that is written in an off-policy manner, with expectations only over expert demonstrations dexp and the initial state distribution p0(·) (a minimal code sketch of this objective appears after this list).
  • In addition to estimating a proper divergence between dπ and dexp in an off-policy manner, ValueDICE greatly simplifies the implementation of distribution matching algorithms.
  • In order to make use of the ValueDICE objective (Equation 13) in practical scenarios, where one does not have access to dexp or p0(·) but rather only limited finite samples, we perform several modifications.
  • The original ValueDICE objective uses only expert samples and the initial state distribution.
  • In order to increase the diversity of samples used for training, we consider an alternative objective with a controllable regularization based on experience in the replay buffer: J^mix_DICE(π, ν) := log E_{(s,a)∼dmix}[e^(ν(s,a) − Bπν(s,a))] − (1 − α)(1 − γ) · E_{s0∼p0(·), a0∼π(·|s0)}[ν(s0, a0)], where dmix is a mixture of the expert distribution dexp and the replay-buffer distribution with mixing weight α.
  • We introduce a controllable parameter α for incorporating samples from the replay buffer into the data distribution objective; in practice we use a very small α = 0.1.
  • By using samples from the replay buffer in both terms of the objective as opposed to just one, the global optimality of the expert policy is not affected.
  • ValueDICE is able to learn a policy on all states to best match the observed expert state-action occupancies.
  • We compare ValueDICE against Discriminator-Actor-Critic (DAC) (Kostrikov et al., 2019), which is the state-of-the-art in sample-efficient adversarial imitation learning, as well as GAIL (Ho & Ermon, 2016).
  • We demonstrate the robustness of ValueDICE in a challenging synthetic tabular MDP environment, as well as on standard MuJoCo continuous control benchmark environments, and we show increased performance over baselines in both the low and high data regimes
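
    To make the off-policy objective above concrete, the following is a minimal sketch (in PyTorch) of how the basic objective J_DICE(π, ν) := log E_{(s,a)∼dexp}[e^(ν(s,a) − Bπν(s,a))] − (1 − γ) · E_{s0∼p0(·), a0∼π(·|s0)}[ν(s0, a0)] can be estimated from a batch of expert transitions and a batch of initial states, without any on-policy rollouts. It is not the authors' reference implementation: the network sizes, the deterministic policy, the single-sample estimate of Bπν, and all names (ValueDICESketch, nu_value, j_dice) are illustrative assumptions.

    import torch
    import torch.nn as nn

    GAMMA = 0.99  # discount factor (assumed)

    def mlp(in_dim, out_dim):
        # Small MLP; the architecture is an illustrative choice.
        return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                             nn.Linear(256, 256), nn.ReLU(),
                             nn.Linear(256, out_dim))

    class ValueDICESketch(nn.Module):
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.nu = mlp(obs_dim + act_dim, 1)   # the 'value function' nu(s, a)
            self.policy = mlp(obs_dim, act_dim)   # deterministic policy, for brevity

        def nu_value(self, s, a):
            return self.nu(torch.cat([s, a], dim=-1)).squeeze(-1)

        def j_dice(self, expert_s, expert_a, expert_next_s, init_s):
            # B^pi nu(s, a) ~ gamma * nu(s', pi(s')), a single-sample estimate
            # that uses the next state stored in the expert transition.
            next_a = self.policy(expert_next_s)
            diff = (self.nu_value(expert_s, expert_a)
                    - GAMMA * self.nu_value(expert_next_s, next_a))
            # First term: log E_dexp[exp(nu - B^pi nu)], as a stable log-mean-exp.
            n = torch.tensor(float(diff.shape[0]))
            log_term = torch.logsumexp(diff, dim=0) - torch.log(n)
            # Second term: (1 - gamma) * E over initial states and policy actions.
            linear_term = (1.0 - GAMMA) * self.nu_value(init_s, self.policy(init_s)).mean()
            # nu takes gradient descent steps on this scalar; pi takes ascent steps.
            return log_term - linear_term

    In training, one would take descent steps on j_dice for the ν parameters and ascent steps for the policy parameters (e.g., with two optimizers on the same scalar), recovering the max over π / min over ν distribution-matching game without any separately learned reward.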
Related work
  • In recent years, the development of Adversarial Imitation Learning has been mostly focused on on-policy algorithms. After Ho & Ermon (2016) proposed GAIL to perform imitation learning via adversarial training, a number of extensions have been introduced. Many of these applications of the AIL framework (Li et al., 2017; Hausman et al., 2017; Fu et al., 2017) maintain the same form of distribution ratio estimation as GAIL, which necessitates on-policy samples. In contrast, our work presents an off-policy formulation of the same objective.

    Although several works have attempted to apply the AIL framework to off-policy settings, these previous approaches are markedly different from our own. For example, Kostrikov et al. (2019) proposed to train the discriminator in the GAN-like AIL objective using samples from a replay buffer instead of samples from the policy. This changes the distribution ratio estimation to measure a divergence between the expert and the replay-buffer distribution. Although we also introduce a controllable parameter α for incorporating samples from the replay buffer into the data distribution objective, in practice we use a very small α = 0.1. Furthermore, because samples from the replay buffer appear in both terms of the objective, as opposed to just one, the global optimality of the expert policy is not affected (see the sketch below).
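
    As a rough illustration of this point, the sketch below estimates the log term of the objective over the mixture dmix = (1 − α) dexp + α dRB instead of over dexp alone, and, under our reading of the mixed objective, also gives the linear term an α-weighted replay component, so that replay samples enter both terms. The helper names and the exact weighting are assumptions for illustration rather than the paper's reference code.

    import torch

    ALPHA = 0.1   # small replay weight, as noted above
    GAMMA = 0.99  # discount factor (assumed)

    def log_mean_exp(x):
        # Numerically stable log E[exp(x)] over a batch (1-D tensor).
        return torch.logsumexp(x, dim=0) - torch.log(torch.tensor(float(x.shape[0])))

    def mixed_objective(expert_diff, replay_diff, init_nu, alpha=ALPHA):
        # expert_diff / replay_diff: batches of nu(s, a) - B^pi nu(s, a) computed
        # on expert and replay-buffer transitions; init_nu: nu(s0, pi(s0)) on
        # initial states.
        # Log term over the mixture dmix = (1 - alpha) dexp + alpha dRB:
        #   log( (1 - alpha) E_dexp[exp(.)] + alpha E_dRB[exp(.)] )
        log_term = torch.logsumexp(torch.stack([
            torch.log(torch.tensor(1.0 - alpha)) + log_mean_exp(expert_diff),
            torch.log(torch.tensor(alpha)) + log_mean_exp(replay_diff),
        ]), dim=0)
        # Linear term: (1 - alpha)(1 - gamma) E_p0,pi[nu(s0, a0)], plus (under our
        # reading) an alpha-weighted replay component, so that replay samples
        # appear in both terms rather than only in the log term.
        linear_term = ((1.0 - alpha) * (1.0 - GAMMA) * init_nu.mean()
                       + alpha * replay_diff.mean())
        return log_term - linear_term

    With α = 0 this reduces to the original objective; with the small α = 0.1 used in practice, the expert policy remains the global optimum while training sees a more diverse set of state-action pairs.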
Contributions
  • Shows how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective
  • Introduces an algorithm for imitation learning that, on the one hand, performs divergence minimization as in the original AIL methods and, on the other hand, is completely off-policy