Variational Delayed Policy Optimization
CoRR (2024)
Abstract
In environments with delayed observation, state augmentation, which appends the actions taken within the delay window to the last observed state, is commonly adopted to recover the Markov property and enable reinforcement learning (RL). However, state-of-the-art (SOTA) RL techniques based on Temporal-Difference (TD) learning often suffer from learning inefficiency, because the augmented state space expands significantly with the delay. To improve learning efficiency without sacrificing performance, this work introduces a novel framework called Variational Delayed Policy Optimization (VDPO), which reformulates delayed RL as a variational inference problem. This problem is further modelled as a two-step iterative optimization, where the first step is TD learning in the delay-free environment with a small state space, and the second step is behaviour cloning, which can be addressed far more efficiently than TD learning. We not only provide a theoretical analysis of VDPO in terms of sample complexity and performance, but also empirically demonstrate that VDPO achieves performance consistent with SOTA methods while using approximately 50% fewer samples on the MuJoCo benchmark.
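To make the two-step structure concrete, below is a minimal sketch in a toy tabular setting, not the authors' implementation: step 1 runs TD learning (here plain Q-learning) in the small delay-free MDP to obtain a reference policy, and step 2 approximates behaviour cloning by fitting a policy over augmented states (last observed state plus the buffered actions) to imitate the reference policy's action in the simulated current state. All names (n_states, delay, ToyMDP-style tables, etc.) are illustrative assumptions; the paper uses function approximation and a proper behaviour-cloning loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical toy delay-free MDP: tabular, random dynamics ------------------
n_states, n_actions, delay, gamma = 5, 3, 2, 0.95
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> next-state dist
R = rng.normal(size=(n_states, n_actions))                        # reward table

# --- Step 1: TD learning (Q-learning) in the small, delay-free state space ------
Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(20000):
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next = rng.choice(n_states, p=P[s, a])
    Q[s, a] += 0.1 * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
reference_policy = Q.argmax(axis=1)  # delay-free "expert" policy

# --- Step 2: behaviour cloning onto the augmented (delayed) state ---------------
# Augmented state = (state observed `delay` steps ago, actions taken since then).
# Here the "cloning" is a tabular analogue: for each augmented state, the delayed
# policy copies the reference policy's majority action over sampled rollouts of
# the buffered actions, instead of minimising an imitation loss.
def augmented_index(s_old, actions):
    idx = s_old
    for a in actions:
        idx = idx * n_actions + a
    return idx

delayed_policy = np.zeros(n_states * n_actions**delay, dtype=int)
for s_old in range(n_states):
    for flat in range(n_actions**delay):
        actions = [(flat // n_actions**i) % n_actions for i in reversed(range(delay))]
        votes = np.zeros(n_actions)
        for _ in range(64):  # Monte-Carlo rollout of the buffered actions
            s_cur = s_old
            for a in actions:
                s_cur = rng.choice(n_states, p=P[s_cur, a])
            votes[reference_policy[s_cur]] += 1
        delayed_policy[augmented_index(s_old, actions)] = int(votes.argmax())

print("reference policy:", reference_policy)
print("delayed decision for s_old=0, buffered actions [1, 2]:",
      delayed_policy[augmented_index(0, [1, 2])])
```

The point of the sketch is the division of labour: the expensive TD step only ever sees the small delay-free state space, while the delayed policy over the exponentially larger augmented space is obtained by imitation, which is the cheaper of the two problems.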