Demonstration actor critic

Neurocomputing (2021)

Abstract
We study the problem of Reinforcement Learning from Demonstrations (RLfD), where the agent has access not only to reward signals from the environment but also to a set of expert demonstrations. Recent works borrow ingredients from imitation learning and use demonstration data for reward shaping. Despite their success, these methods update the policy on states seen in the demonstration data in the same way as on all other states in the state space, overlooking the direct supervision signal available on those states. To address this issue, we propose a novel RLfD objective function with a new shaping reward, whose optimization directly leverages the supervision signal on the demonstrated states. We propose a general framework for policy optimization of this objective, with convergence guarantees in the classic tabular setting. Building on this framework, we introduce approximations based on deep neural networks and derive a practical algorithm, Demonstration Actor Critic (DAC), for large continuous domains. Extensive experiments on a range of popular benchmark sparse-reward tasks show that our method yields significant performance gains over several strong off-the-shelf baselines.
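The abstract does not specify the shaping reward, so the following is a minimal, hypothetical Python sketch of the general pattern it describes: augmenting the environment reward with a bonus concentrated on demonstrated states inside a one-step actor-critic TD update. The kernel-based bonus, its parameters (scale, bandwidth), and the helper names (shaping_bonus, shaped_td_error) are illustrative assumptions, not the paper's actual DAC objective.

```python
import numpy as np

# Hypothetical illustration only: the paper's actual objective and shaping
# reward are not given in the abstract. This sketch assumes a kernel-based
# bonus that is large when the agent visits states near expert demonstrations.

def shaping_bonus(state, demo_states, scale=1.0, bandwidth=0.5):
    """Bonus that decays with distance to the nearest demonstrated state."""
    dists = np.linalg.norm(demo_states - state, axis=1)
    return scale * np.exp(-(dists.min() ** 2) / (2.0 * bandwidth ** 2))

def shaped_td_error(reward, state, next_state, value, demo_states, gamma=0.99):
    """One-step TD error where the environment reward is augmented by the
    demonstration bonus, as in reward-shaping approaches to RLfD."""
    shaped_reward = reward + shaping_bonus(next_state, demo_states)
    return shaped_reward + gamma * value(next_state) - value(state)

# Example: a transition ending near a demonstrated state receives extra credit.
demo_states = np.array([[0.0, 0.0], [1.0, 1.0]])
critic = lambda s: 0.0  # placeholder critic; a real agent would learn this
delta = shaped_td_error(reward=0.0,
                        state=np.array([0.5, 0.5]),
                        next_state=np.array([1.0, 1.0]),
                        value=critic, demo_states=demo_states)
print(round(float(delta), 3))  # bonus alone drives the TD error here: 1.0
```

With a zero environment reward and a placeholder critic, the entire TD error comes from the shaping bonus, which is the sparse-reward situation the abstract targets; in the paper's approach this supervision on demonstrated states is built into the objective itself rather than bolted on as above.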
Keywords
Reinforcement learning, Expert demonstration, Reward shaping