Policy Optimization With Penalized Point Probability Distance: An Alternative To Proximal Policy Optimization.

arXiv: Learning (2018)

Citations: 24 | Views: 587
Abstract
As the most influential variant and improvement of Trust Region Policy Optimization (TRPO), proximal policy optimization (PPO) has been widely applied across various domains since its publication, owing to its advantages in sample efficiency, ease of implementation, and parallelism. In this paper, a first-order gradient reinforcement learning algorithm called policy optimization with penalized point probability distance (POP3D) is proposed as another variant of TRPO, and the point probability distance is proven to be a lower bound of the total variation divergence while offering an inherent advantage in exploration. The paper is organized as follows. First, we discuss the weaknesses of several commonly used algorithms, which motivate our method. Second, we propose our algorithm to overcome these shortcomings. Then we further explain PPO's improvement mechanism over TRPO from the perspective of the solution manifold. Finally, we make quantitative comparisons among several state-of-the-art algorithms on OpenAI Atari and MuJoCo environments, where a baseline is specifically designed to serve as an ablation for our improvement. While retaining almost all of PPO's beneficial properties, POP3D encourages more exploration and avoids the shortcoming of the Kullback-Leibler divergence penalty, which is prone to instability. Simulation results show that POP3D is highly competitive with PPO: it reaches state-of-the-art results within 40 million frames on 49 Atari games and achieves competitive scores in continuous-control domains, according to the common metrics of final performance and learning speed.
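The abstract does not reproduce the objective itself, so the following PyTorch-style sketch only illustrates one plausible form of a surrogate loss penalized by a point probability distance; the squared difference of sampled-action probabilities, the penalty coefficient `beta`, and the function name are illustrative assumptions rather than the paper's exact formulation.

```python
import torch


def pop3d_style_loss(new_probs, old_probs, advantages, beta=10.0):
    """Illustrative surrogate loss with a point-probability penalty.

    new_probs / old_probs: probabilities the current and behaviour policies
    assign to the sampled actions, shape (batch,).
    advantages: estimated advantages, shape (batch,).
    beta: penalty coefficient (hypothetical value, not from the abstract).
    """
    old_probs = old_probs.detach()
    # Importance-sampling surrogate, as in TRPO/PPO-style objectives.
    ratio = new_probs / old_probs
    surrogate = ratio * advantages
    # Assumed point probability distance: squared difference of the
    # probabilities assigned to the sampled action by the two policies.
    point_penalty = (new_probs - old_probs).pow(2)
    # Negate because optimizers minimize; we want to maximize the surrogate.
    return -(surrogate - beta * point_penalty).mean()
```

In contrast to a KL penalty, a distance of this form is bounded on the sampled action's probability, which is consistent with the abstract's claim that POP3D avoids the instability associated with the Kullback-Leibler term.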