# On the Global Convergence Rates of Softmax Policy Gradient Methods

ICML 2020, pp. 6820–6829.

## Abstract

We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at an $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptot…
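In standard notation (a sketch only; $\pi^*$ denotes an optimal policy, $\rho$ a start-state distribution, and the exact constant is not reproduced here), the softmax parametrization and the claimed rate read:

$$
\pi_\theta(a \mid s) = \frac{\exp\{\theta(s,a)\}}{\sum_{a'} \exp\{\theta(s,a')\}},
\qquad
V^{\pi^*}(\rho) - V^{\pi_{\theta_t}}(\rho) \le \frac{C}{t},
$$

where $C$ depends on the MDP and on the initialization $\theta_1$.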

## Introduction

- The policy gradient is one of the most foundational concepts in Reinforcement Learning (RL), lying at the core of policy-search and actor-critic methods.
- The policy gradient theorem (Sutton et al., 2000), in particular, establishes a general foundation for policy search methods by showing that an unbiased estimate of the gradient of a policy's expected return with respect to its parameters can still be recovered from an approximate value function (formalized in the display after this list).
- Important new progress in understanding the convergence behavior of policy gradient has been achieved in the tabular setting.
- Agarwal et al. (2019) made further progress by showing that (1) without parametrization, projected gradient ascent converges to a global optimum at rate O(1/√t); and (2) with the softmax parametrization, policy gradient converges asymptotically. Agarwal et al. (2019) also analyze other variants of policy gradient, showing that policy gradient with relative entropy converges at rate O(1/√t), natural policy gradient converges at rate O(1/t), and, given a "compatible" function approximation, natural policy gradient converges at rate O(1/√t).
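The policy gradient theorem referenced above takes the following standard form (shown here up to the normalization of the discounted state visitation distribution $d^{\pi_\theta}_{\mu}$, which varies by convention):

$$
\nabla_\theta V^{\pi_\theta}(\mu)
\;\propto\;
\mathbb{E}_{s \sim d^{\pi_\theta}_{\mu}}\,
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
\Big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \Big],
$$

and $Q^{\pi_\theta}$ may be replaced by a suitable (compatible) approximate value function without biasing the gradient estimate.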

## Highlights

- The policy gradient is one of the most foundational concepts in Reinforcement Learning (RL), lying at the core of policy-search and actor-critic methods
- Little has been known about the global convergence behavior of policy gradient ascent
- We prove that with the true gradient, policy gradient methods with a softmax parametrization converge to the optimal policy at an O(1/t) rate, with constants depending on the problem and initialization
- With a few other properties we describe, it can be shown that softmax policy gradient ascent achieves an O(1/t) convergence rate
- In this paper we focus on the policy gradient method that uses the softmax parametrization (a minimal numerical sketch follows this list)
- We show matching bounds O(1/t) and Ω(1/t) for the tabular setting of softmax policy gradient methods, which is a faster rate than those obtained for closely related policy gradient methods in previous work
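To make the object of study concrete, here is a minimal numerical sketch (not the authors' code) of exact softmax policy gradient ascent on a 3-armed bandit, the simplest tabular case; the reward vector and step size are illustrative assumptions:

```python
import numpy as np

# Exact softmax policy gradient ascent on a one-state MDP (a bandit).
# The reward vector r and step size eta are illustrative assumptions.
r = np.array([1.0, 0.9, 0.1])    # true mean rewards; arm 0 is optimal
theta = np.zeros(3)              # logits; uniform initial policy
eta = 0.4                        # constant step size

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

for t in range(10000):
    pi = softmax(theta)
    # Gradient of the expected reward pi^T r w.r.t. the logits:
    #   d(pi^T r)/d(theta_a) = pi_a * (r_a - pi^T r)
    grad = pi * (r - pi @ r)
    theta += eta * grad          # ascent with the true (exact) gradient

print(softmax(theta))                 # mass concentrates on the optimal arm
print(r.max() - softmax(theta) @ r)   # sub-optimality gap, decaying roughly as 1/t
```

Tracking the printed gap over iterations shows the slow sub-linear decay that is consistent with the matching Ω(1/t) lower bound stated above.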

## Conclusion

**Conclusions and Future Work**

The authors show matching bounds O(1/t) and Ω(1/t) for the tabular setting of softmax policy gradient methods, a faster rate than those obtained for closely related policy gradient methods in previous work. It may also be interesting to find new uses for non-uniform Łojasiewicz inequalities in non-convex optimization, and for the notion of Łojasiewicz degree (a schematic version of the rate argument is sketched below).
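To illustrate how a Łojasiewicz-type inequality produces an O(1/t) rate, here is the generic smooth-optimization argument (a schematic sketch assuming $L$-smoothness and a uniform constant $c$; the paper's inequality is non-uniform, with the constant depending on the probability the current policy places on an optimal action, which is why the final constants depend on the problem and the initialization). Gradient ascent with step size $1/L$ gives

$$
f(\theta_{t+1}) \;\ge\; f(\theta_t) + \tfrac{1}{2L}\,\|\nabla f(\theta_t)\|_2^2,
\qquad
\|\nabla f(\theta_t)\|_2 \;\ge\; c\,\bigl(f^\ast - f(\theta_t)\bigr),
$$

so that, writing $\delta_t := f^\ast - f(\theta_t)$,

$$
\delta_{t+1} \;\le\; \delta_t - \tfrac{c^2}{2L}\,\delta_t^2
\quad\Longrightarrow\quad
\delta_t \;\le\; \frac{2L}{c^2\,(t-1)} = O(1/t) \quad (t \ge 2).
$$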

## References

- Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.
- Ahmed, Z., Le Roux, N., Norouzi, M., and Schuurmans, D. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pp. 151–160, 2019.
- Barta, T. Rate of convergence to equilibrium and Łojasiewicz-type estimates. Journal of Dynamics and Differential Equations, 29(4):1553–1568, 2017.
- Bhandari, J. and Russo, D. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
- Golub, G. H. Some modified matrix eigenvalue problems. SIAM Review, 15(2):318–334, 1973.
- Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870, 2018.
- Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pp. 267–274, 2002.
- Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.
- Łojasiewicz, S. Une propriété topologique des sous-ensembles analytiques réels. Les Équations aux Dérivées Partielles, 117:87–89, 1963.
- Mei, J., Xiao, C., Huang, R., Schuurmans, D., and Muller, M. On principled entropy exploration in policy optimization. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3130–3136. AAAI Press, 2019.
- Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2775–2785, 2017.
- Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
- O’Donoghue, B., Osband, I., and Ionescu, C. Making sense of reinforcement learning and probabilistic inference. arXiv preprint arXiv:2001.00805, 2020.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395, 2014.
- Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.
- Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Xiao, C., Huang, R., Mei, J., Schuurmans, D., and Muller, M. Maximum entropy monte-carlo planning. In Advances in Neural Information Processing Systems, pp. 9516–9524, 2019.
- Zhou, Y., Wang, Z., and Liang, Y. Convergence of cubic regularization for nonconvex optimization under KL property. In Advances in Neural Information Processing Systems, pp. 3760–3769, 2018.
- Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928– 1937, 2016.
