On the Global Convergence Rates of Softmax Policy Gradient Methods

ICML, pp. 6820–6829, 2020


Abstract

We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at an $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptot…
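
A minimal numerical sketch of the setting the abstract refers to, restricted to the simplest (single-state, bandit) case: exact-gradient ascent on the expected reward of a softmax policy. The reward vector r and step size eta below are illustrative assumptions, not values taken from the paper.

    import numpy as np

    def softmax(theta):
        # Numerically stable softmax over the action logits.
        z = np.exp(theta - theta.max())
        return z / z.sum()

    # Illustrative 3-armed bandit (r and eta are assumptions for this sketch).
    r = np.array([1.0, 0.9, 0.1])
    theta = np.zeros(3)   # uniform initial policy
    eta = 0.4

    for t in range(1, 10001):
        pi = softmax(theta)
        # Exact gradient of the expected reward pi^T r with respect to theta:
        # (diag(pi) - pi pi^T) r = pi * (r - pi^T r); no sampling involved.
        theta += eta * pi * (r - pi @ r)
        if t in (10, 100, 1000, 10000):
            print(t, r.max() - softmax(theta) @ r)  # sub-optimality gap

Printing the sub-optimality gap at t = 10, 100, 1000, 10000 shows it shrinking roughly in proportion to 1/t, which is the behaviour the O(1/t) upper bound and Ω(1/t) lower bound describe.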

Introduction
  • The policy gradient is one of the most foundational concepts in Reinforcement Learning (RL), lying at the core of policy-search and actor-critic methods.
  • The policy gradient theorem (Sutton et al, 2000), in particular, establishes a general foundation for policy search methods, by showing that an unbiased estimate of the gradient of a policy’s expected return with respect to its parameters can still be recovered from an approximate value function (one standard form of the theorem is shown after this list).
  • Important new progress in understanding the convergence behavior of policy gradient has been achieved in the tabular setting.
  • Agarwal et al (2019) contributed further progress by showing that (1) without parametrization, projected gradient ascent converges at rate O(1/√t) to a global optimum; and (2) with softmax parametrization, policy gradient converges asymptotically. Agarwal et al (2019) also analyze other variants of policy gradient, showing that policy gradient with relative entropy regularization converges at rate O(1/√t), natural policy gradient converges at rate O(1/t), and, given a “compatible” function approximation, natural policy gradient converges at rate O(1/√t)
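
For reference, one standard form of the policy gradient theorem mentioned in the second bullet above is, for a discounted MDP with start state $s_0$ (standard notation, not a quotation from the paper),

    $\nabla_\theta V^{\pi_\theta}(s_0) \;=\; \sum_{s} d^{\pi_\theta}_{s_0}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a), \qquad d^{\pi_\theta}_{s_0}(s) \;=\; \sum_{t \ge 0} \gamma^{t}\, \Pr(s_t = s \mid s_0, \pi_\theta),$

and, writing $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$, the gradient can be estimated from sampled states and actions together with an unbiased return estimate of $Q^{\pi_\theta}(s,a)$ (or a compatible function approximation of it), without differentiating through the state distribution $d^{\pi_\theta}_{s_0}$.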
Highlights
  • The policy gradient is one of the most foundational concepts in Reinforcement Learning (RL), lying at the core of policy-search and actor-critic methods
  • Little has been known about the global convergence behavior of policy gradient ascent
  • We prove that with the true gradient, policy gradient methods with a softmax parametrization converge to the optimal policy at an O(1/t) rate, with constants depending on the problem and initialization
  • Combined with a few other properties we describe, it can be shown that softmax policy gradient ascent achieves an O(1/t) convergence rate
  • In this paper we focus on the policy gradient method that uses the softmax parametrization (the parametrization and update are written out after this list)
  • We show matching bounds O(1/t) and Ω(1/t) for the tabular setting of softmax policy gradient methods, which is faster than the rates obtained for closely related policy gradient methods in previous work
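
Concretely, in the tabular setting the method in question maintains one logit θ(s, a) per state-action pair and performs gradient ascent on the value function using the true gradient (a summary in standard notation, not a quotation of the paper; η denotes the step size):

    $\pi_\theta(a \mid s) \;=\; \frac{\exp(\theta(s, a))}{\sum_{a'} \exp(\theta(s, a'))}, \qquad \theta_{t+1} \;=\; \theta_t + \eta\, \nabla_\theta V^{\pi_{\theta_t}}(\mu).$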
Conclusion
  • Conclusions and Future Work

    The authors show matching bounds O(1/t) and Ω(1/t) for the tabular setting of softmax policy gradient methods, which is faster than the rates obtained for closely related policy gradient methods in previous work.
  • It may be interesting to find new uses for non-uniform Łojasiewicz inequalities in non-convex optimization and for the notion of Łojasiewicz degree
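
For orientation, a Łojasiewicz-type (gradient-dominance) inequality for a maximization objective f with maximum f* has the generic form

    $\|\nabla f(\theta)\|_2 \;\ge\; C(\theta)\, \big(f^{*} - f(\theta)\big)^{1-\xi},$

where the inequality is called non-uniform when the coefficient C(θ) genuinely depends on θ (for instance, on how much probability the current policy places on an optimal action), and the exponent plays the role of a Łojasiewicz degree. This display is a generic illustration of the terminology, not a statement of the paper's lemmas.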
References
  • Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.
  • Ahmed, Z., Le Roux, N., Norouzi, M., and Schuurmans, D. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pp. 151–160, 2019.
  • Barta, T. Rate of convergence to equilibrium and Łojasiewicz-type estimates. Journal of Dynamics and Differential Equations, 29(4):1553–1568, 2017.
  • Bhandari, J. and Russo, D. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
  • Golub, G. H. Some modified matrix eigenvalue problems. SIAM Review, 15(2):318–334, 1973.
  • Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870, 2018.
  • Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pp. 267–274, 2002.
  • Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.
  • Łojasiewicz, S. Une propriété topologique des sous-ensembles analytiques réels. Les Équations aux Dérivées Partielles, 117:87–89, 1963.
  • Mei, J., Xiao, C., Huang, R., Schuurmans, D., and Muller, M. On principled entropy exploration in policy optimization. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3130–3136. AAAI Press, 2019.
  • Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2775–2785, 2017.
  • Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
  • O’Donoghue, B., Osband, I., and Ionescu, C. Making sense of reinforcement learning and probabilistic inference. arXiv preprint arXiv:2001.00805, 2020.
  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395, 2014.
  • Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.
  • Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Xiao, C., Huang, R., Mei, J., Schuurmans, D., and Muller, M. Maximum entropy Monte-Carlo planning. In Advances in Neural Information Processing Systems, pp. 9516–9524, 2019.
  • Zhou, Y., Wang, Z., and Liang, Y. Convergence of cubic regularization for nonconvex optimization under KL property. In Advances in Neural Information Processing Systems, pp. 3760–3769, 2018.
  • Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937, 2016.
  • Proof. See Agarwal et al. (2019, Lemma C.1). Our proof is for completeness. According to Theorem 1, …
  • Proof. Consider the following example: r = (1, 9/10, 1/10), θ1 = (0, 0, 0), πθ1 = softmax(θ1) = (1/3, 1/3, 1/3), θ2 = (ln 9, ln 16, ln 25), and πθ2 = softmax(θ2) = (9/50, 16/50, 25/50). We have πθ1ᵀr = 2/3 and πθ2ᵀr = 259/500, so (πθ1ᵀr + πθ2ᵀr)/2 = 1777/3000, whereas the midpoint parameter (θ1 + θ2)/2 = (ln 3, ln 4, ln 5) gives the policy (3/12, 4/12, 5/12) with expected reward 71/120 < 1777/3000, so the map θ ↦ πθᵀr is not concave.
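
The arithmetic in the example above can be checked numerically; the values of r, θ1, and θ2 below are exactly those given in the proof, and NumPy is used only for convenience:

    import numpy as np

    def softmax(theta):
        z = np.exp(theta - theta.max())
        return z / z.sum()

    r = np.array([1.0, 9/10, 1/10])
    theta1 = np.zeros(3)
    theta2 = np.log([9.0, 16.0, 25.0])

    v1 = softmax(theta1) @ r                      # 2/3
    v2 = softmax(theta2) @ r                      # 259/500
    v_mid = softmax((theta1 + theta2) / 2) @ r    # 71/120

    # Concavity of theta -> pi_theta^T r would require v_mid >= (v1 + v2)/2;
    # here 71/120 ≈ 0.5917 < 1777/3000 ≈ 0.5923, so it fails.
    print(v_mid, (v1 + v2) / 2)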