# Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

NeurIPS 2020

Abstract

We study sequential decision-making problems in which each agent aims to maximize the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon Constrained Markov Decision Processes (CMDPs) problem. Specifically, we propose a new…

Introduction

- Reinforcement learning (RL) studies sequential decision-making problems where the agent aims to maximize its expected total reward by interacting with an unknown environment over time [44].

Highlights

- We provide a theoretical foundation for the non-asymptotic global convergence of the natural policy gradient (NPG) method in solving constrained Markov decision processes (CMDPs) and answer the following questions: (i) can we employ NPG methods for solving CMDPs?; (ii) if so, how fast do these methods converge to the globally optimal value within the underlying constraints?; (iii) what is the effect of the function approximation error caused by a restricted policy parametrization?; and (iv) what is the sample complexity of NPG methods?
- We employ natural policy gradient ascent to update the primal variable and projected sub-gradient descent to update the dual variable; (ii) even though the maximization problem has a nonconcave objective function and a nonconvex constraint set under the softmax policy parametrization, we prove that our Natural Policy Gradient Primal-Dual (NPG-PD) method achieves global convergence with rate O(1/√T) for both the optimality gap and the constraint violation, where T is the total number of iterations
- Our convergence guarantees are dimension-free, i.e., the rate is independent of the size of the state-action space; (iii) for the general smooth policy class, we establish convergence with rate O(1/√T) for the optimality gap and O(1/T^(1/4)) for the constraint violation, up to a function approximation error caused by the restricted policy parametrization; and (iv) we show that the two sample-based NPG-PD algorithms that we propose inherit these non-asymptotic convergence properties, and we provide finite-sample complexity guarantees
- We have proposed an NPG-PD method for CMDPs with the primal natural policy gradient ascent and the dual projected sub-gradient descent
- We have proposed two associated sample-based NPG-PD algorithms and established their finite-sample complexity guarantees
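The primal-dual scheme summarized above can be illustrated on a small tabular CMDP with softmax parametrization. The sketch below is an assumption-laden toy, not the paper's exact algorithm: the step sizes, the exact policy evaluation via a linear solve, and the use of the unprojected dual cone [0, ∞) are illustrative choices, and the reduction of the NPG step to an advantage update relies on the softmax parametrization.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax over actions; theta has shape (S, A)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(pi, P, r, gamma):
    """Exact policy evaluation: V = (I - gamma * P_pi)^(-1) r_pi."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sap->sp', pi, P)   # state transition under pi
    r_pi = (pi * r).sum(axis=1)             # expected one-step reward
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V
    return V, Q

def npg_pd(P, r, g, b, rho, gamma=0.9, eta1=1.0, eta2=0.1, T=200):
    """NPG-PD sketch: natural policy gradient ascent on the primal
    (softmax) variable, projected sub-gradient descent on the dual."""
    S, A = P.shape[0], P.shape[1]
    theta, lam = np.zeros((S, A)), 0.0
    for _ in range(T):
        pi = softmax_policy(theta)
        Vr, Qr = evaluate(pi, P, r, gamma)
        Vg, Qg = evaluate(pi, P, g, gamma)
        # Under softmax parametrization, the NPG step on the Lagrangian
        # reduces to an advantage update: theta += eta1/(1-gamma) * A_L.
        Q_L = Qr + lam * Qg
        A_L = Q_L - (pi * Q_L).sum(axis=1, keepdims=True)
        theta = theta + eta1 / (1.0 - gamma) * A_L
        # Dual update: project the sub-gradient step back onto [0, inf).
        lam = max(0.0, lam - eta2 * (rho @ Vg - b))
    pi = softmax_policy(theta)
    Vr, _ = evaluate(pi, P, r, gamma)
    Vg, _ = evaluate(pi, P, g, gamma)
    return pi, rho @ Vr, rho @ Vg
```

On a toy problem where the utility constraint is slack, the dual variable stays at zero and the iterates reduce to unconstrained natural policy gradient ascent on the reward.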

Results

- Theorem 1 shows that, on average, the reward value function converges to the globally optimal one and the constraint violation decays to zero.
- The authors establish that the constraint violation enjoys the same rate as the optimality gap by exploiting the strong duality in Lemma 2, even though problem (3) is nonconcave.
- If the minimizer has zero compatible function approximation error, the authors have already established the global convergence in Theorem 1 for the softmax parametrization.
- To verify the convergence theory, the authors provide computational results by simulating the algorithm (8) and its sample-based version (Algorithm 2) for a finite CMDP with random initializations.
- The authors have proposed an NPG-PD method for CMDPs with the primal natural policy gradient ascent and the dual projected sub-gradient descent.
- Even though the underlying maximization problem has a nonconcave objective function and a nonconvex constraint set, the authors provide a systematic study of the non-asymptotic convergence properties of this method with either the softmax parametrization or the general parametrization.
- The authors' work is the first to offer non-asymptotic convergence guarantees of policy-based primal-dual methods for solving infinite-horizon discounted CMDPs. A natural future direction is to investigate how to achieve a faster rate, e.g., O(1/T), for the NPG-PD method.
- The authors' research could be used to provide an algorithmic solution for practitioners to solve such constrained problems with non-asymptotic convergence and optimality guarantees.
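The method discussed in these results admits a compact primal-dual description. The notation below (initial state distribution ρ, dual projection interval [0, Λ], Fisher matrix F_ρ) is reconstructed from standard CMDP formulations and the abstract, so the details may differ from the paper's exact statement:

```latex
% Constrained MDP: maximize reward value subject to a utility constraint
\max_{\theta}\; V_r^{\pi_\theta}(\rho)
\quad \text{subject to} \quad V_g^{\pi_\theta}(\rho) \ge b.

% Lagrangian coupling the primal variable theta and the dual variable lambda
L(\theta, \lambda) \;=\; V_r^{\pi_\theta}(\rho) \;+\; \lambda\bigl( V_g^{\pi_\theta}(\rho) - b \bigr).

% NPG-PD: natural gradient ascent in theta, projected sub-gradient descent in lambda
\theta_{t+1} \;=\; \theta_t \;+\; \eta_1\, F_\rho(\theta_t)^{\dagger}\, \nabla_\theta L(\theta_t, \lambda_t),
\qquad
\lambda_{t+1} \;=\; \mathcal{P}_{[0,\Lambda]}\bigl( \lambda_t - \eta_2 \bigl( V_g^{\pi_{\theta_t}}(\rho) - b \bigr) \bigr),
```

where F_ρ(θ) is the Fisher information matrix of π_θ and the dagger denotes the Moore-Penrose pseudoinverse.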

Conclusion

- The authors' methodology offers RL researchers new insight into direct policy search methods for solving infinite-horizon discounted CMDPs. Decision-making processes that build on this research could enjoy the flexibility of adding practical constraints, benefiting a wide range of applications, e.g., autonomous systems, healthcare services, and financial and legal services.
- It is relevant to exploit the structure of particular CMDPs in order to provide improved convergence theory.


Related work

- Our work is related to Lagrangian-based CMDP algorithms [4, 12, 11, 15, 46, 23, 37, 36, 53]. However, convergence guarantees for these algorithms are either local (to stationary-point or locally optimal policies) [11, 15, 46] or asymptotic [12]. When function approximation is used for policy parametrization, [53] recognized the lack of convexity and showed asymptotic convergence (to a stationary point) of a method based on successive convex relaxations. In contrast, we establish global convergence in spite of the lack of convexity. References [36, 37] are closely related to our work. In [37], the authors provide a duality analysis for CMDPs in the policy space and propose a provably convergent dual descent algorithm by assuming access to a nonconvex optimization oracle. However, how to obtain the solution to this nonconvex optimization problem was not analyzed or understood, and the global convergence of their algorithm was not established. In [36], the authors provide a primal-dual algorithm but do not offer any theoretical justification. In spite of the lack of convexity, our work provides global convergence guarantees for a new primal-dual algorithm without using any optimization oracles. Other related policy optimization methods include CPG [47], CPO [2, 51], and IPO [27]. However, theoretical guarantees for these algorithms are still lacking. Recently, optimism principles have been used for efficient exploration in CMDPs [42, 59, 16, 38, 17, 6]. In comparison, our work focuses on the optimization landscape within a primal-dual framework.

Funding

- D. Ding and M. R. Jovanovic were supported by the National Science Foundation under Awards ECCS-1708906 and ECCS-1809833
- Basar was supported in part by the US Army Research Laboratory (ARL) Cooperative Agreement W911NF-17-2-0196, and in part by the Office of Naval Research (ONR) MURI Grant N00014-16-1-2710

References

- N. Abe, P. Melville, C. Pendus, C. K. Reddy, D. L. Jensen, V. P. Thomas, J. J. Bennett, G. F. Anderson, B. R. Cooley, M. Kowalczyk, et al. Optimizing debt collections using constrained reinforcement learning. In International Conference on Knowledge Discovery and Data Mining, pages 75–84, 2010.
- J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In International Conference on Machine Learning, volume 70, pages 22–31, 2017.
- A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. arXiv preprint arXiv:1908.00261, 2019.
- E. Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.
- F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013.
- Q. Bai, A. Gattami, and V. Aggarwal. Model-free algorithm and regret analysis for MDPs with peak constraints. arXiv preprint arXiv:2003.05555, 2020.
- A. Beck. First-order Methods in Optimization, volume 25. SIAM, 2017.
- D. P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic press, 2014.
- J. Bhandari and D. Russo. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
- J. Bhandari and D. Russo. A note on the linear convergence of policy gradient methods. arXiv preprint arXiv:2007.11120, 2020.
- S. Bhatnagar and K. Lakshmanan. An online actor–critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012.
- V. S. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.
- S. Cen, C. Cheng, Y. Chen, Y. Wei, and Y. Chi. Fast global convergence of natural policy gradient methods with entropy regularization. arXiv preprint arXiv:2007.06558, 2020.
- N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.
- D. Ding, X. Wei, Z. Yang, Z. Wang, and M. R. Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. arXiv preprint arXiv:2003.00534, 2020.
- Y. Efroni, S. Mannor, and M. Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.
- M. Fazel, R. Ge, S. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476, 2018.
- J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
- S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pages 267–274, 2002.
- S. M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.
- A. Koppel, K. Zhang, H. Zhu, and T. Basar. Projected stochastic primal-dual method for constrained online learning with kernels. IEEE Transactions on Signal Processing, 67(10):2528–2542, 2019.
- Q. Liang, F. Que, and E. Modiano. Accelerated primal-dual policy optimization for safe reinforcement learning. arXiv preprint arXiv:1802.06480, 2018.
- T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- T. Lin, C. Jin, and M. I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, 2019.
- B. Liu, Q. Cai, Z. Yang, and Z. Wang. Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems, pages 10564–10575, 2019.
- Y. Liu, J. Ding, and X. Liu. IPO: Interior-point policy optimization under constraints. AAAI Conference on Artificial Intelligence, 2020.
- Y. Liu, K. Zhang, T. Basar, and W. Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. In Advances in Neural Information Processing Systems, 2020.
- M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: online convex optimization with long term constraints. Journal of Machine Learning Research, 13(Sep):2503–2528, 2012.
- J. Mei, C. Xiao, C. Szepesvari, and D. Schuurmans. On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, 2020.
- V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
- H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanovic. On the linear convergence of random search for discrete-time LQR. IEEE Control Systems Letters, 5(3):989–994, 2020.
- H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. R. Jovanovic. Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem. IEEE Transactions on Automatic Control, 2019. submitted; also arXiv:1912.11899.
- M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn. Solving a class of nonconvex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14905–14916, 2019.
- M. Ono, M. Pavone, Y. Kuwata, and J. Balaram. Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39(4):555–571, 2015.
- S. Paternain, M. Calvo-Fullana, L. F. Chamon, and A. Ribeiro. Safe policies for reinforcement learning via primal-dual methods. arXiv preprint arXiv:1911.09101, 2019.
- S. Paternain, L. Chamon, M. Calvo-Fullana, and A. Ribeiro. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, pages 7553–7563, 2019.
- S. Qiu, X. Wei, Z. Yang, J. Ye, and Z. Wang. Upper confidence primal-dual optimization: Stochastically constrained Markov decision processes with adversarial losses and unknown transitions. arXiv preprint arXiv:2003.00660, 2020.
- J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- L. Shani, Y. Efroni, and S. Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In AAAI Conference on Artificial Intelligence, 2020.
- R. Singh, A. Gupta, and N. B. Shroff. Learning in Markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
- T. Spooner and R. Savani. A natural actor-critic algorithm with downside risk constraints. arXiv preprint arXiv:2007.04203, 2020.
- R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
- C. Tessler, D. J. Mankowitz, and S. Mannor. Reward constrained policy optimization. In International Conference on Learning Representations, 2019.
- E. Uchibe and K. Doya. Constrained reinforcement learning from intrinsic and extrinsic rewards. In International Conference on Development and Learning, pages 163–168, 2007.
- L. Wang, Q. Cai, Z. Yang, and Z. Wang. Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations, 2019.
- X. Wei, H. Yu, and M. J. Neely. Online primal-dual mirror descent under stochastic constraints. International Conference on Measurement and Modeling of Computer Systems, 2020.
- J. Yang, N. Kiyavash, and N. He. Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems. arXiv preprint arXiv:2002.09621, 2020.
- T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge. Projection-based constrained policy optimization. In International Conference on Learning Representations, 2020.
- H. Yu, M. Neely, and X. Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1428–1438, 2017.
- M. Yu, Z. Yang, M. Kolar, and Z. Wang. Convergent policy optimization for safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 3121–3133, 2019.
- J. Yuan and A. Lamperski. Online convex optimization for cumulative constraints. In Advances in Neural Information Processing Systems, pages 6137–6146, 2018.
- J. Zhang, A. Koppel, A. S. Bedi, C. Szepesvari, and M. Wang. Variational policy gradient method for reinforcement learning with general utilities. arXiv preprint arXiv:2007.02151, 2020.
- K. Zhang, A. Koppel, H. Zhu, and T. Basar. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 2019.
- K. Zhang, Z. Yang, and T. Basar. Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. In Advances in Neural Information Processing Systems, pages 11598–11610, 2019.
- X. Zhang, K. Zhang, E. Miehling, and T. Basar. Non-cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 9482–9493, 2019.
- L. Zheng and L. J. Ratliff. Constrained upper confidence reinforcement learning. In Conference on Learning for Dynamics and Control, 2020.
