
Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

NeurIPS 2020


Abstract

We study sequential decision-making problems in which each agent aims to maximize the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon constrained Markov decision process (CMDP) problem. Specifically, we propose a new...
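The abstract is truncated above. For orientation, the constrained problem it describes can be written schematically as follows; the value functions V_r^π and V_g^π, the initial distribution ρ, the utility threshold b, and the discount factor γ are assumed notation for this summary rather than symbols taken from the truncated text:

    \max_{\pi} \; V_r^{\pi}(\rho) = \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
    \quad \text{subject to} \quad
    V_g^{\pi}(\rho) = \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, g(s_t, a_t)\Big] \ge b,
    \qquad
    L(\pi, \lambda) = V_r^{\pi}(\rho) + \lambda \big( V_g^{\pi}(\rho) - b \big), \quad \lambda \ge 0.

A primal-dual method of this kind ascends in the policy on the Lagrangian L while adjusting the multiplier λ to enforce the constraint.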

Introduction
  • Reinforcement learning (RL) studies sequential decision-making problems where the agent aims to maximize its expected total reward by interacting with an unknown environment over time [44].
  • The authors' convergence guarantees are dimension-free, i.e., the rate is independent of the size of the state-action space; (iii) for the general smooth policy class, the authors establish convergence with rate O(1/√T) for the optimality gap and O(1/T^{1/4}) for the constraint violation, up to a function approximation error caused by the restricted policy parametrization; and (iv) the authors show that the two sample-based NPG-PD algorithms they propose inherit these non-asymptotic convergence properties and provide finite-sample complexity guarantees.
Highlights
  • Reinforcement learning (RL) studies sequential decision-making problems where the agent aims to maximize its expected total reward by interacting with an unknown environment over time [44]
  • We provide a theoretical foundation for the non-asymptotic global convergence of the natural policy gradient (NPG) method in solving constrained Markov decision processes (CMDPs) and answer the following questions: (i) can we employ NPG methods for solving CMDPs? (ii) if and how fast do these methods converge to the globally optimal value within the underlying constraints? (iii) what is the effect of the function approximation error caused by a restricted policy parametrization? and (iv) what is the sample complexity of NPG methods?
  • (i) We employ natural policy gradient ascent to update the primal variable and projected sub-gradient descent to update the dual variable (a minimal, illustrative sketch of these primal-dual updates is given after this list); (ii) even though the maximization problem has a nonconcave objective function and a nonconvex constraint set under the softmax policy parametrization, we prove that our Natural Policy Gradient Primal-Dual (NPG-PD) method achieves global convergence with rate O(1/√T) for both the optimality gap and the constraint violation, where T is the total number of iterations
  • Our convergence guarantees are dimension-free, i.e., the rate is independent of the size of the state-action space; (iii) for the general smooth policy class, we establish convergence with rate O(1/√T) for the optimality gap and O(1/T^{1/4}) for the constraint violation, up to a function approximation error caused by the restricted policy parametrization; and (iv) we show that the two sample-based NPG-PD algorithms we propose inherit these non-asymptotic convergence properties and provide finite-sample complexity guarantees
  • We have proposed an NPG-PD method for CMDPs that combines natural policy gradient ascent in the primal variable with projected sub-gradient descent in the dual variable
  • We have proposed two associated sample-based NPG-PD algorithms and established their finite-sample complexity guarantees
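To make the primal-dual structure concrete, the following is a minimal Python sketch of such an update loop for a small tabular CMDP with known dynamics and a softmax policy. All function and parameter names here (policy_evaluation, npg_pd, eta, eta_dual, lam_max), the step sizes, the advantage-based form of the natural gradient step, and the projection interval for the multiplier are illustrative assumptions; this is not the paper's Algorithm 1 or 2 and not its exact constants.

    import numpy as np

    def policy_evaluation(P, reward, pi, gamma):
        # Exact policy evaluation for a tabular MDP with known dynamics.
        # P: (S, A, S) transition tensor, reward: (S, A), pi: (S, A) stochastic policy.
        S, _ = reward.shape
        P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state kernel under pi
        r_pi = (pi * reward).sum(axis=1)             # expected one-step reward under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        Q = reward + gamma * (P @ V)                 # (S, A) state-action values
        return V, Q

    def npg_pd(P, r, g, b, rho, gamma, T=500, eta=1.0, eta_dual=0.1, lam_max=10.0):
        # Illustrative natural policy gradient primal-dual loop for a tabular CMDP.
        S, A = r.shape
        theta = np.zeros((S, A))                     # softmax policy parameters
        lam = 0.0                                    # dual variable (Lagrange multiplier)
        for _ in range(T):
            expl = np.exp(theta - theta.max(axis=1, keepdims=True))
            pi = expl / expl.sum(axis=1, keepdims=True)
            Vr, Qr = policy_evaluation(P, r, pi, gamma)
            Vg, Qg = policy_evaluation(P, g, pi, gamma)
            # Primal step: under the softmax parametrization, a natural gradient step
            # reduces to an advantage update on the Lagrangian reward r + lam * g.
            adv = (Qr + lam * Qg) - (Vr + lam * Vg)[:, None]
            theta += eta / (1.0 - gamma) * adv
            # Dual step: projected sub-gradient descent on the multiplier.
            lam = float(np.clip(lam - eta_dual * (rho @ Vg - b), 0.0, lam_max))
        return pi, lam

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        S, A = 5, 3
        P = rng.dirichlet(np.ones(S), size=(S, A))   # random (S, A, S) transition kernel
        r, g = rng.random((S, A)), rng.random((S, A))
        pi, lam = npg_pd(P, r, g, b=3.0, rho=np.ones(S) / S, gamma=0.9)
        print("final multiplier:", lam)

Qualitatively, the multiplier stays near zero while the constraint is slack and grows when it is violated, which is the mechanism behind the constraint-violation guarantees summarized above.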
Results
  • What the authors capture in Theorem 1 is that, on average, the reward value function converges to the globally optimal one and the constraint violation decays to zero (a schematic statement of these average-iterate rates is given after this list).
  • The authors establish that the constraint violation enjoys the same rate as the optimality gap by invoking the strong duality in Lemma 2, even though problem (3) is nonconcave.
  • If the minimizer has zero compatible function approximation error, the authors have already established the global convergence in Theorem 1 for the softmax parametrization.
  • To verify the convergence theory, the authors provide computational results by simulating the algorithm (8) and its sample-based version (Algorithm 2) for a finite CMDP with random initializations.
  • The authors have proposed an NPG-PD method for CMDPs that combines natural policy gradient ascent in the primal variable with projected sub-gradient descent in the dual variable.
  • Even though the underlying maximization problem has a nonconcave objective function and a nonconvex constraint set, the authors provide a systematic study of the non-asymptotic convergence properties of this method with either the softmax parametrization or the general parametrization.
  • The authors' work is the first to offer non-asymptotic convergence guarantees for policy-based primal-dual methods that solve infinite-horizon discounted CMDPs; a natural future direction is to investigate whether a faster rate, e.g., O(1/T), can be achieved for the NPG-PD method.
  • The authors' research could be used to provide an algorithmic solution for practitioners to solve such constrained problems with non-asymptotic convergence and optimality guarantees.
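For reference, the average-iterate quantities behind Theorem 1 can be stated schematically as follows, where π* denotes an optimal feasible policy, π_t the iterates, and [x]_+ = max{x, 0}; constants and problem-dependent factors are omitted, so this is a summary-level restatement rather than the theorem's exact form:

    \underbrace{V_r^{\pi^\star}(\rho) - \frac{1}{T} \sum_{t=0}^{T-1} V_r^{\pi_t}(\rho)}_{\text{optimality gap}} \le \mathcal{O}\!\Big(\frac{1}{\sqrt{T}}\Big),
    \qquad
    \underbrace{\Big[\, b - \frac{1}{T} \sum_{t=0}^{T-1} V_g^{\pi_t}(\rho) \Big]_{+}}_{\text{constraint violation}} \le \mathcal{O}\!\Big(\frac{1}{\sqrt{T}}\Big)
    \quad \text{(softmax parametrization)}.

Under the general smooth policy class, the Highlights above state the corresponding bounds as O(1/√T) for the optimality gap and O(1/T^{1/4}) for the constraint violation, each up to the function approximation error of the restricted parametrization.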
Conclusion
  • The authors' methodology could offer new insights to RL researchers on direct policy search methods for solving infinite-horizon discounted CMDPs. Decision-making processes built on this research could enjoy the flexibility of adding practical constraints, which would benefit a wide range of applications, e.g., autonomous systems, healthcare services, and financial and legal services.
  • It would also be relevant to exploit the structure of particular CMDPs in order to obtain an improved convergence theory
Related work
  • Our work is related to Lagrangian-based CMDP algorithms [4, 12, 11, 15, 46, 23, 37, 36, 53]. However, convergence guarantees for these algorithms are either local (to stationary-point or locally optimal policies) [11, 15, 46] or asymptotic [12]. When function approximation is used for policy parametrization, [53] recognized the lack of convexity and showed asymptotic convergence (to a stationary point) of a method based on successive convex relaxations. In contrast, we establish global convergence in spite of the lack of convexity. References [36, 37] are closely related to our work. In [37], the authors provide a duality analysis for CMDPs in the policy space and propose a provably convergent dual descent algorithm by assuming access to a nonconvex optimization oracle. However, how to obtain the solution to this nonconvex optimization problem was not analyzed, and the global convergence of their algorithm was not established. In [36], the authors provide a primal-dual algorithm but do not offer any theoretical justification. In spite of the lack of convexity, our work provides global convergence guarantees for a new primal-dual algorithm without using any optimization oracles. Other related policy optimization methods include CPG [47], CPO [2, 51], and IPO [27]. However, theoretical guarantees for these algorithms are still lacking. Recently, optimism principles have been used for efficient exploration in CMDPs [42, 59, 16, 38, 17, 6]. In comparison, our work focuses on the optimization landscape within a primal-dual framework.
Funding
  • Ding and Jovanovic were supported by the National Science Foundation under Awards ECCS-1708906 and ECCS-1809833
  • Basar was supported in part by the US Army Research Laboratory (ARL) Cooperative Agreement W911NF-17-2-0196, and in part by the Office of Naval Research (ONR) MURI Grant N00014-16-1-2710
References
  • [1] N. Abe, P. Melville, C. Pendus, C. K. Reddy, D. L. Jensen, V. P. Thomas, J. J. Bennett, G. F. Anderson, B. R. Cooley, M. Kowalczyk, et al. Optimizing debt collections using constrained reinforcement learning. In International Conference on Knowledge Discovery and Data Mining, pages 75–84, 2010.
  • [2] J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In International Conference on Machine Learning, volume 70, pages 22–31, 2017.
  • [3] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. arXiv preprint arXiv:1908.00261, 2019.
  • [4] E. Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.
  • [5] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013.
  • [6] Q. Bai, A. Gattami, and V. Aggarwal. Model-free algorithm and regret analysis for MDPs with peak constraints. arXiv preprint arXiv:2003.05555, 2020.
  • [7] A. Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.
  • [8] D. P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 2014.
  • [9] J. Bhandari and D. Russo. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
  • [10] J. Bhandari and D. Russo. A note on the linear convergence of policy gradient methods. arXiv preprint arXiv:2007.11120, 2020.
  • [11] S. Bhatnagar and K. Lakshmanan. An online actor–critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012.
  • [12] V. S. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.
  • [13] S. Cen, C. Cheng, Y. Chen, Y. Wei, and Y. Chi. Fast global convergence of natural policy gradient methods with entropy regularization. arXiv preprint arXiv:2007.06558, 2020.
  • [14] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • [15] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.
  • [16] D. Ding, X. Wei, Z. Yang, Z. Wang, and M. R. Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. arXiv preprint arXiv:2003.00534, 2020.
  • [17] Y. Efroni, S. Mannor, and M. Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.
  • [18] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476, 2018.
  • [19] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
  • [20] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pages 267–274, 2002.
  • [21] S. M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.
  • [22] A. Koppel, K. Zhang, H. Zhu, and T. Basar. Projected stochastic primal-dual method for constrained online learning with kernels. IEEE Transactions on Signal Processing, 67(10):2528–2542, 2019.
  • [23] Q. Liang, F. Que, and E. Modiano. Accelerated primal-dual policy optimization for safe reinforcement learning. arXiv preprint arXiv:1802.06480, 2018.
  • [24] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • [25] T. Lin, C. Jin, and M. I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, 2019.
  • [26] B. Liu, Q. Cai, Z. Yang, and Z. Wang. Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems, pages 10564–10575, 2019.
  • [27] Y. Liu, J. Ding, and X. Liu. IPO: Interior-point policy optimization under constraints. In AAAI Conference on Artificial Intelligence, 2020.
  • [28] Y. Liu, K. Zhang, T. Basar, and W. Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. In Advances in Neural Information Processing Systems, 2020.
  • [29] M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: Online convex optimization with long term constraints. Journal of Machine Learning Research, 13(Sep):2503–2528, 2012.
  • [30] J. Mei, C. Xiao, C. Szepesvari, and D. Schuurmans. On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, 2020.
  • [31] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • [32] H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanovic. On the linear convergence of random search for discrete-time LQR. IEEE Control Systems Letters, 5(3):989–994, 2020.
  • [33] H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. R. Jovanovic. Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem. IEEE Transactions on Automatic Control, 2019. Submitted; also arXiv:1912.11899.
  • [34] M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn. Solving a class of nonconvex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14905–14916, 2019.
  • [35] M. Ono, M. Pavone, Y. Kuwata, and J. Balaram. Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39(4):555–571, 2015.
  • [36] S. Paternain, M. Calvo-Fullana, L. F. Chamon, and A. Ribeiro. Safe policies for reinforcement learning via primal-dual methods. arXiv preprint arXiv:1911.09101, 2019.
  • [37] S. Paternain, L. Chamon, M. Calvo-Fullana, and A. Ribeiro. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, pages 7553–7563, 2019.
  • [38] S. Qiu, X. Wei, Z. Yang, J. Ye, and Z. Wang. Upper confidence primal-dual optimization: Stochastically constrained Markov decision processes with adversarial losses and unknown transitions. arXiv preprint arXiv:2003.00660, 2020.
  • [39] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • [40] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [41] L. Shani, Y. Efroni, and S. Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In AAAI Conference on Artificial Intelligence, 2020.
  • [42] R. Singh, A. Gupta, and N. B. Shroff. Learning in Markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
  • [43] T. Spooner and R. Savani. A natural actor-critic algorithm with downside risk constraints. arXiv preprint arXiv:2007.04203, 2020.
  • [44] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
  • [45] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
  • [46] C. Tessler, D. J. Mankowitz, and S. Mannor. Reward constrained policy optimization. In International Conference on Learning Representations, 2019.
  • [47] E. Uchibe and K. Doya. Constrained reinforcement learning from intrinsic and extrinsic rewards. In International Conference on Development and Learning, pages 163–168, 2007.
  • [48] L. Wang, Q. Cai, Z. Yang, and Z. Wang. Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations, 2019.
  • [49] X. Wei, H. Yu, and M. J. Neely. Online primal-dual mirror descent under stochastic constraints. In International Conference on Measurement and Modeling of Computer Systems, 2020.
  • [50] J. Yang, N. Kiyavash, and N. He. Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems. arXiv preprint arXiv:2002.09621, 2020.
  • [51] T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge. Projection-based constrained policy optimization. In International Conference on Learning Representations, 2020.
  • [52] H. Yu, M. Neely, and X. Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1428–1438, 2017.
  • [53] M. Yu, Z. Yang, M. Kolar, and Z. Wang. Convergent policy optimization for safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 3121–3133, 2019.
  • [54] J. Yuan and A. Lamperski. Online convex optimization for cumulative constraints. In Advances in Neural Information Processing Systems, pages 6137–6146, 2018.
  • [55] J. Zhang, A. Koppel, A. S. Bedi, C. Szepesvari, and M. Wang. Variational policy gradient method for reinforcement learning with general utilities. arXiv preprint arXiv:2007.02151, 2020.
  • [56] K. Zhang, A. Koppel, H. Zhu, and T. Basar. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 2019.
  • [57] K. Zhang, Z. Yang, and T. Basar. Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. In Advances in Neural Information Processing Systems, pages 11598–11610, 2019.
  • [58] X. Zhang, K. Zhang, E. Miehling, and T. Basar. Non-cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 9482–9493, 2019.
  • [59] L. Zheng and L. J. Ratliff. Constrained upper confidence reinforcement learning. In Conference on Learning for Dynamics and Control, 2020.