# Provably Efficient Exploration in Policy Optimization

ICML, pp. 1283-1294, 2020.

Abstract:

While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this pap...

Introduction

- Coupled with powerful function approximators such as neural networks, policy optimization plays a key role in the tremendous empirical successes of deep reinforcement learning (Silver et al, 2016, 2017; Duan et al, 2016; OpenAI, 2019; Wang et al, 2018).
- Vanilla policy gradient can be shown to suffer from exponentially large variance in the well-known “combination lock” setting (Kakade, 2003; Leffler et al, 2007; Azar et al, 2012a), which only has a finite state space.
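The exponential blow-up in the “combination lock” setting can be seen in a few lines of simulation. The sketch below is our own illustrative construction (the environment, function names, and parameters are not from the paper): a uniform policy receives a nonzero reward signal only on roughly a 2^(-H) fraction of episodes, so a REINFORCE-style gradient estimate has variance scaling exponentially in the horizon H.

```python
import random

def combination_lock_rollout(h_horizon, secret, policy_prob=0.5):
    """One episode of a combination-lock MDP: reward 1 only if the
    agent plays the secret action at every one of the H steps."""
    for h in range(h_horizon):
        action = 1 if random.random() < policy_prob else 0
        if action != secret[h]:
            return 0.0  # one wrong action forfeits the reward forever
    return 1.0

def hit_rate(h_horizon, n_episodes=20000, seed=0):
    """Fraction of episodes on which a uniform policy sees any reward;
    this is roughly 2**(-h_horizon), i.e., the gradient signal vanishes."""
    random.seed(seed)
    secret = [1] * h_horizon
    hits = sum(combination_lock_rollout(h_horizon, secret)
               for _ in range(n_episodes))
    return hits / n_episodes
```

For H = 4 the hit rate is about 1/16; for H = 12 it is already near 2^(-12) ≈ 0.00024, illustrating why undirected exploration fails here.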

Highlights

- In a more practical setting, the agent sequentially explores the state space, and exploits the information at hand by taking the actions that lead to higher expected total reward. Such an exploration-exploitation tradeoff is better captured by the aforementioned statistical question regarding the regret or sample complexity, which remains even more challenging to answer than the computational question. As a result, such a lack of statistical understanding hinders the development of more sample-efficient policy optimization algorithms beyond heuristics
- We propose the first policy optimization algorithm that incorporates exploration in a principled manner
- Despite the differences between policy-based and value-based reinforcement learning, our work shows that the general principle of “optimism in the face of uncertainty” (Auer et al, 2002; Bubeck and Cesa-Bianchi, 2012) carries over from existing algorithms based on value iteration, e.g., optimistic least-squares value iteration, to policy optimization algorithms, e.g., natural policy gradient, trust-region policy optimization, and proximal policy optimization, and makes them sample-efficient. This further leads to a new general principle of “conservative optimism in the face of uncertainty and adversary” that allows adversarially chosen reward functions.
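As a sketch of how optimism enters the policy update, the following code (our own illustration with hypothetical names; the actual OPPO algorithm also estimates the Q-function from data) performs one NPG/PPO-style mirror-descent step against an optimistically boosted Q-function, i.e., $\pi_k(\cdot\,|\,x) \propto \pi_{k-1}(\cdot\,|\,x)\exp\{\alpha(\widehat{Q} + \text{bonus})\}$:

```python
import numpy as np

def optimistic_mirror_update(prev_policy, q_estimate, bonus, alpha=0.1):
    """One KL-regularized mirror-descent step against an optimistic
    Q-function: pi_k(a|x) proportional to pi_{k-1}(a|x) * exp(alpha*(Q+bonus)).
    All arguments are arrays of shape (n_states, n_actions)."""
    logits = np.log(prev_policy) + alpha * (q_estimate + bonus)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)
```

Starting from a uniform policy, the update shifts probability mass toward actions with higher optimistic value while the KL regularization keeps the step conservative; with bonus = 0 it reduces to the standard NPG/PPO closed form.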

Conclusion

- The authors consider the ideal setting where the transition dynamics are known, which, by the Bellman equation defined in (2.4), allows them to access the Q-function $Q^{\pi}_{h,k}$ for any policy π and (h, k) ∈ [H] × [K] once given the reward function $r_k$.
- The following lemma connects the difference between two policies to the difference between their expected total rewards through the Q-function.
- The following lemma characterizes the policy improvement step defined in (3.2), where the updated policy πk takes the closed form in (3.3).
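The first lemma referenced above is, up to notation, the classical performance-difference identity; a hedged sketch of its statement in the episodic notation used here:

```latex
V^{\pi}_{1}(x_1) - V^{\pi'}_{1}(x_1)
  = \mathbb{E}_{\pi'}\!\left[\,\sum_{h=1}^{H}
      \big\langle Q^{\pi}_{h}(x_h, \cdot),\;
      \pi_h(\cdot \mid x_h) - \pi'_h(\cdot \mid x_h) \big\rangle \right],
```

where the expectation is taken over trajectories generated by π′ and ⟨·, ·⟩ denotes the inner product over the action space.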


Related work

- Our work is based on the aforementioned line of recent work (Fazel et al, 2018; Yang et al, 2019a; Abbasi-Yadkori et al, 2019a,b; Bhandari and Russo, 2019; Liu et al, 2019; Agarwal et al, 2019; Wang et al, 2019) on the computational efficiency of policy optimization, which covers PG, NPG, TRPO, PPO, and AC. In particular, OPPO is based on PPO (and similarly, NPG and TRPO), which has been shown to converge to the globally optimal policy at sublinear rates in tabular and linear settings, as well as nonlinear settings involving neural networks (Liu et al, 2019; Wang et al, 2019). However, without assuming access to a “simulator” or finite concentrability coefficients, both of which imply that the state space is already well explored, it remains unclear whether any such algorithm is sample-efficient, that is, attains a finite regret or sample complexity. In comparison, by incorporating uncertainty quantification into the action-value function at each update, which explicitly encourages exploration, OPPO not only attains the same computational efficiency as NPG, TRPO, and PPO, but is also shown to be sample-efficient with a $\sqrt{d^3H^3T}$-regret up to logarithmic factors.
- Our work is closely related to another line of work (Even-Dar et al, 2009; Yu et al., 2009; Neu et al, 2010a,b; Zimin and Neu, 2013; Neu et al, 2012; Rosenberg and Mansour, 2019a,b) on online MDPs with adversarially chosen reward functions, which mostly focuses on the tabular setting.
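In the linear setting, the “uncertainty quantification” bonus is the standard elliptical-confidence width $\beta\sqrt{\phi^\top \Lambda^{-1} \phi}$, with $\Lambda = \lambda I + \sum_\tau \phi_\tau \phi_\tau^\top$ the regularized Gram matrix of observed features. A minimal sketch (the function names are our own, and in the paper the scaling β depends on d, H, and a logarithmic factor):

```python
import numpy as np

def ucb_bonus(features_seen, phi_query, beta=1.0, lam=1.0):
    """Elliptical-confidence exploration bonus for linear function
    approximation: beta * sqrt(phi^T Lambda^{-1} phi), where
    Lambda = lam * I + sum over observed features of phi phi^T."""
    d = phi_query.shape[0]
    Lambda = lam * np.eye(d)
    for phi in features_seen:
        Lambda += np.outer(phi, phi)
    return beta * np.sqrt(phi_query @ np.linalg.solve(Lambda, phi_query))
```

The bonus is large for directions of feature space that have rarely been visited and shrinks as data accumulates, which is exactly what drives the directed exploration behind the $\sqrt{d^3H^3T}$-regret.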

• Assuming the transition dynamics are known and the full information of the reward functions is available, the work of Even-Dar et al (2009) establishes a $\tau^2\sqrt{T \log|A|}$-regret, where A is the action space, |A| is its cardinality, and τ upper bounds the mixing time of the MDP. See also the work of Yu et al (2009), which establishes a $T^{2/3}$-regret in a similar setting.

Reference

- Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C. and Weisz, G. (2019a). POLITEX: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, vol. 97.
- Abbasi-Yadkori, Y., Lazic, N., Szepesvari, C. and Weisz, G. (2019b). Exploration-enhanced POLITEX. arXiv preprint arXiv:1908.10479.
- Abbasi-Yadkori, Y., Pal, D. and Szepesvari, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems.
- Agarwal, A., Kakade, S. M., Lee, J. D. and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
- Antos, A., Szepesvari, C. and Munos, R. (2008). Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems.
- Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47 235–256.
- Azar, M. G., Gomez, V. and Kappen, H. J. (2012a). Dynamic policy programming. Journal of Machine Learning Research, 13 3207–3245.
- Azar, M. G., Munos, R., Ghavamzadaeh, M. and Kappen, H. J. (2011). Speedy Q-learning. In Advances in Neural Information Processing Systems.
- Azar, M. G., Munos, R. and Kappen, B. (2012b). On the sample complexity of reinforcement learning with a generative model. arXiv preprint arXiv:1206.6461.
- Azar, M. G., Osband, I. and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning.
- Baxter, J. and Bartlett, P. L. (2000). Direct gradient-based reinforcement learning. In International Symposium on Circuits and Systems.
- Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
- Boyan, J. A. (2002). Least-squares temporal difference learning. Machine Learning, 49 233– 246.
- Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22 33–57.
- Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5 1–122.
- Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
- Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. arXiv preprint arXiv:1905.00360.
- Chu, W., Li, L., Reyzin, L. and Schapire, R. (2011). Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics.
- Dani, V., Hayes, T. P. and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. Conference on Learning Theory.
- Dann, C., Lattimore, T. and Brunskill, E. (2017). Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems.
- Dong, K., Peng, J., Wang, Y. and Zhou, Y. (2019). $\sqrt{n}$-regret for learning in Markov decision processes with function approximation and low Bellman rank. arXiv preprint arXiv:1909.02506.
- Du, S. S., Kakade, S. M., Wang, R. and Yang, L. F. (2019a). Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016.
- Du, S. S., Luo, Y., Wang, R. and Zhang, H. (2019b). Provably efficient Q-learning with function approximation via distribution shift error checking oracle. arXiv preprint arXiv:1906.06321.
- Duan, Y., Chen, X., Houthooft, R., Schulman, J. and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning.
- Even-Dar, E., Kakade, S. M. and Mansour, Y. (2009). Online Markov decision processes. Mathematics of Operations Research, 34 726–736.
- Farahmand, A.-m., Szepesvari, C. and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems.
- Fazel, M., Ge, R., Kakade, S. M. and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039.
- Geist, M., Scherrer, B. and Pietquin, O. (2019). A theory of regularized Markov decision processes. arXiv preprint arXiv:1901.11275.
- Jaksch, T., Ortner, R. and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11 1563–1600.
- Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J. and Schapire, R. E. (2017). Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning.
- Jin, C., Allen-Zhu, Z., Bubeck, S. and Jordan, M. I. (2018). Is Q-learning provably efficient? In Advances in Neural Information Processing Systems.
- Jin, C., Yang, Z., Wang, Z. and Jordan, M. I. (2019). Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388.
- Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.
- Kakade, S. M. (2003). On the Sample Complexity of Reinforcement Learning. Ph.D. thesis, University of London.
- Koenig, S. and Simmons, R. G. (1993). Complexity analysis of real-time reinforcement learning. In Association for the Advancement of Artificial Intelligence.
- Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems.
- Lattimore, T. and Szepesvari, C. (2019). Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676.
- Leffler, B. R., Littman, M. L. and Edmunds, T. (2007). Efficient reinforcement learning with relocatable action models. In Association for the Advancement of Artificial Intelligence.
- Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.
- Mania, H., Guy, A. and Recht, B. (2018). Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055.
- Munos, R. and Szepesvari, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9 815–857.
- Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley.
- Neu, G., Antos, A., Gyorgy, A. and Szepesvari, C. (2010a). Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems.
- Neu, G., Gyorgy, A. and Szepesvari, C. (2010b). The online loop-free stochastic shortest-path problem. In Conference on Learning Theory.
- Neu, G., Gyorgy, A. and Szepesvari, C. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. In International Conference on Artificial Intelligence and Statistics.
- Neu, G., Jonsson, A. and Gomez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.
- OpenAI (2019). OpenAI Five. https://openai.com/five/.
- Osband, I. and Van Roy, B. (2016). On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732.
- Osband, I., Van Roy, B. and Wen, Z. (2014). Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635.
- Rosenberg, A. and Mansour, Y. (2019a). Online convex optimization in adversarial Markov decision processes. arXiv preprint arXiv:1905.07773.
- Rosenberg, A. and Mansour, Y. (2019b). Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems.
- Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35 395–411.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Sidford, A., Wang, M., Wu, X., Yang, L. and Ye, Y. (2018a). Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems.
- Sidford, A., Wang, M., Wu, X. and Ye, Y. (2018b). Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Symposium on Discrete Algorithms.
- Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529 484.
- Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. et al. (2017). Mastering the game of Go without human knowledge. Nature, 550 354.
- Strehl, A. L., Li, L., Wiewiora, E., Langford, J. and Littman, M. L. (2006). PAC model-free reinforcement learning. In International Conference on Machine Learning.
- Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT.
- Sutton, R. S., McAllester, D. A., Singh, S. P. and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
- Tosatto, S., Pirotta, M., D’Eramo, C. and Restelli, M. (2017). Boosted fitted Q-iteration. In International Conference on Machine Learning.
- Van Roy, B. and Dong, S. (2019). Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910.
- Wainwright, M. J. (2019). Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697.
- Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
- Wang, W. Y., Li, J. and He, X. (2018). Deep reinforcement learning for NLP. In Association for Computational Linguistics.
- Wen, Z. and Van Roy, B. (2017). Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42 762–782.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8 229–256.
- Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11 2543–2596.
- Yang, L. and Wang, M. (2019a). Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning.
- Yang, L. F. and Wang, M. (2019b). Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389.
- Yang, Z., Chen, Y., Hong, M. and Wang, Z. (2019a). On the global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. arXiv preprint arXiv:1907.06246.
- Yang, Z., Xie, Y. and Wang, Z. (2019b). A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137.
- Yu, J. Y., Mannor, S. and Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34 737–757.
- Zimin, A. and Neu, G. (2013). Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems.
- In this section, we present the supporting lemmas, several of which are adapted from Section D of Jin et al. (2019) and accordingly tailored to our setting.
- Lemma D.3 (Lemma D.4 of Jin et al. (2019) and Theorem 1 of Abbasi-Yadkori et al. (2011)).
- Lemma D.5 (Lemma D.1 of Jin et al. (2019)). For any (k, h) ∈ [K] × [H], it holds that $\sum_{\tau=1}^{k-1} \phi(x^\tau_h, a^\tau_h)^\top (\Lambda^k_h)^{-1} \phi(x^\tau_h, a^\tau_h) \le d$.
- Lemma D.6 (Elliptical Potential Lemma of Dani et al. (2008); Rusmevichientong and Tsitsiklis (2010); Chu et al. (2011); Abbasi-Yadkori et al. (2011); Jin et al. (2019)).
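The elliptical potential lemma is easy to check numerically: the running sum $\sum_k \phi_k^\top \Lambda_k^{-1} \phi_k$ grows only logarithmically in the number of rounds rather than linearly. A small sketch under our own illustrative setup (random unit-norm features, λ = 1; names are not from the paper):

```python
import numpy as np

def elliptical_potential(phis, lam=1.0):
    """Running sum of phi_k^T Lambda_k^{-1} phi_k, where
    Lambda_k = lam * I + sum_{tau < k} phi_tau phi_tau^T.
    The elliptical potential lemma bounds this sum by O(d log T)."""
    d = phis.shape[1]
    Lambda = lam * np.eye(d)
    total = 0.0
    for phi in phis:
        total += float(phi @ np.linalg.solve(Lambda, phi))
        Lambda += np.outer(phi, phi)
    return total

rng = np.random.default_rng(0)
T, d = 500, 4
phis = rng.normal(size=(T, d))
phis /= np.linalg.norm(phis, axis=1, keepdims=True)  # enforce ||phi|| <= 1
potential = elliptical_potential(phis)  # grows like d*log(T), far below T
```

This logarithmic growth is what turns the per-step confidence widths into a sublinear total regret in the analysis.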
