Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning

NeurIPS 2020.


Abstract:

Model-based reinforcement learning algorithms with probabilistic dynamical models are amongst the most data-efficient learning methods. This is often attributed to their ability to distinguish between epistemic and aleatoric uncertainty. However, while most algorithms distinguish these two uncertainties for learning the model, the...

Introduction
  • Model-Based Reinforcement Learning (MBRL) with probabilistic dynamical models can solve many challenging high-dimensional tasks with impressive sample efficiency (Chua et al, 2018).
  • To optimize the policy, practical algorithms marginalize over both the aleatoric and epistemic uncertainty to optimize the expected performance under the current model, as in PILCO (Deisenroth and Rasmussen, 2011).
  • This greedy exploitation can cause the optimization to get stuck in local minima even in simple environments like the swing-up of an inverted pendulum: in Fig. 1, the authors show that for large action penalties, the greedy approach fails to find a policy that swings up the pendulum (the greedy and optimistic objectives are sketched below).
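    To make the contrast concrete, here is a minimal sketch of the two objectives involved (the notation p_t for the model posterior, M_t for a set of plausible models, and H for the episode horizon is assumed for illustration, not quoted from the paper). Greedy exploitation maximizes the return expected under the current posterior, whereas optimistic exploration additionally maximizes over the plausible models:

      \pi_t^{\mathrm{greedy}} = \arg\max_{\pi} \; \mathbb{E}_{f \sim p_t}\big[ J(f, \pi) \big], \qquad
      \pi_t^{\mathrm{optimistic}} = \arg\max_{\pi} \; \max_{\tilde{f} \in \mathcal{M}_t} J(\tilde{f}, \pi),
      \quad \text{where } J(f, \pi) = \mathbb{E}\Big[ \textstyle\sum_{h=0}^{H-1} r\big(s_h, \pi(s_h)\big) \Big].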
Highlights
  • Model-Based Reinforcement Learning (MBRL) with probabilistic dynamical models can solve many challenging high-dimensional tasks with impressive sample efficiency (Chua et al, 2018)
  • Aleatoric uncertainty is inherent to the system, whereas epistemic uncertainty arises from data scarcity (Der Kiureghian and Ditlevsen, 2009)
  • We believe that the poor performance of Thompson sampling relative to H-UCRL (Hallucinated Upper-Confidence Reinforcement Learning) suggests that the five heads that we use are sufficient to construct reasonable confidence intervals, but do not comprise a rich enough posterior distribution for Thompson sampling.
  • On the 7-DOF PR2 robot, we evaluate how H-UCRL performs in higher-dimensional problems
  • We introduced H-UCRL: a practical optimistic-exploration algorithm for deep MBRL
  • The key idea is a reduction from optimistic exploration to greedy exploitation in an augmented policy space. This insight enables the use of highly effective standard MBRL algorithms that previously were restricted to greedy exploitation and Thompson sampling (this reduction is sketched below)
  • We provide a theoretical analysis of H-UCRL and show that it attains sublinear regret for some model classes
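    As a concrete illustration of this reduction (and of the epistemic/aleatoric split above), the following Python sketch shows how an ensemble of learned dynamics models can supply epistemic confidence intervals and how a hallucinated policy selects an optimistic transition inside them. This is a sketch under assumptions, not the authors' implementation: `ensemble` (a list of models with a `predict(s, a)` method), `pi`, `eta`, `reward_fn`, and the scale `beta` are illustrative names; the per-member aleatoric noise variance is omitted.

      import numpy as np

      def epistemic_stats(ensemble, s, a):
          """Mean prediction and epistemic std, measured as disagreement between ensemble members."""
          preds = np.stack([m.predict(s, a) for m in ensemble])  # shape: (n_models, state_dim)
          return preds.mean(axis=0), preds.std(axis=0)

      def optimistic_return(ensemble, pi, eta, s0, horizon, beta, reward_fn):
          """Greedy rollout in the hallucinated model: eta(s) in [-1, 1]^d picks the most
          favourable next state inside the one-step epistemic confidence interval."""
          s, ret = s0, 0.0
          for _ in range(horizon):
              a = pi(s)
              ret += reward_fn(s, a)
              mu, sigma_epi = epistemic_stats(ensemble, s, a)
              s = mu + beta * sigma_epi * np.clip(eta(s), -1.0, 1.0)  # optimistic transition
          return ret

    Any off-the-shelf greedy MBRL policy optimizer can then be applied jointly to the pair (pi, eta), while only pi is executed on the real system.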
Methods
  • All exploration strategies achieve state-of-the-art performance, which seems to indicate that greedy exploitation is sufficient for these tasks.
  • This is due to the over-actuated dynamics and the reward structure.
  • The lower-right plot of Fig. 3 shows a clear advantage of using H-UCRL across different action penalties, even at zero.
  • This indicates that H-UCRL copes with action penalties and explores effectively through complex dynamics.
Results
  • The theorem ensures that, if the authors evaluate optimistic policies according to (7), they eventually achieve performance J(f, π_t) arbitrarily close to the optimal performance J(f, π*), provided that I_T(S, A) grows at a rate smaller than T (formalized in the sketch below).
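    To make the statement precise, a standard formalization (notation consistent with J(f, π) above): the cumulative regret over T episodes is

      R_T = \sum_{t=1}^{T} \big( J(f, \pi^*) - J(f, \pi_t) \big),

    and sublinear growth, R_T / T \to 0, implies \min_{t \le T} \big( J(f, \pi^*) - J(f, \pi_t) \big) \to 0, i.e. some evaluated policy is eventually near-optimal. In the paper this holds whenever the complexity term I_T(S, A) grows at a rate smaller than T.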
Conclusion
  • The authors introduced H-UCRL: a practical optimistic-exploration algorithm for deep MBRL.
  • The key idea is a reduction from optimistic exploration to greedy exploitation in an augmented policy space.
  • This insight enables the use of highly effective standard MBRL algorithms that previously were restricted to greedy exploitation and Thompson sampling.
  • H-UCRL performs as well as or better than other exploration algorithms, achieving state-of-the-art performance on the evaluated tasks.
Related work
  • MBRL is a promising avenue towards applying RL methods to complex real-life decision problems due to its sample efficiency (Deisenroth et al, 2013). For instance, Kaiser et al (2019) use MBRL to solve the Atari suite, whereas Kamthe and Deisenroth (2018) solve low-dimensional continuous-control problems using GP models and Chua et al (2018) solve high-dimensional continuous-control problems using ensembles of probabilistic neural networks (NNs). All these approaches perform greedy exploitation under the current model using a variant of PILCO (Deisenroth and Rasmussen, 2011). Unfortunately, greedy exploitation is provably optimal only in very limited cases such as linear quadratic regulators (LQR) (Mania et al, 2019).

    Variants of Thompson (posterior) sampling are a common approach for provable exploration in reinforcement learning. In particular, Osband et al (2013) propose Thompson sampling for tabular MDPs, and Chowdhury and Gopalan (2019) prove a Õ(√T) regret bound for continuous states and actions for this theoretical algorithm, where T is the number of episodes. However, the algorithm requires sampling from posterior GP models, which is intractable over a continuous domain. In general, Thompson sampling can be applied only when it is tractable to sample from the posterior distribution over dynamical models. Moreover, Wang et al (2018) suggest that approximate inference methods may suffer from variance starvation and limited exploration (an ensemble-based sketch of this sampling loop follows below).
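    For contrast with the optimistic rollout sketched under Highlights, here is a minimal Python sketch of Thompson sampling when the posterior over dynamics models is approximated by an ensemble. The names `ensemble`, `plan_greedy`, `rollout_real`, and the `update` method are hypothetical and used only for illustration.

      import random

      def thompson_sampling_episode(ensemble, plan_greedy, rollout_real):
          """One episode of approximate Thompson sampling: draw a single model from the
          approximate posterior (a uniformly chosen ensemble member) and exploit it greedily."""
          sampled_model = random.choice(ensemble)  # approximate posterior sample
          policy = plan_greedy(sampled_model)      # greedy exploitation of the sampled model
          transitions = rollout_real(policy)       # collect data on the real system
          for model in ensemble:                   # refit the posterior approximation
              model.update(transitions)
          return policy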
Funding
  • This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 815943)
  • It was also supported by a fellowship from the Open Philanthropy Project
Reference
  • Yasin Abbasi-Yadkori. Online learning of linearly parameterized control problems. PhD Thesis, University of Alberta, 2012.
  • Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
  • Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.
  • Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. cambridge university press, 2009.
  • Evan Archer, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367, 2015.
  • Felix Berkenkamp. Safe Exploration in Reinforcement Learning: Theory and Applications in Robotics. PhD thesis, ETH Zurich, 2019.
  • Felix Berkenkamp, Angela P. Schoellig, and Andreas Krause. No-Regret Bayesian optimization with unknown hyperparameters. Journal of Machine Learning Research (JMLR), 20(50):1–24, 2019.
  • Dimitri P. Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 1995.
  • Ronen I. Brafman and Moshe Tennenholtz. R-max - a General Polynomial Time Algorithm for Near-optimal Reinforcement Learning. J. Mach. Learn. Res., 3:213–231, 2003.
  • Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv:1012.2599 [cs], 2010.
  • Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sampleefficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pages 8224–8234, 2018.
  • Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(Oct):2879–2904, 2011.
  • Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 844–853. PMLR, 2017.
  • Sayak Ray Chowdhury and Aditya Gopalan. Online Learning in Kernelized Markov Decision Processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3197–3205, 2019.
  • Andreas Christmann and Ingo Steinwart. Support Vector Machines. Information Science and Statistics. Springer, New York, NY, 2008.
  • Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4754–4765. Curran Associates, Inc., 2018.
  • Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, pages 617–629, 2018.
  • Sebastian Curi, Silvan Melchior, Felix Berkenkamp, and Andreas Krause. Structured variational inference in unstable gaussian process state space models. Proceedings of Machine Learning Research vol, 120:1–11, 2020.
  • Marc Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proc. of the International Conference on Machine Learning (ICML), pages 465–472, 2011.
  • Marc Deisenroth, Dieter Fox, and Carl Rasmussen. Gaussian processes for data-efficient learning in robotics and control. Transactions on Pattern Analysis and Machine Intelligence, 37(2):1–1, 2014.
  • Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. A survey on policy search for robotics. now publishers, 2013.
  • Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105–112, 2009.
  • Andreas Doerr, Christian Daniel, Martin Schiegg, Duy Nguyen-Tuong, Stefan Schaal, Marc Toussaint, and Sebastian Trimpe. Probabilistic recurrent state-space models. In International Conference on Machine Learning (ICML), pages 1280–1289. PMLR, 2018.
  • Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, and Michal Valko. Regret bounds for kernel-based reinforcement learning. arXiv preprint arXiv:2004.05599, 2020.
  • Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. In Advances in Neural Information Processing Systems, pages 12203–12213, 2019.
  • Yonina C Eldar and Gitta Kutyniok. Compressed sensing: theory and applications. Cambridge university press, 2012.
  • Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
  • Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
  • Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, volume 4, page 34, 2016.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning Continuous Control Policies by Stochastic Value Gradients. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2944–2952. Curran Associates, Inc., 2015.
  • Lukas Hewing, Elena Arcari, Lukas P Fröhlich, and Melanie N Zeilinger. On simulation and trajectory prediction with gaussian process dynamics. arXiv preprint arXiv:1912.10900, 2019.
  • Zhang-Wei Hong, Joni Pajarinen, and Jan Peters. Model-based lookahead reinforcement learning. arXiv preprint arXiv:1908.06012, 2019.
  • David H Jacobson. New second-order and first-order algorithms for determining optimal control: A differential dynamic programming approach. Journal of Optimization Theory and Applications, 2 (6):411–440, 1968.
  • Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498–12509, 2019.
  • Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
  • Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
  • Sanket Kamthe and Marc Deisenroth. Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control. In International Conference on Artificial Intelligence and Statistics, pages 1701–1710, 2018.
  • Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, and Bharath K. Sriperumbudur. Gaussian processes and kernel methods: a review on connections and equivalences. arXiv:1807.02582 [stat.ML], 2018.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • Johannes Kirschner and Andreas Krause. Information directed sampling and bandits with heteroscedastic noise. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 358–384. PMLR, 2018.
  • Andreas Krause and Cheng S. Ong. Contextual Gaussian process bandit optimization. In Proc. of Neural Information Processing Systems (NIPS), pages 2447–2455, 2011.
  • Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263, 2018.
  • Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
  • Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
  • Armin Lederer, Jonas Umlauft, and Sandra Hirche. Uniform Error Bounds for Gaussian Process Regression with Application to Safe Control. arXiv:1906.01376 [cs, stat], 2019.
  • Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.
  • Weiwei Li and Emanuel Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pages 222–229, 2004.
  • Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv:1509.02971 [cs, stat], 2015.
  • Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. In International Conference on Learning Representations (ICLR), 2019.
  • Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling. In Advances in neural information processing systems, pages 3258–3266, 2017.
  • Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
  • Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Neural Information Processing Systems, pages 10154–10164, 2019.
  • A. McHutchon. Modelling nonlinear dynamical systems with Gaussian Processes. PhD thesis, University of Cambridge, 2014.
  • Teodor Mihai Moldovan, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. Optimism-driven exploration for nonlinear systems. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 3239–3246. IEEE, 2015.
  • Manfred Morari and Jay H. Lee. Model predictive control: past, present and future. Computers & Chemical Engineering, 23(4–5):667–682, 1999.
  • Mojmir Mutny and Andreas Krause. Efficient High Dimensional Bayesian Optimization with Additivity and Quadrature Fourier Features. In Advances in Neural Information Processing Systems, pages 9005–9016, 2018.
  • Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
  • Ian Osband, Dan Russo, and Benjamin Van Roy. (More) Efficient Reinforcement Learning via Posterior Sampling. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3003–3011. Curran Associates, Inc., 2013.
  • Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and Exploration via Randomized Value Functions. arXiv:1402.0635 [cs, stat], 2014.
  • Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in neural information processing systems, pages 4026–4034, 2016.
  • Paavo Parmas, Carl Edward Rasmussen, Jan Peters, and Kenji Doya. Pipps: Flexible model-based policy search robust to the curse of chaos. In International Conference on Machine Learning, pages 4065–4074, 2018.
  • Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
  • Carl Edward Rasmussen and Christopher K.I Williams. Gaussian processes for machine learning. MIT Press, Cambridge MA, 2006.
  • Arthur Richards and Jonathan P. How. Robust variable horizon model predictive control for vehicle maneuvering. International Journal of Robust and Nonlinear Control, 16(7):333–351, 2006.
  • Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1723–1742, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.
  • John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs], 2017.
  • Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
  • Richard S. Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In Bruce Porter and Raymond Mooney, editors, Machine Learning Proceedings 1990, pages 216–224. Morgan Kaufmann, San Francisco (CA), 1990.
  • Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. MIT press, 1998.