# Safe Model-based Reinforcement Learning with Stability Guarantees

Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.

Abstract:

Reinforcement learning is a powerful paradigm for learning optimal policies from experimental data. However, to find optimal policies, most reinforcement learning algorithms explore all possible actions, which may be harmful for real-world systems. As a consequence, learning algorithms are rarely applied on safety-critical systems in the real world.

Code: https://github.com/befelix/safe_learning

Introduction

- While reinforcement learning (RL, [1]) algorithms have achieved impressive results in games, for example on the Atari platform [2], they are rarely applied to real-world physical systems outside of academia.
- The authors define the safety constraint on the state divergence that occurs when leaving the region of attraction
- This means that adapting the policy is not allowed to decrease the region of attraction, and exploratory actions to learn about the dynamics f(·) are not allowed to drive the system outside the region of attraction.
- To expand the safe set, the authors need to generalize learned knowledge about the dynamics to states that have not yet been visited.
- To this end, the authors restrict themselves to the general and practically relevant class of models that are Lipschitz continuous.
- The considered control policies π lie in a set ΠL of functions that are Lπ-Lipschitz continuous with respect to the 1-norm
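The idea behind these bullet points can be illustrated with a toy sketch: for a Lipschitz-continuous discrete-time system, checking the Lyapunov decrease condition v(f(x)) < v(x) on a discretization of the state space certifies a level set of v as an estimate of the region of attraction. The dynamics, Lyapunov candidate, grid, and level values below are illustrative assumptions, not the paper's implementation.

```python
def f(x):
    """Toy stable scalar dynamics x_{k+1} = f(x_k); a stand-in for the learned model."""
    return 0.9 * x - 0.1 * x ** 3

def v(x):
    """Quadratic Lyapunov candidate."""
    return x * x

def region_of_attraction(levels, grid):
    """Return the largest level c such that v strictly decreases on all
    grid points inside the level set {x : v(x) <= c} (excluding the origin)."""
    best = 0.0
    for c in levels:
        inside = [x for x in grid if v(x) <= c]
        if inside and all(v(f(x)) < v(x) for x in inside if x != 0):
            best = c
    return best

grid = [i / 100.0 for i in range(-300, 301)]
levels = [0.1 * k for k in range(1, 50)]
c = region_of_attraction(levels, grid)  # estimated region of attraction: {x : v(x) <= c}
```

The paper's actual method additionally accounts for statistical model error and discretization constants when verifying the decrease condition; this sketch only shows the level-set certification step on known dynamics.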

Highlights

- While reinforcement learning (RL, [1]) algorithms have achieved impressive results in games, for example on the Atari platform [2], they are rarely applied to real-world physical systems outside of academia
- We need to specify a control policy π : X → U that, given the current state, determines the appropriate control action that drives the system to some goal state, which we set as the origin without loss of generality [4]
- We define the safety constraint on the state divergence that occurs when leaving the region of attraction. This means that adapting the policy is not allowed to decrease the region of attraction, and exploratory actions to learn about the dynamics f(·) are not allowed to drive the system outside the region of attraction
- In a discontinuous system even a slight change in the control policy can lead to drastically different behavior
- We have shown how classical reinforcement learning can be combined with safety constraints in terms of stability
- We believe that our results present an important first step towards safe reinforcement learning algorithms that are applicable to real-world problems
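The Lipschitz-continuity assumption is what rules out the discontinuous behavior mentioned above: for Lipschitz closed-loop dynamics, a small perturbation of the policy produces only a proportionally small change in the resulting trajectory. The toy linear system and gains below are illustrative assumptions, not taken from the paper.

```python
def f(x, u):
    """Toy Lipschitz dynamics x_{k+1} = 0.5 x + u."""
    return 0.5 * x + u

def rollout(k, x0, steps=20):
    """Simulate the closed loop under the linear policy u = -k * x."""
    x = x0
    traj = [x]
    for _ in range(steps):
        x = f(x, -k * x)
        traj.append(x)
    return traj

t1 = rollout(0.30, 1.0)
t2 = rollout(0.31, 1.0)  # slightly perturbed policy gain
# Maximum trajectory deviation stays on the order of the policy perturbation.
gap = max(abs(a - b) for a, b in zip(t1, t2))
```

In a discontinuous system, by contrast, an arbitrarily small gain change could switch the closed loop between qualitatively different behaviors, which is why the authors restrict attention to Lipschitz-continuous models and policies.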

Methods

**Experiments**

A Python implementation of Algorithm 1 and the experiments, based on TensorFlow [37] and GPflow [38], is available at https://github.com/befelix/safe_learning.

- The authors verify the approach on an inverted pendulum benchmark problem.
- The true, continuous-time dynamics are given by ml²ψ̈ = gml sin(ψ) − λψ̇ + u, where ψ is the angle, m the mass, l the length, g the gravitational constant, λ the friction coefficient, and u the torque applied to the pendulum.
- The authors use a GP model for the discrete-time dynamics, where the mean dynamics are given by a linearized and discretized model of the true dynamics that considers a wrong, lower mass and neglects friction.
- The authors use a combination of linear and Matérn kernels in order to capture the model errors that result from parameter and integration errors
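As a concrete sketch of this setup, the true pendulum dynamics can be Euler-discretized, with the prior mean model given by a linearized version that uses a deliberately wrong, lower mass and no friction; the GP then only needs to model the residual between observed transitions and this prior. All parameter values below are assumptions for illustration, not taken from the paper.

```python
import math

# Illustrative parameters (assumed, not from the paper): mass, length,
# gravity, friction coefficient, and Euler discretization step.
m, l, g, lam = 0.15, 0.5, 9.81, 0.05
dt = 0.01

def true_step(psi, dpsi, u):
    """One Euler step of the true dynamics ml^2 psi'' = gml sin(psi) - lam psi' + u."""
    ddpsi = (g * m * l * math.sin(psi) - lam * dpsi + u) / (m * l ** 2)
    return psi + dt * dpsi, dpsi + dt * ddpsi

def prior_step(psi, dpsi, u, m_wrong=0.1):
    """Prior mean model: linearized (sin(psi) ~ psi), wrong lower mass, no friction."""
    ddpsi = (g * m_wrong * l * psi + u) / (m_wrong * l ** 2)
    return psi + dt * dpsi, dpsi + dt * ddpsi
```

Note that the mass cancels in the gravity term, so the lower assumed mass mainly makes the prior overestimate the response to torque; the combination of a linear kernel (for such parameter errors) and a Matérn kernel (for the linearization and integration errors) is what lets the GP capture this structured residual.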

Conclusion

- The authors have shown how classical reinforcement learning can be combined with safety constraints in terms of stability.
- The authors showed how to safely optimize policies and give stability certificates based on statistical models of the dynamics.
- The authors provided theoretical safety and exploration guarantees for an algorithm that can drive the system to desired state-action pairs during learning.
- The authors believe that the results present an important first step towards safe reinforcement learning algorithms that are applicable to real-world problems


Funding

- This research was supported by SNSF grant 200020_159557, the Max Planck ETH Center for Learning Systems, NSERC grant RGPIN-2014-04634, and the Ontario Early Researcher Award

References

- Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. MIT press, 1998.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv:1606.06565 [cs], 2016.
- Hassan K. Khalil and J. W. Grizzle. Nonlinear systems, volume 3. Prentice Hall, 1996.
- Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning – an overview. In Modelling and Simulation for Autonomous Systems, pages 357–375. Springer, 2014.
- Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research (JMLR), 16:1437–1480, 2015.
- Stefano P. Coraluppi and Steven I. Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica, 35(2):301–309, 1999.
- Peter Geibel and Fritz Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. J. Artif. Intell. Res.(JAIR), 24:81–108, 2005.
- Aviv Tamar, Shie Mannor, and Huan Xu. Scaling Up Robust MDPs by Reinforcement Learning. In Proc. of the International Conference on Machine Learning (ICML), 2014.
- Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov Decision Processes. Mathematics of Operations Research, 38(1):153–183, 2012.
- Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. In Proc. of the International Conference on Machine Learning (ICML), pages 1711–1718, 2012.
- Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Proc. of the Conference on Neural Information Processing Systems (NIPS), pages 4305–4313, 2016.
- Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2219–2225, 2006.
- Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proc. of the International Conference on Machine Learning (ICML), 2017.
- Jonas Mockus. Bayesian approach to global optimization, volume 37 of Mathematics and Its Applications. Springer, Dordrecht, 1989.
- Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. MIT Press, Cambridge MA, 2006.
- Jens Schreiter, Duy Nguyen-Tuong, Mona Eberts, Bastian Bischoff, Heiner Markert, and Marc Toussaint. Safe exploration for active learning with Gaussian processes. In Machine Learning and Knowledge Discovery in Databases, number 9286, pages 133–149. Springer International Publishing, 2015.
- Yanan Sui, Alkis Gotovos, Joel W. Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In Proc. of the International Conference on Machine Learning (ICML), pages 997–1005, 2015.
- Felix Berkenkamp, Angela P. Schoellig, and Andreas Krause. Safe controller optimization for quadrotors with Gaussian processes. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pages 493–496, 2016.
- J. Garcia and F. Fernandez. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, pages 515–564, 2012.
- Alexander Hans, Daniel Schneegaß, Anton Maximilian Schäfer, and Steffen Udluft. Safe exploration for reinforcement learning. In Proc. of the European Symposium on Artificial Neural Networks (ESANN), pages 143–148, 2008.
- Theodore J. Perkins and Andrew G. Barto. Lyapunov design for safe reinforcement learning. The Journal of Machine Learning Research, 3:803–832, 2003.
- Dorsa Sadigh and Ashish Kapoor. Safe control under uncertainty with Probabilistic Signal Temporal Logic. In Proc. of Robotics: Science and Systems, 2016.
- Chris J. Ostafew, Angela P. Schoellig, and Timothy D. Barfoot. Robust constrained learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research (IJRR), 35(13):1547–1563, 2016.
- Anil Aswani, Humberto Gonzalez, S. Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.
- Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie N. Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. Reachability-based safe learning with Gaussian processes. In Proc. of the IEEE Conference on Decision and Control (CDC), pages 1424–1431, 2014.
- Ruxandra Bobiti and Mircea Lazar. A sampling approach to finding Lyapunov functions for nonlinear discrete-time systems. In Proc. of the European Control Conference (ECC), pages 561–566, 2016.
- Felix Berkenkamp, Riccardo Moriconi, Angela P. Schoellig, and Andreas Krause. Safe learning of regions of attraction in nonlinear systems with Gaussian processes. In Proc. of the Conference on Decision and Control (CDC), pages 4661–4666, 2016.
- Julia Vinogradska, Bastian Bischoff, Duy Nguyen-Tuong, Henner Schmidt, Anne Romer, and Jan Peters. Stability of controllers for Gaussian process forward models. In Proceedings of the International Conference on Machine Learning (ICML), pages 545–554, 2016.
- Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Proc. of the International Conference on Learning Representations (ICLR), 2014.
- Huijuan Li and Lars Grüne. Computation of local ISS Lyapunov functions for discrete-time systems via linear programming. Journal of Mathematical Analysis and Applications, 438(2):701–719, 2016.
- Peter Giesl and Sigurdur Hafstein. Review on computational methods for Lyapunov functions. Discrete and Continuous Dynamical Systems, Series B, 20(8):2291–2337, 2015.
- Bernhard Schölkopf. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning. MIT Press, Cambridge, Mass, 2002.
- Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. arXiv preprint arXiv:1704.00445, 2017.
- Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
- Warren B. Powell. Approximate dynamic programming: solving the curses of dimensionality. John Wiley & Sons, 2007.
- Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467 [cs], 2016.
- Alexander G. de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. GPflow: a Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.
- Scott Davies. Multidimensional triangulation and interpolation for reinforcement learning. In Proc. of the Conference on Neural Information Processing Systems (NIPS), pages 1005–1011, 1996.
- Andreas Christmann and Ingo Steinwart. Support Vector Machines. Information Science and Statistics. Springer, New York, NY, 2008.
