Safe Model-based Reinforcement Learning with Stability Guarantees

Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.


Abstract:

Reinforcement learning is a powerful paradigm for learning optimal policies from experimental data. However, to find optimal policies, most reinforcement learning algorithms explore all possible actions, which may be harmful for real-world systems. As a consequence, learning algorithms are rarely applied on safety-critical systems in the real world. …

Introduction
  • While reinforcement learning (RL, [1]) algorithms have achieved impressive results in games, for example on the Atari platform [2], they are rarely applied to real-world physical systems outside of academia.
  • The authors define the safety constraint in terms of the state divergence that occurs when the system leaves the region of attraction.
  • This means that adapting the policy must not decrease the region of attraction, and exploratory actions taken to learn about the dynamics f(·) must not drive the system outside the region of attraction.
  • To expand the safe set, the authors need to generalize learned knowledge about the dynamics to states that have not been visited.
  • To this end, the authors restrict themselves to the general and practically relevant class of models that are Lipschitz continuous.
  • The considered control policies π lie in a set ΠL of functions that are Lπ-Lipschitz continuous with respect to the 1-norm (a small numerical check of this condition is sketched below).
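As a rough illustration of this policy class (not taken from the paper's code), the sketch below numerically checks the Lipschitz condition |π(x) − π(x′)| ≤ Lπ‖x − x′‖₁ for a hypothetical saturated linear policy on random state pairs; the feedback gain, the constant Lπ, and the state range are placeholder values chosen only for the example.

```python
import numpy as np

# Hypothetical check of |pi(x) - pi(x')| <= L_pi * ||x - x'||_1 on random state pairs.
L_pi = 2.0                     # assumed Lipschitz constant of the policy class
K = np.array([[1.5, 0.5]])     # illustrative linear feedback gain, u = clip(-K x)

def policy(x):
    """Saturated linear policy; both the gain and the saturation bound are placeholders."""
    return np.clip(-K @ x, -1.0, 1.0)

rng = np.random.default_rng(0)
violations = 0
for _ in range(10_000):
    x1, x2 = rng.uniform(-1.0, 1.0, size=(2, 2, 1))  # two random states in [-1, 1]^2
    lhs = np.abs(policy(x1) - policy(x2)).max()      # change in the control action
    rhs = L_pi * np.abs(x1 - x2).sum()               # L_pi times the 1-norm distance
    violations += int(lhs > rhs)

print(f"Lipschitz condition violated on {violations} of 10000 sampled pairs")
```

For this particular choice the check never fails, since a saturated linear map with gain row (1.5, 0.5) has Lipschitz constant 1.5 ≤ Lπ with respect to the 1-norm.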
Highlights
  • While reinforcement learning (RL, [1]) algorithms have achieved impressive results in games, for example on the Atari platform [2], they are rarely applied to real-world physical systems outside of academia.
  • We need to specify a control policy π : X → U that, given the current state, determines the appropriate control action that drives the system to some goal state, which we set as the origin without loss of generality [4].
  • We define the safety constraint on the state divergence that occurs when leaving the region of attraction. This means that adapting the policy is not allowed to decrease the region of attraction, and exploratory actions to learn about the dynamics f(·) are not allowed to drive the system outside the region of attraction (a toy level-set estimate of such a region is sketched after this list).
  • In a discontinuous system, even a slight change in the control policy can lead to drastically different behavior.
  • We have shown how classical reinforcement learning can be combined with safety constraints in terms of stability.
  • We believe that our results present an important first step towards safe reinforcement learning algorithms that are applicable to real-world problems.
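To make the region-of-attraction constraint concrete, here is a minimal toy sketch that is independent of the paper's implementation: for a hypothetical one-dimensional closed-loop system with known dynamics and a quadratic Lyapunov candidate, it checks the one-step decrease condition v(f(x)) < v(x) on a grid and keeps the largest level set on which no grid point violates it. The paper's actual method additionally handles model uncertainty through GP confidence intervals and bounds the discretization error via Lipschitz constants, neither of which appears here.

```python
import numpy as np

# Hypothetical 1-D closed-loop system x' = x + tau * (-x + x**3) and Lyapunov
# candidate v(x) = x**2; both are placeholders chosen only for illustration.
tau = 0.1

def f(x):
    return x + tau * (-x + x**3)   # one-step dynamics (Euler-style update)

def v(x):
    return x**2                    # quadratic Lyapunov candidate

xs = np.linspace(-2.0, 2.0, 401)   # grid over the state space
levels = v(xs)
decreases = v(f(xs)) < levels      # one-step decrease condition
away_from_origin = levels > 1e-9   # the equilibrium itself is trivially fine

# Largest level c such that every grid point inside {x : v(x) <= c} decreases.
violating = levels[away_from_origin & ~decreases]
c = levels.max() if violating.size == 0 else violating.min() - 1e-9
print(f"Estimated region of attraction: v(x) <= {c:.2f}, i.e. |x| < {np.sqrt(c):.2f}")
```

For this toy system the decrease condition holds exactly for |x| < 1, so the printed level set recovers that interval up to the grid resolution.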
Methods
  • A Python implementation of Algorithm 1 and the experiments, based on TensorFlow [37] and GPflow [38], is available at https://github.com/befelix/safe_learning.
  • The authors verify the approach on an inverted pendulum benchmark problem.
  • The true, continuous-time dynamics are given by ml²ψ̈ = gml sin(ψ) − λψ̇ + u, where ψ is the angle, m the mass, l the length, g the gravitational constant, λ the friction coefficient, and u the torque applied to the pendulum.
  • The authors use a GP model for the discrete-time dynamics, where the mean dynamics are given by a linearized and discretized model of the true dynamics that assumes a wrong, lower mass and neglects friction.
  • The authors use a combination of linear and Matérn kernels to capture the model errors that result from parameter and integration errors (a schematic reconstruction of this setup is sketched below).
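The following sketch, which is not the authors' code, reconstructs the described setup with made-up constants: the true pendulum dynamics are integrated with a simple Euler step, and the prior mean model is the linearized, frictionless version with a deliberately lower mass. The GP with the linear + Matérn kernel (built with GPflow in the authors' implementation) then only needs to capture the residual between the two models.

```python
import numpy as np

# Placeholder physical parameters; the paper's actual values may differ.
m, l, g, lam = 0.15, 0.5, 9.81, 0.02   # true mass, length, gravity, friction
m_wrong = 0.10                          # deliberately lower mass for the mean model
dt = 0.01                               # discretization time step

def true_step(psi, dpsi, u):
    """Euler discretization of m l^2 psi'' = g m l sin(psi) - lam psi' + u."""
    ddpsi = (g * m * l * np.sin(psi) - lam * dpsi + u) / (m * l**2)
    return psi + dt * dpsi, dpsi + dt * ddpsi

def mean_model_step(psi, dpsi, u):
    """Linearized (sin(psi) ~ psi), frictionless model with the wrong, lower mass."""
    ddpsi = (g * m_wrong * l * psi + u) / (m_wrong * l**2)
    return psi + dt * dpsi, dpsi + dt * ddpsi

# The one-step residual is the model error the GP (linear + Matern kernel) must learn.
psi, dpsi, u = 0.3, 0.0, -0.5           # an arbitrary state-action pair
residual = np.subtract(true_step(psi, dpsi, u), mean_model_step(psi, dpsi, u))
print("one-step model error (psi, dpsi):", residual)
```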
Conclusion
  • The authors have shown how classical reinforcement learning can be combined with safety constraints in terms of stability.
  • The authors showed how to safely optimize policies and give stability certificates based on statistical models of the dynamics.
  • The authors provided theoretical safety and exploration guarantees for an algorithm that can drive the system to desired state-action pairs during learning.
  • The authors believe that the results present an important first step towards safe reinforcement learning algorithms that are applicable to real-world problems.
Funding
  • This research was supported by SNSF grant 200020_159557, the Max Planck ETH Center for Learning Systems, NSERC grant RGPIN-2014-04634, and the Ontario Early Researcher Award.
References
  • Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. MIT Press, 1998.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv:1606.06565 [cs], 2016.
  • Hassan K. Khalil and J. W. Grizzle. Nonlinear systems, volume 3. Prentice Hall, 1996.
  • Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning – an overview. In Modelling and Simulation for Autonomous Systems, pages 357–375.
  • Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research (JMLR), 16:1437–1480, 2015.
  • Stefano P. Coraluppi and Steven I. Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica, 35(2):301–309, 1999.
  • Peter Geibel and Fritz Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research (JAIR), 24:81–108, 2005.
  • Aviv Tamar, Shie Mannor, and Huan Xu. Scaling up robust MDPs by reinforcement learning. In Proc. of the International Conference on Machine Learning (ICML), 2014.
  • Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2012.
  • Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. In Proc. of the International Conference on Machine Learning (ICML), pages 1711–1718, 2012.
  • Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Proc. of the Conference on Neural Information Processing Systems (NIPS), pages 4305–4313, 2016.
  • Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2219–2225, 2006.
  • Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proc. of the International Conference on Machine Learning (ICML), 2017.
  • Jonas Mockus. Bayesian approach to global optimization, volume 37 of Mathematics and Its Applications. Springer, Dordrecht, 1989.
  • Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. MIT Press, Cambridge, MA, 2006.
  • Jens Schreiter, Duy Nguyen-Tuong, Mona Eberts, Bastian Bischoff, Heiner Markert, and Marc Toussaint. Safe exploration for active learning with Gaussian processes. In Machine Learning and Knowledge Discovery in Databases, number 9286, pages 133–149. Springer International Publishing, 2015.
  • Yanan Sui, Alkis Gotovos, Joel W. Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In Proc. of the International Conference on Machine Learning (ICML), pages 997–1005, 2015.
  • Felix Berkenkamp, Angela P. Schoellig, and Andreas Krause. Safe controller optimization for quadrotors with Gaussian processes. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pages 493–496, 2016.
  • Javier García and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, pages 515–564, 2012.
  • Alexander Hans, Daniel Schneegaß, Anton Maximilian Schäfer, and Steffen Udluft. Safe exploration for reinforcement learning. In Proc. of the European Symposium on Artificial Neural Networks (ESANN), pages 143–148, 2008.
  • Theodore J. Perkins and Andrew G. Barto. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research (JMLR), 3:803–832, 2003.
  • Dorsa Sadigh and Ashish Kapoor. Safe control under uncertainty with Probabilistic Signal Temporal Logic. In Proc. of Robotics: Science and Systems, 2016.
  • Chris J. Ostafew, Angela P. Schoellig, and Timothy D. Barfoot. Robust constrained learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research (IJRR), 35(13):1547–1563, 2016.
  • Anil Aswani, Humberto Gonzalez, S. Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.
  • Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie N. Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. Reachability-based safe learning with Gaussian processes. In Proc. of the IEEE Conference on Decision and Control (CDC), pages 1424–1431, 2014.
  • Ruxandra Bobiti and Mircea Lazar. A sampling approach to finding Lyapunov functions for nonlinear discrete-time systems. In Proc. of the European Control Conference (ECC), pages 561–566, 2016.
  • Felix Berkenkamp, Riccardo Moriconi, Angela P. Schoellig, and Andreas Krause. Safe learning of regions of attraction in nonlinear systems with Gaussian processes. In Proc. of the Conference on Decision and Control (CDC), pages 4661–4666, 2016.
  • Julia Vinogradska, Bastian Bischoff, Duy Nguyen-Tuong, Henner Schmidt, Anne Romer, and Jan Peters. Stability of controllers for Gaussian process forward models. In Proc. of the International Conference on Machine Learning (ICML), pages 545–554, 2016.
  • Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Proc. of the International Conference on Learning Representations (ICLR), 2014.
  • Huijuan Li and Lars Grüne. Computation of local ISS Lyapunov functions for discrete-time systems via linear programming. Journal of Mathematical Analysis and Applications, 438(2):701–719, 2016.
  • Peter Giesl and Sigurdur Hafstein. Review on computational methods for Lyapunov functions. Discrete and Continuous Dynamical Systems, Series B, 20(8):2291–2337, 2015.
  • Bernhard Schölkopf and Alexander J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2002.
  • Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. arXiv preprint arXiv:1704.00445, 2017.
  • Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
  • Warren B. Powell. Approximate dynamic programming: solving the curses of dimensionality. John Wiley & Sons, 2007.
  • Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467 [cs], 2016.
  • Alexander G. de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. GPflow: a Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.
  • Scott Davies. Multidimensional triangulation and interpolation for reinforcement learning. In Proc. of the Conference on Neural Information Processing Systems (NIPS), pages 1005–1011, 1996.
  • Andreas Christmann and Ingo Steinwart. Support vector machines. Information Science and Statistics. Springer, New York, NY, 2008.