# Safe Reinforcement Learning via Curriculum Induction

NeurIPS, 2020.

Abstract:

In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly. In such settings, the agent needs to behave safely not only after but also while learning. To achieve this, existing safe reinforcement learning methods make an agent rely on priors that let it avoid dangerous situations […]

Introduction

- Safety is a major concern that prevents the application of reinforcement learning (RL) [42] to many practical problems [14].
- The authors view a learning agent, which they call a student, as performing constrained RL.
- This framework has been strongly advocated as a promising path to RL safety [37]; it expresses safety requirements in terms of an a priori unknown set of feasible safe policies over which the student should optimize.

Highlights

- Safety is a major concern that prevents application of reinforcement learning (RL) [42] to many practical problems [14]
- We propose Curriculum Induction for Safe Reinforcement learning (CISR, “Caesar”), a safe RL approach that lifts several prohibitive assumptions of existing ones
- We view a learning agent, which we will call a student, as performing constrained RL. This framework has been strongly advocated as a promising path to RL safety [37], and expresses safety requirements in terms of an a priori unknown set of feasible safe policies that the student should optimize over. This feasible policy set is often described by a constrained Markov decision process (CMDP) [4]
- Soft reset 1 (SR1) and soft reset 2 (SR2) allow the student to learn about the goal without incurring failures, thanks to their reset distribution, which is more forgiving than hard reset (HR)’s
- We introduce CISR, a novel framework for safe RL that avoids many of the impractical assumptions common in the safe RL literature
- We introduce curricula inspired by human learning for safe training and deployment of RL agents and a principled way to optimize them
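The constrained-RL view above is usually formalized as a constrained Markov decision process (CMDP) [4]. As a sketch in standard CMDP notation (the symbols are the conventional ones, not taken from this page), the student's problem is:

```latex
\max_{\pi}\; \mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, d(s_t, a_t)\Big] \le \kappa
```

where \(r\) is the reward, \(d\) a constraint cost (e.g., an indicator of entering a failure state), and \(\kappa\) a safety threshold; the feasible safe policy set is exactly the set of policies \(\pi\) satisfying the constraint.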

Methods

- The authors present experiments where CISR efficiently and safely trains deep RL agents in two environments from OpenAI Gym [8]: Frozen Lake and Lunar Lander.
- While Frozen Lake has simple dynamics, it demonstrates how safety exacerbates the difficult problem of exploration in goal-oriented environments.
- The authors compare students trained with a curriculum optimized by CISR to students trained with trivial or no curricula in terms of safety and sample efficiency.
- For a detailed overview of the hyperparameters and the environments, see Appendices A and B
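The training protocol described above can be sketched as a small teacher–student loop. This is a toy illustration with hypothetical names and made-up dynamics (not the authors' code): a teacher picks an intervention for each interaction unit, the student trains under it, and the student is finally evaluated in the original environment.

```python
# Toy sketch of CISR-style curriculum training (hypothetical names, toy dynamics).

INTERVENTIONS = ("SR1", "SR2", "HR")  # soft reset 1/2, hard reset

class ToyStudent:
    def __init__(self):
        self.skill = 0.0  # stand-in for policy quality

    def train(self, intervention):
        # Toy model: softer resets yield faster early progress.
        gain = {"SR1": 0.3, "SR2": 0.25, "HR": 0.1}[intervention]
        self.skill = min(1.0, self.skill + gain)

    def evaluate(self):
        return self.skill  # stand-in for success rate in the original env

def run_curriculum(policy, n_units=5):
    """Train one student; `policy` maps a unit index to an intervention name."""
    student = ToyStudent()
    for t in range(n_units):
        student.train(policy(t))
    return student.evaluate()

# An Optimized-style curriculum: soft resets first, hard reset later.
switching = lambda t: "SR1" if t < 3 else "HR"
print(run_curriculum(switching))       # soft-then-hard curriculum
print(run_curriculum(lambda t: "HR"))  # hard reset throughout
```

In this toy model the switching curriculum reaches a higher final score than the hard-reset-only one, mirroring the paper's qualitative finding that the optimized curriculum combines the strengths of the individual interventions.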

Results

- SR1 and SR2 allow the student to learn about the goal without incurring failures, thanks to their reset distribution, which is more forgiving than HR’s
- However, they result in performance plateaus, because the consequences of a mistake in the original environment are quite different from those encountered during training.
- The Optimized curriculum retains the best of both worlds: it initially proposes a soft reset intervention that allows the agent to reach the goal, and subsequently switches to the hard reset so that the training environment is more similar to the original one.
- The absence of the teacher results in three orders of magnitude more training failures
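The two reset styles compared above can be sketched as follows (a hypothetical interface, not the authors' code): on failure, a hard reset (HR) sends the student back to the initial state, while a soft reset (SR) restores the last safe state, a more forgiving reset distribution.

```python
# Minimal sketch of hard vs. soft resets on a 1-D toy state (hypothetical interface).

class ResetWrapper:
    def __init__(self, mode="HR", init_state=0):
        assert mode in ("HR", "SR")
        self.mode = mode
        self.init_state = init_state
        self.state = init_state
        self.last_safe = init_state

    def step(self, delta, failed=False):
        if failed:
            # HR: back to the start; SR: back to the last safe state.
            self.state = self.init_state if self.mode == "HR" else self.last_safe
        else:
            self.state += delta
            self.last_safe = self.state
        return self.state

hr, sr = ResetWrapper("HR"), ResetWrapper("SR")
for env in (hr, sr):
    env.step(3)               # make progress toward the goal
    env.step(1, failed=True)  # a mistake triggers the reset
print(hr.state, sr.state)     # HR loses all progress; SR keeps it
```

This also illustrates the plateau mechanism noted above: an agent trained only under SR never experiences the full cost of a mistake, so its training environment differs from the original one.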

Conclusion

- The authors introduce CISR, a novel framework for safe RL that avoids many of the impractical assumptions common in the safe RL literature.
- The authors introduce curricula inspired by human learning for safe training and deployment of RL agents and a principled way to optimize them.
- The authors show that training under such optimized curricula results in performance comparable or superior to training without them, while greatly improving safety

Summary

## Objectives:

- The authors aim to enable the student to learn a policy for CMDP M without violating any constraints in the process.

- Table1: Lunar Lander final performance summary. Noiseless, 2-layer student (left): the Narrow intervention helps exploration but results in a policy performance plateau; the Wide one slows down student learning by making exploration more challenging; and the Optimized teacher provides the best of both by switching between Narrow and Wide. Students that learn under the Optimized curriculum policy achieve performance comparable to those training under No-intervention, but suffer three orders of magnitude fewer training failures. Noisy student (center), 1-layer student (right): the results are similar when the curriculum optimized for students with noiseless observations and a 2-layer MLP policy is used for students with noisy sensors (center) or a 1-layer architecture (right), showing that teaching policies can be transferred across classes of students. Table 1 (left) shows, for each curriculum policy, the mean of the students’ final return, success rate, and failure rate in the original environment, and the average number of failures during training. The Narrow intervention makes exploration less challenging but prevents the students from experiencing large portions of the state space; thus, it results in fast training that plateaus at low success rates. On the contrary, the Wide intervention makes exploration more complex but is more similar to the original environment; therefore, it results in slow learning that cannot achieve high success rates within Ns interaction units. Optimized retains the best of both worlds by initially using the Narrow intervention to speed up learning and subsequently switching to the Wide one. In No-interv., exploration is easier due to the absence of the teacher, whose interventions can make it hard for the students to experience a natural ending of the episode. Therefore, No-interv. attains performance comparable to Optimized; however, the absence of the teacher results in three orders of magnitude more training failures
- Table2: Student’s hyperparameters for the Frozen lake environment
- Table3: Student’s hyperparameters for the Lunar Lander environment
- Table4: Mean and variance of the Gamma hyperpriors for the teacher’s hyperparameters for the
- Table5: Table 5
- Table6: Final deployment performance in Frozen Lake with confidence intervals obtained by training and evaluating the teachers with three different random seeds. The students trained with the optimized curriculum outperform both naive curricula and training in the original environment in terms of success rate and return. All the agents supervised by a teacher are safe during training. In contrast, training directly in the original environment results in many failures. These results are consistent across random seeds, thus showing the robustness of CISR
- Table7: Lunar Lander final deployment performance summary for three different kinds of students, with confidence intervals obtained by training and evaluating the teachers with three different random seeds. Noiseless, 2-layer student (top): the Narrow intervention helps exploration but results in a policy performance plateau; the Wide one slows down student learning by making exploration more challenging; and the Optimized teacher provides the best of both by switching between Narrow and Wide. Students that learn under the Optimized curriculum policy achieve performance comparable to those training under No-intervention, but suffer three orders of magnitude fewer training failures. Noisy student (center), 1-layer student (bottom): the results are similar when the curriculum optimized for students with noiseless observations and a 2-layer MLP policy is used for students with noisy sensors (center) or a 1-layer architecture (bottom), showing that teaching policies can be transferred across classes of students. These results are consistent across random seeds, showing the robustness of CISR

Related work

- CISR is a form of curriculum learning (CL) [36]. CL and learning from demonstration (LfD) [13] are two established classes of approaches that rely on a teacher as an aid in training a decision-making agent, but CISR differs from both. In LfD, a teacher provides demonstrations of a good policy for the task at hand, and the student uses them to learn its own policy by behavior cloning [34], online imitation [32], or apprenticeship learning [1]. In contrast, CISR does not assume that the teacher has a policy for the student’s task at all: e.g., a teacher doesn’t need to know how to ride a bike in order to help a child learn to do it. CL generally relies on a teacher to structure the learning process. A range of works [31, 20, 19, 38, 48, 35, 47] explore ways of building a curriculum by modifying the learning environment. CISR is closer to Graves et al. [22], which uses a fixed set of environments for the student and also uses a bandit algorithm for the teacher. CISR’s major differences from existing CL work are that (1) it is the first approach, to our knowledge, that uses CL for ensuring safety, and (2) it uses multiple students for training the teacher, which allows it to induce curricula in a more data-driven, as opposed to heuristic, way. With regard to safe RL, in addition to the literature mentioned at the beginning, one work that considers the same training and test safety constraints as ours is Le et al. [28], which proposes a solver for the student’s CMDP. In that work, the student avoids potentially unsafe environment interaction altogether by learning solely from batch data, which places strong assumptions on the MDP dynamics and data collection policy that are neither verifiable nor easily satisfied in practice [39, 9, 3]. We use the same solver, but in an online setting.

Funding

- This work was supported by the Max Planck ETH Center for Learning Systems

Study subjects and analysis

students: 30

HR has zero tolerance (τ = 0) and resets the student to the initial state. We compare five different teaching policies: (i) No-intervention, where students learn in the original environment; (ii–iv) single-intervention, where students learn under one of the interventions fixed for the entire learning duration; (v) Optimized, where we use a curriculum policy optimized with CISR over 30 students. We let each of these curriculum policies train 10 students.

students: 10

We let each of these curriculum policies train 10 students. For analysis purposes, we periodically freeze the students’ policies and evaluate them in the original environment.
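The freeze-and-evaluate protocol above can be sketched in a few lines. This is a toy illustration with hypothetical names (not the authors' code): every evaluation period, a snapshot of the student's policy is frozen and scored in the original environment, without interrupting training.

```python
import copy

# Toy sketch of periodic policy freezing and evaluation (hypothetical names).

class Student:
    def __init__(self):
        self.policy = {"skill": 0.0}

    def train_unit(self):
        self.policy["skill"] += 0.2  # toy learning progress

def evaluate(policy):
    # Stand-in for rolling out the frozen policy in the original environment.
    return policy["skill"]

student = Student()
eval_curve = []
for unit in range(1, 6):
    student.train_unit()
    if unit % 2 == 0:                           # evaluation period
        frozen = copy.deepcopy(student.policy)  # freeze a snapshot
        eval_curve.append(evaluate(frozen))
print(eval_curve)
```

Deep-copying the snapshot matters: evaluating a reference to the live policy would let subsequent training units change the already-recorded evaluation point.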

students and teachers: 10

The Optimized curriculum retains the best of both worlds by initially proposing a soft reset intervention that allows the agent to reach the goal and subsequently switching to the hard reset so that the training environment is more similar to the original one. Table 6 in Appendix B shows the confidence intervals of mean performance across 10 students and teachers trained on 3 seeds, indicating CISR’s robustness.

students: 10

Similarly to the Frozen Lake evaluation, we compare four curriculum policies: (i) No-intervention, (ii-iii) single-intervention and (iv) Optimized. We let each policy train 10 students and we compare their final performance in the original Lunar Lander. Moreover, we use the Optimized curriculum to train different students than those it was optimized for, thus showing the transferability of curricula

Reference

- Abbeel, P. and Ng, A. (2004). Apprenticeship learning via inverse reinforcement learning. In ICML.
- Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 22–31. JMLR.org.
- Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
- Altman, E. (1999). Constrained Markov decision processes, volume 7. CRC Press.
- The GPyOpt authors (2016). GPyOpt: A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt.
- Berkenkamp, F., Schoellig, A. P., and Krause, A. (2016). Safe controller optimization for quadrotors with Gaussian processes. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pages 493–496.
- Berkenkamp, F., Turchetta, M., Schoellig, A. P., and Krause, A. (2017). Safe model-based reinforcement learning with stability guarantees. In NIPS.
- Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
- Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. ICML.
- Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. (2018). A lyapunov-based approach to safe reinforcement learning. In Advances in neural information processing systems, pages 8092–8101.
- Chow, Y., Nachum, O., Faust, A., Duenez-Guzman, E., and Ghavamzadeh, M. (2019). Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031.
- Clement, B., Roy, D., Oudeyer, P.-Y., and Lopes, M. (2015). Multi-armed bandits for intelligent tutoring systems. Journal of Educational Data Mining, 7.
- Argall, B. D., Chernova, S., Veloso, M., and Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57.
- Dulac-Arnold, G., Mankowitz, D., and Hester, T. (2019). Challenges of real-world reinforcement learning.
- El Chamie, M., Yu, Y., and Açıkmese, B. (2016). Convex synthesis of randomized policies for controlled Markov chains with density safety upper bound constraints. In 2016 American Control Conference (ACC), pages 6290–6295. IEEE.
- Eysenbach, B., Gu, S., Ibarz, J., and Levine, S. (2018). Leave no trace: Learning to reset for safe and autonomous reinforcement learning. In ICLR.
- Fachantidis, A., Partalas, I., Tsoumakas, G., and Vlahavas, I. (2013). Transferring task models in reinforcement learning agents. Neurocomputing, 107:23–32.
- Fernández, F., García, J., and Veloso, M. (2010). Probabilistic policy reuse for inter-task transfer learning. Robotics and Autonomous Systems, 58(7):866–871.
- Florensa, C., Held, D., Geng, X., and Abbeel, P. (2018). Automatic goal generation for reinforcement learning agents. In ICML.
- Florensa, C., Held, D., Wulfmeier, M., Zhang, M., and Abbeel, P. (2017). Reverse curriculum generation for reinforcement learning. In CoRL.
- Garcıa, J. and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480.
- Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. (2017). Automated curriculum learning for neural networks. In ICML.
- Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. (2018). Stable baselines. https://github.com/hill-a/stable-baselines.
- Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J.-M., Lam, V.-D., Bewley, A., and Shah, A. (2019). Learning to drive in a day. In ICRA.
- Kivinen, J. and Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63.
- Koller, T., Berkenkamp, F., Turchetta, M., and Krause, A. (2018). Learning-based model predictive control for safe exploration. In 2018 IEEE Conference on Decision and Control (CDC), pages 6059–6066. IEEE.
- Lazaric, A. and Restelli, M. (2011). Transfer from multiple mdps. In Advances in Neural Information Processing Systems, pages 1746–1754.
- Le, H. M., Voloshin, C., and Yue, Y. (2019). Batch policy learning under constraints. arXiv preprint arXiv:1903.08738.
- Matiisen, T., Oliver, A., Cohen, T., and Schulman, J. (2019). Teacher-student curriculum learning. In IEEE Transactions on Neural Networks and Learning Systems.
- Mockus, J., Tiesis, V., and Zilinskas, A. (1978). The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117-129):2.
- Narvekar, S., Sinapov, J., Leonetti, M., and Stone, P. (2016). Source task creation for curriculum learning. In AAMAS.
- Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., and Peters, J. (2018). An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179.
- Pan, Y., Cheng, C.-A., Saigol, K., Lee, K., Yan, X., Theodorou, E., and Boots, B. (2018). Agile autonomous driving using end-to-end deep imitation learning. In RSS.
- Pomerleau, D. (1989). ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems.
- Portelas, R., Colas, C., Hofmann, K., and Oudeyer, P.-Y. (2019). Teacher algorithms for curriculum learning of deep RL in continuously parameterized environments. In CoRL.
- Portelas, R., Colas, C., Weng, L., Hofmann, K., and Oudeyer, P.-Y. (2020). Automatic curriculum learning for deep RL: A short survey.
- Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking safe exploration in deep reinforcement learning.
- Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. (2018). Learning by playing – solving sparse reward tasks from scratch. In ICML.
- Scherrer, B. (2014). Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pages 1314–1322.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML.
- Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
- Taylor, M. E. and Stone, P. (2005). Behavior transfer for value-function-based reinforcement learning. In Proceedings of the fourth international joint conference on Autonomous agents and multiagent systems, pages 53–59.
- Turchetta, M., Berkenkamp, F., and Krause, A. (2016). Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems, pages 4312–4320.
- Turchetta, M., Berkenkamp, F., and Krause, A. (2019). Safe exploration for interactive machine learning. In Advances in Neural Information Processing Systems, pages 2887–2897.
- Vanschoren, J. (2018). Meta-learning: A survey. arXiv preprint arXiv:1810.03548.
- Wang, R., Lehman, J., Clune, J., and Stanley, K. O. (2019). Paired open-ended trailblazer (POET): endlessly generating increasingly complex and diverse learning environments and their solutions. CoRR, abs/1901.01753.
- Wu, Y. and Tian, Y. (2017). Training agent for first-person shooter game with actor-critic curriculum learning. In ICLR.
