TL;DR
We introduce CISR, a novel framework for safe reinforcement learning that avoids many of the impractical assumptions common in the safe RL literature.

Safe Reinforcement Learning via Curriculum Induction

NeurIPS 2020 (2020)


Abstract

In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly. In such settings, the agent needs to behave safely not only after but also while learning. To achieve this, existing safe reinforcement learning methods make an agent rely on priors that let it avoid dangerous situations […]

Introduction
  • Safety is a major concern that prevents application of reinforcement learning (RL) [42] to many practical problems [14].
  • The authors view a learning agent, which they call a student, as performing constrained RL.
  • This framework has been strongly advocated as a promising path to RL safety [37], and expresses safety requirements in terms of an a priori unknown set of feasible safe policies that the student should optimize over.
Highlights
  • Safety is a major concern that prevents application of reinforcement learning (RL) [42] to many practical problems [14]
  • We propose Curriculum Induction for Safe Reinforcement learning (CISR, “Caesar”), a safe RL approach that lifts several prohibitive assumptions of existing ones
  • We view a learning agent, which we will call a student, as performing constrained RL. This framework has been strongly advocated as a promising path to RL safety [37], and expresses safety requirements in terms of an a priori unknown set of feasible safe policies that the student should optimize over. This feasible policy set is often described by a constrained Markov decision process (CMDP) [4]; a generic CMDP formulation is sketched after this list
  • Soft reset 1 (SR1) and soft reset 2 (SR2) allow the student to learn about the goal without incurring failures thanks to their reset distribution, which is more forgiving than hard reset (HR)'s
  • We introduce CISR, a novel framework for safe RL that avoids many of the impractical assumptions common in the safe RL literature
  • We introduce curricula inspired by human learning for safe training and deployment of RL agents and a principled way to optimize them
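For reference, the constrained-RL setting mentioned above can be written as a standard CMDP objective. The notation below (reward r, constraint costs d_i, thresholds κ_i) is generic and not necessarily the exact symbols used in the paper:

    \[
    \begin{aligned}
    \max_{\pi \in \Pi}\quad & \mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{T} r(s_t, a_t)\Big] \\
    \text{subject to}\quad & \mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{T} d_i(s_t, a_t)\Big] \le \kappa_i, \qquad i = 1, \dots, m.
    \end{aligned}
    \]

The a priori unknown feasible policy set referred to in the bullets above is exactly the set of policies satisfying these constraints.
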
Methods
  • The authors present experiments where CISR efficiently and safely trains deep RL agents in two environments: the Frozen Lake and the Lunar Lander environments from OpenAI Gym [8] (a minimal sketch of a teacher-intervention wrapper follows this list).
  • While Frozen Lake has simple dynamics, it demonstrates how safety exacerbates the difficult problem of exploration in goal-oriented environments.
  • The authors compare students trained with a curriculum optimized by CISR to students trained with trivial or no curricula in terms of safety and sample efficiency.
  • For a detailed overview of the hyperparameters and the environments, see Appendices A and B
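As a rough illustration of how such teacher-supervised training can be realized on top of a Gym environment, the sketch below wraps an environment so that entering a danger set triggers a reset from a more forgiving state distribution. This is a minimal sketch under assumptions: the in_danger_set, sample_reset_state, and set_state names are hypothetical, not the authors' implementation or part of the standard Gym API.

    import gym

    class TeacherInterventionWrapper(gym.Wrapper):
        """Hypothetical sketch of a teacher intervention: when the student
        enters a danger (trigger) set, reset it from a forgiving state
        distribution instead of letting the episode end in a failure."""

        def __init__(self, env, in_danger_set, sample_reset_state, tolerance=0):
            super().__init__(env)
            self.in_danger_set = in_danger_set            # obs -> bool (trigger set)
            self.sample_reset_state = sample_reset_state  # () -> state for the reset
            self.tolerance = tolerance                    # interventions allowed per episode
            self.num_interventions = 0

        def reset(self, **kwargs):
            self.num_interventions = 0
            return self.env.reset(**kwargs)

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            if self.in_danger_set(obs):
                if self.num_interventions < self.tolerance:
                    # Soft reset: continue the episode from a forgiving state.
                    self.num_interventions += 1
                    obs = self.env.set_state(self.sample_reset_state())  # assumed env hook
                    info["teacher_intervention"] = True
                else:
                    # Tolerance exhausted (e.g. HR with tolerance 0): end the
                    # episode safely and restart from the initial state.
                    done = True
                    info["teacher_intervention"] = True
            return obs, reward, done, info

In this reading, a hard-reset intervention (HR) corresponds to tolerance=0, while the soft resets (SR1, SR2) allow a few in-episode recoveries from a more forgiving reset distribution.
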
Results
  • SR1 and SR2 allow the student to learn about the goal without incurring failures thanks to their reset distribution, which is more forgiving than HR's
  • However, they result in performance plateaus, as the consequences of a mistake in the original environment are quite different from those encountered during training.
  • The Optimized curriculum retains the best of both worlds by initially proposing a soft reset intervention that allows the agent to reach the goal and subsequently switching to the hard reset such that the training environment is more similar to the original one (a schematic curriculum-switch rule is sketched after this list).
  • The absence of the teacher results in three orders of magnitude more training failures
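A curriculum policy of this "soft reset first, hard reset later" form can be written schematically as below. The switching criterion and threshold are assumptions for illustration, not the teacher policy actually learned by CISR:

    def optimized_style_curriculum(student_success_rate, switch_threshold=0.5):
        """Schematic switching rule: use a forgiving soft-reset intervention
        until the student reaches the goal reliably, then switch to hard
        reset so training conditions resemble the original environment."""
        if student_success_rate < switch_threshold:
            return "SR1"  # soft reset: resume from a forgiving state distribution
        return "HR"       # hard reset: restart from the initial state (tolerance 0)
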
Conclusion
  • The authors introduce CISR, a novel framework for safe RL that avoids many of the impractical assumptions common in the safe RL literature.
  • The authors introduce curricula inspired by human learning for safe training and deployment of RL agents and a principled way to optimize them (a toy optimization sketch follows this list).
  • The authors show how training under such optimized curricula results in performance comparable or superior to training without them, while greatly improving safety
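For illustration only, the "principled way to optimize them" can be approximated with an off-the-shelf Bayesian-optimization loop over a curriculum parameter. GPyOpt appears in the paper's reference list, but the single switch-point parameterization, the objective, and the train_students_with_switch_point helper below are our assumptions rather than the authors' actual teacher-optimization procedure:

    import numpy as np
    import GPyOpt

    def negative_mean_return(x):
        """Train a few students with a curriculum that switches from soft to
        hard resets at fraction x of the training budget, and return the
        negative mean final return (GPyOpt minimizes the objective)."""
        switch_point = float(x[0, 0])
        returns = train_students_with_switch_point(switch_point, n_students=3)  # hypothetical helper
        return -np.mean(returns)

    domain = [{"name": "switch_point", "type": "continuous", "domain": (0.0, 1.0)}]
    optimizer = GPyOpt.methods.BayesianOptimization(f=negative_mean_return, domain=domain)
    optimizer.run_optimization(max_iter=20)
    print("best switch point:", optimizer.x_opt, "mean return:", -optimizer.fx_opt)
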
Summary
  • Introduction:

    Safety is a major concern that prevents application of reinforcement learning (RL) [42] to many practical problems [14].
  • The authors view a learning agent, which they call a student, as performing constrained RL.
  • This framework has been strongly advocated as a promising path to RL safety [37], and expresses safety requirements in terms of an a priori unknown set of feasible safe policies that the student should optimize over.
  • Objectives:

    The authors aim to enable the student to learn a policy for CMDP M without violating any constraints in the process.
  • Methods:

    The authors present experiments where CISR efficiently and safely trains deep RL agents in two environments: the Frozen Lake and the Lunar Lander environments from OpenAI Gym [8].
  • While Frozen Lake has simple dynamics, it demonstrates how safety exacerbates the difficult problem of exploration in goal-oriented environments.
  • The authors compare students trained with a curriculum optimized by CISR to students trained with trivial or no curricula in terms of safety and sample efficiency.
  • For a detailed overview of the hyperparameters and the environments, see Appendices A and B
  • Results:

    SR1 and SR2 allow the student to learn about the goal without incurring failures thanks to their reset distribution, which is more forgiving than HR's
  • However, they result in performance plateaus, as the consequences of a mistake in the original environment are quite different from those encountered during training.
  • The Optimized curriculum retains the best of both worlds by initially proposing a soft reset intervention that allows the agent to reach the goal and subsequently switching to the hard reset such that the training environment is more similar to the original one.
  • The absence of the teacher results in three orders of magnitude more training failures
  • Conclusion:

    The authors introduce CISR, a novel framework for safe RL that avoids many of the impractical assumptions common in the safe RL literature.
  • The authors introduce curricula inspired by human learning for safe training and deployment of RL agents and a principled way to optimize them.
  • The authors show how training under such optimized curricula results in performance comparable or superior to training without them, while greatly improving safety
Tables
  • Table1: Lunar Lander final performance summary. Noiseless, 2-layer student (left): The Narrow intervention helps exploration but results in a policy performance plateau, the Wide one slows down student learning by making exploration more challenging, and the Optimized teacher provides the best of both by switching between Narrow and Wide. Students that learn under the Optimized curriculum policy achieve performance comparable to those trained under No-intervention, but suffer three orders of magnitude fewer training failures. Noisy student (center), one-layer student (right): The results are similar when we use the curriculum optimized for students with noiseless observations and a 2-layer MLP policy for students with noisy sensors (center) or a 1-layer architecture (right), thus showing that teaching policies can be transferred across classes of students. The left panel shows, for each curriculum policy, the mean of the students' final return, success rate and failure rate in the original environment and the average number of failures during training. The Narrow intervention makes exploration less challenging but prevents the students from experiencing big portions of the state space; thus, it results in fast training that plateaus at low success rates. On the contrary, the Wide intervention makes exploration more complex but is more similar to the original environment; therefore, it results in slow learning that cannot achieve high success rates within Ns interaction units. Optimized retains the best of both worlds by initially using the Narrow intervention to speed up learning and subsequently switching to the Wide one. In No-interv., exploration is easier because there is no teacher whose interventions can prevent the students from experiencing a natural ending of the episode; therefore, No-interv. attains performance comparable to Optimized. However, the absence of the teacher results in three orders of magnitude more training failures
  • Table2: Student's hyperparameters for the Frozen Lake environment
  • Table3: Student’s hyperparameters for the Lunar Lander environment
  • Table4: Mean and variance of the Gamma hyperpriors for the teacher’s hyperparameters for the
  • Table5: Table 5
  • Table6: Final deployment performance in Frozen Lake with confidence intervals obtained by training and evaluating the teachers with three different random seeds. The students trained with the optimized curriculum outperform both naive curricula and training in the original environment in terms of success rate and return. All the agents supervised by a teacher are safe during training. In contrast, training directly in the original environment results in many failures. These results are consistent across random seeds, thus showing the robustness of CISR
  • Table7: Lunar Lander final deployment performance summary for three different kinds of students with confidence intervals obtained by training and evaluating the teachers with three different random seeds. Noiseless, 2-layer student (top): The Narrow intervention helps exploration but results in a policy performance plateau, the Wide one slows down student learning by making exploration more challenging, and the Optimized teacher provides the best of both by switching between Narrow and Wide. Students that learn under the Optimized curriculum policy achieve performance comparable to those trained under No-intervention, but suffer three orders of magnitude fewer training failures. Noisy student (center), one-layer student (bottom): The results are similar when we use the curriculum optimized for students with noiseless observations and a 2-layer MLP policy for students with noisy sensors (center) or a 1-layer architecture (bottom), thus showing that teaching policies can be transferred across classes of students. These results are consistent across random seeds, thus showing the robustness of CISR
Related Work
  • CISR is a form of curriculum learning (CL) [36]. CL and learning from demonstration (LfD) [13] are two established classes of approaches that rely on a teacher to aid in training a decision-making agent, but CISR differs from both. In LfD, a teacher provides demonstrations of a good policy for the task at hand, and the student uses them to learn its own policy by behavior cloning [34], online imitation [32], or apprenticeship learning [1]. In contrast, CISR does not assume that the teacher has a policy for the student's task at all: e.g., a teacher does not need to know how to ride a bike in order to help a child learn to do it. CL generally relies on a teacher to structure the learning process. A range of works [31, 20, 19, 38, 48, 35, 47] explore ways of building a curriculum by modifying the learning environment. CISR is closest to Graves et al. [22], which uses a fixed set of environments for the student and also uses a bandit algorithm for the teacher. CISR's major differences from existing CL work are that (1) it is, to our knowledge, the first approach that uses CL for ensuring safety, and (2) it uses multiple students for training the teacher, which allows it to induce curricula in a more data-driven, as opposed to heuristic, way. With regard to safe RL, in addition to the literature mentioned at the beginning, one work that considers the same training and test safety constraints as ours is Le et al. [28], which proposes a solver for the student's CMDP. In that work, the student avoids potentially unsafe environment interaction altogether by learning solely from batch data, which places strong assumptions on the MDP dynamics and data-collection policy that are neither verifiable nor easily satisfied in practice [39, 9, 3]. We use the same solver, but in an online setting.
Funding
  • This work was supported by the Max Planck ETH Center for Learning Systems
Study Subjects and Analysis
students: 30
HR has zero tolerance (τ = 0) and resets the student to the initial state. We compare five different teaching policies: (i) No-intervention, where students learn in the original environment; (ii–iv) single-intervention, where students learn under each of the interventions fixed for the entire learning duration; (v) Optimized, where we use a curriculum policy optimized with CISR over 30 students. We let each of these curriculum policies train 10 students.

students: 10
We compare five different teaching policies: (i) No-intervention, where students learn in the original environment; (ii–iv) single-intervention, where students learn under each of the interventions fixed for the entire learning duration; (v) Optimized, where we use a curriculum policy optimized with CISR over 30 students. We let each of these curriculum policies train 10 students. For analysis purposes, we periodically freeze the students' policies and evaluate them in the original environment (a schematic of this comparison protocol is sketched below).
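Schematically, the comparison protocol above looks like the loop below; make_student, make_curriculum_env, train_for_one_unit, and evaluate are hypothetical placeholders for the student construction, curriculum environment, training, and evaluation steps, not the authors' code:

    CURRICULUM_POLICIES = ["No-intervention", "SR1", "SR2", "HR", "Optimized"]

    def compare_curricula(n_students=10, n_units=10):
        """Train students under each curriculum policy and periodically
        freeze and evaluate their policies in the original environment."""
        results = {}
        for curriculum in CURRICULUM_POLICIES:
            curves = []
            for _ in range(n_students):
                student = make_student()
                returns = []
                for unit in range(n_units):
                    # One interaction unit under the curriculum's intervention.
                    train_for_one_unit(student, make_curriculum_env(curriculum, unit))
                    # Freeze the policy and evaluate it in the original environment.
                    returns.append(evaluate(student, original_environment=True))
                curves.append(returns)
            results[curriculum] = curves
        return results
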

students and teachers: 10
The Optimized curriculum retains the best of both worlds by initially proposing a soft reset intervention that allows the agent to reach the goal and subsequently switching to the hard reset such that the training environment is more similar to the original one. Table 6 in Appendix B shows the confidence intervals of mean performance across 10 students and teachers trained on 3 seeds, indicating CISR's robustness.

students: 10
Similarly to the Frozen Lake evaluation, we compare four curriculum policies: (i) No-intervention, (ii-iii) single-intervention and (iv) Optimized. We let each policy train 10 students and we compare their final performance in the original Lunar Lander. Moreover, we use the Optimized curriculum to train different students than those it was optimized for, thus showing the transferability of curricula

References
  • [1] Abbeel, P. and Ng, A. (2004). Apprenticeship learning via inverse reinforcement learning. In ICML.
  • [2] Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 22–31. JMLR.org.
  • [3] Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
  • [4] Altman, E. (1999). Constrained Markov decision processes, volume 7. CRC Press.
  • [5] The GPyOpt authors (2016). GPyOpt: A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt.
  • [6] Berkenkamp, F., Schoellig, A. P., and Krause, A. (2016). Safe controller optimization for quadrotors with Gaussian processes. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pages 493–496.
  • [7] Berkenkamp, F., Turchetta, M., Schoellig, A. P., and Krause, A. (2017). Safe model-based reinforcement learning with stability guarantees. In NIPS.
  • [8] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
  • [9] Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In ICML.
  • [10] Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. (2018). A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101.
  • [11] Chow, Y., Nachum, O., Faust, A., Duenez-Guzman, E., and Ghavamzadeh, M. (2019). Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031.
  • [12] Clement, B., Roy, D., Oudeyer, P.-Y., and Lopes, M. (2015). Multi-armed bandits for intelligent tutoring systems. Journal of Educational Data Mining, 7.
  • [13] Argall, B. D., Chernova, S., Veloso, M., and Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57.
  • [14] Dulac-Arnold, G., Mankowitz, D., and Hester, T. (2019). Challenges of real-world reinforcement learning.
  • [15] El Chamie, M., Yu, Y., and Açıkmese, B. (2016). Convex synthesis of randomized policies for controlled Markov chains with density safety upper bound constraints. In 2016 American Control Conference (ACC), pages 6290–6295. IEEE.
  • [16] Eysenbach, B., Gu, S., Ibarz, J., and Levine, S. (2018). Leave no trace: Learning to reset for safe and autonomous reinforcement learning. In ICLR.
  • [17] Fachantidis, A., Partalas, I., Tsoumakas, G., and Vlahavas, I. (2013). Transferring task models in reinforcement learning agents. Neurocomputing, 107:23–32.
  • [18] Fernández, F., García, J., and Veloso, M. (2010). Probabilistic policy reuse for inter-task transfer learning. Robotics and Autonomous Systems, 58(7):866–871.
  • [19] Florensa, C., Held, D., Geng, X., and Abbeel, P. (2018). Automatic goal generation for reinforcement learning agents. In ICML.
  • [20] Florensa, C., Held, D., Wulfmeier, M., Zhang, M., and Abbeel, P. (2017). Reverse curriculum generation for reinforcement learning. In CoRL.
  • [21] García, J. and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480.
  • [22] Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. (2017). Automated curriculum learning for neural networks. In ICML.
  • [23] Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. (2018). Stable Baselines. https://github.com/hill-a/stable-baselines.
  • [24] Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J.-M., Lam, V.-D., Bewley, A., and Shah, A. (2019). Learning to drive in a day. In ICRA.
  • [25] Kivinen, J. and Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63.
  • [26] Koller, T., Berkenkamp, F., Turchetta, M., and Krause, A. (2018). Learning-based model predictive control for safe exploration. In 2018 IEEE Conference on Decision and Control (CDC), pages 6059–6066. IEEE.
  • [27] Lazaric, A. and Restelli, M. (2011). Transfer from multiple MDPs. In Advances in Neural Information Processing Systems, pages 1746–1754.
  • [28] Le, H. M., Voloshin, C., and Yue, Y. (2019). Batch policy learning under constraints. arXiv preprint arXiv:1903.08738.
  • [29] Matiisen, T., Oliver, A., Cohen, T., and Schulman, J. (2019). Teacher-student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems.
  • [30] Mockus, J., Tiesis, V., and Zilinskas, A. (1978). The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117-129):2.
  • [31] Narvekar, S., Sinapov, J., Leonetti, M., and Stone, P. (2016). Source task creation for curriculum learning. In AAMAS.
  • [32] Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., and Peters, J. (2018). An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179.
  • [33] Pan, Y., Cheng, C.-A., Saigol, K., Lee, K., Yan, X., Theodorou, E., and Boots, B. (2018). Agile autonomous driving using end-to-end deep imitation learning. In RSS.
  • [34] Pomerleau, D. (1989). ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems.
  • [35] Portelas, R., Colas, C., Hofmann, K., and Oudeyer, P.-Y. (2019). Teacher algorithms for curriculum learning of deep RL in continuously parameterized environments. In CoRL.
  • [36] Portelas, R., Colas, C., Weng, L., Hofmann, K., and Oudeyer, P.-Y. (2020). Automatic curriculum learning for deep RL: A short survey.
  • [37] Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking safe exploration in deep reinforcement learning.
  • [38] Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. (2018). Learning by playing – solving sparse reward tasks from scratch. In ICML.
  • [39] Scherrer, B. (2014). Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pages 1314–1322.
  • [40] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • [41] Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML.
  • [42] Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
  • [43] Taylor, M. E. and Stone, P. (2005). Behavior transfer for value-function-based reinforcement learning. In Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 53–59.
  • [44] Turchetta, M., Berkenkamp, F., and Krause, A. (2016). Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems, pages 4312–4320.
  • [45] Turchetta, M., Berkenkamp, F., and Krause, A. (2019). Safe exploration for interactive machine learning. In Advances in Neural Information Processing Systems, pages 2887–2897.
  • [46] Vanschoren, J. (2018). Meta-learning: A survey. arXiv preprint arXiv:1810.03548.
  • [47] Wang, R., Lehman, J., Clune, J., and Stanley, K. O. (2019). Paired open-ended trailblazer (POET): Endlessly generating increasingly complex and diverse learning environments and their solutions. CoRR, abs/1901.01753.
  • [48] Wu, Y. and Tian, Y. (2017). Training agent for first-person shooter game with actor-critic curriculum learning. In ICLR.