TL;DR
We propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy.

Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives

ICLR (2020)

Cited 6 | Viewed 343
EI
Abstract

Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that t…

Introduction
  • Learning policies that generalize to new environments or tasks is a fundamental challenge in reinforcement learning.
  • While deep reinforcement learning has enabled training powerful policies, which outperform humans on specific, well-defined tasks [24], their performance often diminishes when the properties of the environment or the task change to regimes not encountered during training.
  • This is in stark contrast to how humans learn, plan, and act: humans can seamlessly switch between different aspects of a task, transfer knowledge to new tasks from remotely related but essentially distinct prior experience, and combine primitives used for distinct aspects of different tasks in meaningful ways to solve new problems.
Highlights
  • Learning policies that generalize to new environments or tasks is a fundamental challenge in reinforcement learning.
  • While deep reinforcement learning has enabled training powerful policies, which outperform humans on specific, well-defined tasks [24], their performance often diminishes when the properties of the environment or the task change to regimes not encountered during training.
  • Since humans seem to benefit from learning skills and learning to combine skills, this might be a useful inductive bias for learning models as well. This is addressed to some extent by hierarchical reinforcement learning (HRL) methods, which focus on learning representations at multiple spatial and temporal scales, enabling better exploration strategies and improved generalization performance [9, 36, 10, 19].
  • Contributions: In summary, the contributions of our work are as follows: (1) We propose a method for learning and operating a set of functional primitives in a fully decentralized way, without requiring a high-level meta-controller to select active primitives (a minimal sketch of this selection mechanism follows this list).
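To make contribution (1) concrete, the decentralized selection can be pictured as follows: each primitive encodes the state through its own variational information bottleneck, and the selection weights are computed as a softmax over the per-primitive KL terms, so primitives that use more state information win a larger share of control. The sketch below is a minimal illustration under these assumptions; the module sizes, the Gaussian encoder, and the softmax-weighted action combination are illustrative choices, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Primitive(nn.Module):
    """One information-constrained primitive: state -> stochastic latent (with a KL cost) -> action."""

    def __init__(self, state_dim, latent_dim, action_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 2 * latent_dim),          # outputs mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.Tanh(),
            nn.Linear(64, action_dim),
        )

    def forward(self, state):
        mu, log_var = self.encoder(state).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()       # reparameterized sample
        # KL(N(mu, sigma^2) || N(0, 1)): how much state information this primitive uses.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1)
        return self.decoder(z), kl


class CompetitivePrimitiveEnsemble(nn.Module):
    """Primitives compete for control through their information usage; no high-level meta-policy."""

    def __init__(self, n_primitives, state_dim, latent_dim, action_dim):
        super().__init__()
        self.primitives = nn.ModuleList(
            [Primitive(state_dim, latent_dim, action_dim) for _ in range(n_primitives)]
        )

    def forward(self, state):
        actions, kls = zip(*(p(state) for p in self.primitives))
        kls = torch.stack(kls, dim=-1)                   # (batch, n_primitives)
        alpha = F.softmax(kls, dim=-1)                   # selection weights from information usage
        action = (torch.stack(actions, dim=-2) * alpha.unsqueeze(-1)).sum(dim=-2)
        # KL terms are returned so a training loss can penalize information usage.
        return action, alpha, kls
```

The same per-primitive KL terms can also enter the training objective as an information penalty (cf. the βind and βreg coefficients in the experimental details below), which encourages each primitive to specialize rather than attend to everything.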
Methods
  • Success rates on the 3-goal and 10-goal settings are compared for a flat PPO policy, Option Critic, MLSH, an explicit high-level policy, and the proposed method; reported values include 11 ± 5%, 18 ± 10%, 32 ± 3%, 5 ± 3%, 21 ± 5%, and 11 ± 2%.
Results
  • The authors briefly outline the tasks that the authors used to evaluate the proposed method and direct the reader to the appendix for the complete details of each task along with the hyperparameters used for the model.
  • The authors compare the proposed method to the following baselines: a) Option Critic [4] – the authors extended the original authors' implementation of the Option Critic architecture and experimented with multiple variations in terms of hyperparameters and state/goal encoding.
  • None of these yielded reasonable performance in partially observed tasks, so the authors omit it from the results.
Conclusion
  • The authors present a framework for learning an ensemble of primitive policies which can collectively solve tasks in a decentralized fashion.
  • Rather than relying on a centralized, learned meta-controller, the selection of active primitives is implemented through an information-theoretic mechanism.
  • On MiniGrid, the authors show that primitives trained with the method transfer much more successfully to new tasks; on the ant maze, they show that primitives initialized from a pretrained walking controller can learn to walk to different goals in a stochastic, multi-modal environment with nearly double the success rate of a more conventional hierarchical RL approach that uses the same pretraining but a centralized high-level policy.
  • The decentralized design also points toward continual learning: already learned primitives would keep their focus on particular aspects of the task, and newly added ones could specialize on novel aspects.
Tables
  • Table 1: Hyperparameters
Related Work
  • There is a wide variety of hierarchical reinforcement learning approaches [34, 9, 10]. One of the most widely applied HRL frameworks is the Options framework [36]. An option can be thought of as an action that extends over multiple timesteps, thus providing a notion of temporal abstraction, or subroutines, in an MDP. Each option has its own policy (which is followed while the option is selected) and a termination condition (to stop executing that option); a minimal code sketch of this structure follows this paragraph. Many strategies have been proposed for discovering options using task-specific hierarchies, such as pre-defined sub-goals [16], hand-designed features [12], or diversity-promoting priors [8, 11]; these approaches do not generalize well to new tasks. [4] proposed an approach that learns options end-to-end by parameterizing the intra-option policies as well as the policy over options and the termination conditions. Eigen-options [21] use the eigenvectors of the Laplacian of the transition graph induced by the MDP to derive an intrinsic reward for discovering options, as well as for learning the intra-option policies.
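As a point of reference for the Options framework discussed above, an option can be represented as an intra-option policy paired with a termination condition and executed as a temporally extended action inside the base MDP. Below is a minimal sketch, assuming the classic Gym-style step API; the types and the run_option helper are illustrative, not part of any cited implementation.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable

State, Action = Any, Any


@dataclass
class Option:
    """An option in the sense of the Options framework [36]: a temporally extended
    action with its own intra-option policy and termination condition."""
    policy: Callable[[State], Action]       # pi_o(s): action to take while the option runs
    termination: Callable[[State], float]   # beta_o(s): probability of terminating in state s


def run_option(env, state, option, max_steps=100):
    """Execute one option until it terminates (or the episode ends); returns the
    resulting state and the accumulated (undiscounted, for simplicity) reward."""
    total_reward = 0.0
    for _ in range(max_steps):
        state, reward, done, _ = env.step(option.policy(state))   # classic Gym step API
        total_reward += reward
        if done or random.random() < option.termination(state):
            break
    return state, total_reward
```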
Funding
  • The authors are grateful to NSERC, CIFAR, Google, Samsung, Nuance, IBM, Canada Research Chairs, the Canada Graduate Scholarship Program, and Nvidia for funding, and to Compute Canada for computing resources.
References
  • Alessandro Achille and Stefano Soatto. Information dropout: learning optimal representations through noise. CoRR, abs/1611.01353, 2016. URL http://arxiv.org/abs/1611.01353.
  • Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. CoRR, abs/1612.00410, 2016. URL http://arxiv.org/abs/1612.00410.
  • Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 166–175. JMLR.org, 2017.
  • Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017.
  • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
  • Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018.
  • Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • Christian Daniel, Gerhard Neumann, and Jan Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pages 273–281, 2012.
  • Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993.
  • Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
  • Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
  • Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.
  • K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. Meta Learning Shared Hierarchies. arXiv e-prints, October 2017.
  • Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017.
  • Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018.
  • Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
  • Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2989–2998, 2017.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
  • Libin Liu and Jessica Hodgins. Learning to schedule control fragments for physics-based characters using deep q-learning. ACM Transactions on Graphics, 36(3), 2017.
  • Marlos C Machado, Marc G Bellemare, and Michael Bowling. A laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956, 2017.
  • Josh Merel, Arun Ahuja, Vu Pham, Saran Tunyasuvunakool, Siqi Liu, Dhruva Tirumala, Nicolas Heess, and Greg Wayne. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJfYvo09Y7.
  • Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJl6TjRcY7.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
  • Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Schölkopf. Learning independent causal mechanisms. arXiv preprint arXiv:1712.00961, 2017.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
  • Xue Bin Peng, Glen Berseth, Kangkang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph., 36 (4):41:1–41:13, July 2017. ISSN 0730-0301. doi: 10.1145/3072959.3073602. URL http://doi.acm.org/10.1145/3072959.3073602.
  • Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph., 37(4):143:1–143:14, July 2018. ISSN 0730-0301. doi: 10.1145/3197517.3201311. URL http://doi.acm.org/10.1145/3197517.3201311.
  • Clemens Rosenbaum, Ignacio Cases, Matthew Riemer, and Tim Klinger. Routing networks and the challenges of modular and compositional computation. arXiv preprint arXiv:1904.12774, 2019.
  • Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
  • John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998.
  • Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS'99, pages 1057–1063, Cambridge, MA, USA, 1999. MIT Press. URL http://dl.acm.org/citation.cfm?id=3009657.3009806.
  • Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2): 181–211, 1999.
  • Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop, coursera: Neural networks for machine learning. University of Toronto, Technical Report, 2012.
  • Naftali Tishby, Fernando C. N. Pereira, and William Bialek. The information bottleneck method. CoRR, physics/0004057, 2000. URL http://arxiv.org/abs/physics/0004057.
  • Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  • Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html.
  • Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3-4):229–256, 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696.
  • Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pages 5279–5288, 2017.
Experimental Details
  • 1. Can our proposed approach learn primitives that remain active when training the agent over a sequence of tasks?
  • 2. Can our proposed approach be used to improve the sample efficiency of the agent over a sequence of tasks?
  • 1. All the models (proposed as well as the baselines) are implemented in PyTorch 1.1 [27] unless stated otherwise.
  • 2. For Meta-Learning Shared Hierarchies [14] and Option-Critic [4], we adapted the authors' implementations for our environments.
  • 3. During the evaluation, we use 10 processes in parallel to run 500 episodes and compute the percentage of times the agent solves the task within the prescribed time limit. This metric is referred to as the “success rate”.
  • 4. The default time limit is 500 steps for all the tasks unless specified otherwise.
  • 5. All the feedforward networks are initialized with the orthogonal initialization where the input tensor is filled with a (semi) orthogonal matrix.
  • 6. For all the embedding layers, the weights are initialized using the unit-Gaussian distribution.
  • 7. The weights and biases for all the GRU models are initialized from U(−√k, √k), where k depends on the hidden size (PyTorch's default GRU initialization); see the initialization sketch after this list.
  • 8. During training, we perform 64 rollouts in parallel to collect 5-step trajectories.
  • 9. The βind and βreg parameters are both selected from the set {0.001, 0.005, 0.009} by performing validation.
  • 3. The sizes of the inputs to the first and second layers of the policy network are 320 and 64, respectively.
  • 4. Produces a scalar output.
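Items 5–7 above roughly correspond to an initialization helper along the following lines in PyTorch; this is a hedged sketch (the zero-bias choice and the init_weights name are assumptions, and GRU parameters are simply left at the framework default), not the authors' code.

```python
import torch.nn as nn


def init_weights(module):
    """Initialization along the lines described above: orthogonal weights for
    feedforward layers, unit-Gaussian embedding weights; GRU parameters are left
    at PyTorch's default U(-sqrt(k), sqrt(k))."""
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight)                  # (semi) orthogonal weight matrix
        if module.bias is not None:
            nn.init.zeros_(module.bias)                     # bias handling is an assumption
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=1.0)   # unit-Gaussian embedding weights


# Usage (illustrative): model.apply(init_weights) after constructing the networks.
```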
Authors
Shagun Sodhani
Xue Bin Peng