Meta Learning Shared Hierarchies

International Conference on Learning Representations (ICLR), 2018.


Abstract:

We develop a metalearning approach for learning hierarchically structured policies, improving sample efficiency on unseen tasks through the use of shared primitives—policies that are executed for large numbers of timesteps. Specifically, a set of primitives are shared within a distribution of tasks, and are switched between by task-specific master policies…

Introduction
  • Humans encounter a wide variety of tasks throughout their lives and utilize prior knowledge to master new tasks quickly.
  • One challenge is that while the authors want to share information between the different tasks, these tasks have different optimal policies, so it is suboptimal to learn a single shared policy for all tasks.
  • Addressing this challenge, the authors propose a model containing a set of shared sub-policies, which are switched between by task-specific master policies.
  • This design is closely related to the options framework (Sutton et al., 1999; Bacon et al., 2016), but applied to the setting of a task distribution.
  • The authors propose a method for the end-to-end training of sub-policies that allows for quick learning on new tasks, where adaptation to a new task is handled solely by learning a new master policy.
Highlights
  • Humans encounter a wide variety of tasks throughout their lives and utilize prior knowledge to master new tasks quickly
  • One challenge is that while we want to share information between the different tasks, these tasks have different optimal policies, so it is suboptimal to learn a single shared policy for all tasks. Addressing this challenge, we propose a model containing a set of shared sub-policies, which are switched between by task-specific master policies.
  • We propose a method for the end-to-end training of sub-policies that allows for quick learning on new tasks, where adaptation to a new task is handled solely by learning a new master policy.
  • While there are various possible architectures incorporating shared parameters φ and per-task parameters θ, we propose an architecture that is motivated by the ideas of hierarchical reinforcement learning; a minimal sketch of this structure follows this list.
  • The set of sub-policies could be condensed into a single neural network, which receives a continuous vector from the master policy.
  • We believe this work opens up many directions in training agents that can quickly adapt to new tasks
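To make the two-level structure concrete, the following is a minimal sketch of one way the shared sub-policies (φ) and a task-specific master policy (θ) could be organized, assuming the master re-selects a discrete sub-policy index every N timesteps. The class name, layer sizes, and method names are illustrative placeholders, not taken from the authors' released code.

```python
import torch
import torch.nn as nn

class HierarchicalAgent(nn.Module):
    """Illustrative two-level policy: a task-specific master (theta) selects
    among K shared sub-policies (phi). Names and sizes are hypothetical."""

    def __init__(self, obs_dim, act_dim, num_sub=4, hidden=64):
        super().__init__()
        # Master policy: categorical distribution over sub-policy indices.
        self.master = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, num_sub))
        # Shared sub-policies: each maps observations to action logits.
        self.subs = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                          nn.Linear(hidden, act_dim))
            for _ in range(num_sub)])

    def act(self, obs, active_sub=None, switch=False):
        """Re-select a sub-policy when `switch` is True (e.g. every N steps),
        then act with the currently active sub-policy."""
        if switch or active_sub is None:
            logits = self.master(obs)
            active_sub = torch.distributions.Categorical(logits=logits).sample().item()
        action_logits = self.subs[active_sub](obs)
        action = torch.distributions.Categorical(logits=action_logits).sample().item()
        return action, active_sub

# Usage: the master re-decides every 50 timesteps (an arbitrary choice here).
agent = HierarchicalAgent(obs_dim=10, act_dim=4)
obs, sub = torch.zeros(10), None
for t in range(200):
    action, sub = agent.act(obs, active_sub=sub, switch=(t % 50 == 0))
```

The alternative raised above, collapsing the sub-policies into a single network conditioned on a continuous vector from the master, would replace the discrete index `active_sub` with that vector as an extra input to one shared network.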
Methods
  • The authors hypothesize that meaningful sub-policies can be learned by operating over distributions of tasks, efficiently enough to handle complex physics domains; a sketch of such a meta-training loop follows this list.
  • The authors present a series of experiments designed to test the performance of the method, through comparison to baselines and to past hierarchical methods.
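To make "operating over distributions of tasks" concrete, here is a high-level sketch of a meta-training loop in the spirit of the approach summarized in the abstract: shared sub-policies persist across tasks, while the task-specific master is reset and relearned on each newly sampled task. The stub functions, the warmup/joint split, and the round counts are illustrative placeholders rather than the authors' exact algorithm or hyperparameters.

```python
import random

# Hypothetical stand-ins for the real components; an actual implementation
# would plug in an environment distribution and an RL algorithm such as PPO.
def sample_task(task_distribution):
    return random.choice(task_distribution)

def reset_master_parameters(agent):
    ...  # re-initialize only the task-specific master policy (theta)

def rl_update(agent, task, train_master=True, train_subs=False):
    ...  # one round of policy-gradient updates on rollouts from `task`

def meta_train(agent, task_distribution, iterations=1000,
               warmup_rounds=20, joint_rounds=60):
    """Outer meta-training loop: shared sub-policies (phi) persist across
    tasks, while the master (theta) is reset and relearned per task."""
    for _ in range(iterations):
        task = sample_task(task_distribution)
        reset_master_parameters(agent)
        # Warmup: adapt only the master, so returns reflect sub-policy quality.
        for _ in range(warmup_rounds):
            rl_update(agent, task, train_master=True, train_subs=False)
        # Joint period: update master and shared sub-policies together.
        for _ in range(joint_rounds):
            rl_update(agent, task, train_master=True, train_subs=True)
```

Resetting the master before each task is what forces the shared sub-policies to remain useful under a freshly learned master, which is the property that makes them transfer to unseen tasks.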
Conclusion
  • The authors formulate an approach for the end-to-end metalearning of hierarchical policies.
  • As there is no gradient signal being passed between the master and sub-policies, the MLSH model uses hard one-hot communication, as opposed to differentiable relaxations such as Gumbel-Softmax (Jang et al., 2016); the contrast is sketched after this list.
  • While the authors used policy gradients in the experiments, it is entirely feasible to train the master or sub-policies with evolution (Eigen) or Q-learning (Watkins & Dayan, 1992).
  • From another point of view, the training framework can be seen as a method of joint optimization over two sets of parameters.
  • The authors believe this work opens up many directions in training agents that can quickly adapt to new tasks.
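To illustrate the point about hard one-hot communication, the sketch below contrasts a hard (sampled) sub-policy selection, through which no gradient flows from the executed action back into the master, with a Gumbel-Softmax relaxation that would make the selection differentiable. Tensor shapes and function names are illustrative only.

```python
import torch
import torch.nn.functional as F

def select_hard(master_logits, sub_actions):
    """Hard one-hot selection: sample one sub-policy index.
    No gradient reaches the master through the executed action;
    each level is trained by its own reinforcement-learning objective."""
    idx = torch.distributions.Categorical(logits=master_logits).sample()
    return sub_actions[idx]

def select_gumbel_softmax(master_logits, sub_actions, tau=1.0):
    """Differentiable alternative (Jang et al., 2016): a soft, reparameterized
    mixture over sub-policy outputs, so gradients reach the master directly.
    Shown only for contrast with the hard one-hot scheme."""
    weights = F.gumbel_softmax(master_logits, tau=tau)          # shape (K,)
    return (weights.unsqueeze(-1) * sub_actions).sum(dim=0)     # weighted mix

# Illustrative shapes: K=3 sub-policies, each proposing a 2-D action.
master_logits = torch.randn(3)
sub_actions = torch.randn(3, 2)
hard = select_hard(master_logits, sub_actions)
soft = select_gumbel_softmax(master_logits, sub_actions)
```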
Related work
  • Previous work in hierarchical reinforcement learning seeks to speed up the learning process by recombining a set of temporally extended primitives—the most well-known formulation is Options (Sutton et al., 1999). While the earliest work assumed that these options are given, more recent work seeks to learn them automatically (Vezhnevets et al., 2016; Daniel et al., 2016). Heess et al. (2016) discover primitives by training over a set of simple tasks. Florensa et al. (2017) learn a master policy, where sub-policies are defined according to information-maximizing statistics. Bacon et al. (2016) introduce end-to-end learning of hierarchy through the options framework. Henderson et al. (2017) extend the options framework to include reward options. Several methods (Dayan & Hinton, 1993; Vezhnevets et al., 2017; Ghazanfari & Taylor, 2017) aim to learn a decomposition of complicated tasks into sub-goals. These prior works are mostly focused on the single-task setting and do not account for the multi-task structure as part of the algorithm. Other past works (Thomas & Barto, 2011; Thomas, 2011; Thomas & Barto, 2012) have simultaneously learned modules that are used in conjunction to solve tasks, but do not incorporate temporal abstraction. Our work, on the other hand, takes advantage of the multi-task setting as a way to learn temporally extended primitives.
References
  • Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems 29, 2016.
  • Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. arXiv preprint arXiv:1609.05140, 2016.
  • Christian Daniel, Herke van Hoof, Jan Peters, and Gerhard Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, 2016.
  • Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, 1993.
  • Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
  • Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. In International Conference on Learning Representations, 2017.
  • Behzad Ghazanfari and Matthew E. Taylor. Autonomous extracting a hierarchical structure of tasks in reinforcement learning and multi-task reinforcement learning. arXiv preprint arXiv:1709.04579, 2017.
  • Nicolas Heess, Greg Wayne, Yuval Tassa, Timothy Lillicrap, Martin Riedmiller, and David Silver. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.
  • Peter Henderson, Wei-Di Chang, Pierre-Luc Bacon, David Meger, Joelle Pineau, and Doina Precup. OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. arXiv preprint arXiv:1709.06683, 2017.
  • Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2017.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
  • John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.
  • Philip S. Thomas. Policy gradient coagent networks. In Advances in Neural Information Processing Systems, pp. 1944–1952, 2011.
  • Philip S. Thomas and Andrew G. Barto. Conjugate Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 137–144, 2011.
  • Philip S. Thomas and Andrew G. Barto. Motor primitive discovery. In 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL), pp. 1–8. IEEE, 2012.
  • E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012.
  • Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, and Koray Kavukcuoglu. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems, 2016.
  • Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
  • Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matthew Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
  • Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3–4):279–292, 1992.