Generating Adjacency-Constrained Subgoals in Hierarchical Reinforcement Learning

NeurIPS 2020

Abstract

Goal-conditioned hierarchical reinforcement learning (HRL) is a promising approach for scaling up reinforcement learning (RL) techniques. However, it often suffers from training inefficiency as the action space of the high-level, i.e., the goal space, is often large. Searching in a large goal space poses difficulties for both high-level...
Introduction
  • Hierarchical reinforcement learning (HRL) has shown great potential in scaling up reinforcement learning (RL) methods to tackle large, temporally extended problems with long-term credit assignment and sparse rewards [31, 24, 2].
  • Since subgoals can be interpreted as high-level actions, it is feasible to directly train the high-level policy to generate subgoals using the external reward as supervision, an approach widely adopted in previous research [20, 19, 16, 14, 34].
  • Although these methods require little task-specific design, they often suffer from training inefficiency.
  • The low-level training suffers because the agent tries to reach every possible subgoal produced by the high-level policy.
Highlights
  • Hierarchical reinforcement learning (HRL) has shown great potential in scaling up reinforcement learning (RL) methods to tackle large, temporally extended problems with long-term credit assignment and sparse rewards [31, 24, 2].
  • We present our method, Hierarchical Reinforcement learning with k-step Adjacency Constraint (HRAC).
  • We propose a novel k-step adjacency constraint for the goal-conditioned hierarchical reinforcement learning (HRL) framework to address the issue of training inefficiency, with a theoretical guarantee of preserving the optimal policy in deterministic Markov Decision Processes (MDPs).
  • We show that the proposed adjacency constraint can be practically implemented with an adjacency network (a minimal sketch follows this list).
  • This work may promote research in the fields of HRL and RL, and has potential real-world applications such as robotics.
  • Experimental results on discrete and continuous control tasks show that our method outperforms state-of-the-art HRL approaches.
  • Since the training data of RL heavily depend on the training environment, designing unbiased simulators or real-world training environments is important for eliminating biases in the data collected by the agent.
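The highlights above describe the k-step adjacency constraint only in words. The sketch below shows one way such a constraint could be attached to the high-level objective: a learned adjacency network embeds states so that embedding distance roughly tracks the shortest transition distance, and a proposed subgoal incurs a hinge penalty whenever its embedding distance from the current state exceeds the threshold associated with k-step adjacency. This is a minimal sketch under stated assumptions, not the authors' implementation; the names (AdjacencyNet, adjacency_penalty, high_level_loss), the network sizes, and the exact penalty form are illustrative choices.

```python
import torch
import torch.nn as nn

class AdjacencyNet(nn.Module):
    """Maps states to an embedding space where Euclidean distance is intended
    to approximate the shortest transition distance between states."""
    def __init__(self, state_dim, embed_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def dist(self, s1, s2):
        # Embedding distance used as a proxy for k-step reachability.
        return torch.norm(self.encoder(s1) - self.encoder(s2), dim=-1)

def adjacency_penalty(adj_net, state, subgoal, eps_k):
    """Hinge penalty: zero when the subgoal is judged to lie within the k-step
    adjacent region of the current state (distance <= eps_k), linear otherwise."""
    return torch.clamp(adj_net.dist(state, subgoal) - eps_k, min=0.0)

def high_level_loss(policy_loss, adj_net, states, subgoals, eps_k, coef=1.0):
    """Augment the usual high-level RL loss with the adjacency penalty so that
    subgoal generation is softly restricted to the k-step adjacent region."""
    penalty = adjacency_penalty(adj_net, states, subgoals, eps_k).mean()
    return policy_loss + coef * penalty
```

In such a scheme the coefficient coef would trade off the external reward against how strictly subgoals are confined to the k-step adjacent region.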
Results
  • Experimental results on discrete and continuous control tasks show that the method outperforms the state-of-the-art HRL approaches.
Conclusion
  • The authors presented a novel k-step adjacency constraint for the goal-conditioned HRL framework to address the issue of training inefficiency, with a theoretical guarantee of preserving the optimal policy in deterministic MDPs. They show that the proposed adjacency constraint can be practically implemented with an adjacency network (one possible training objective is sketched after this list).
  • Future work includes extending the proposed framework to tasks with high-dimensional state spaces and leveraging the learned adjacency network to improve learning efficiency in more general scenarios.
  • This work may promote research in the fields of HRL and RL, and has potential real-world applications such as robotics.
  • Since the training data of RL heavily depend on the training environment, designing unbiased simulators or real-world training environments is important for eliminating biases in the data collected by the agent.
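One plausible way to train an adjacency network, assuming state pairs have already been labeled as k-step adjacent or not (for example, from trajectories collected during training), is a contrastive-style hinge loss in the spirit of [10]: adjacent pairs are pulled inside the threshold eps_k and non-adjacent pairs are pushed beyond a small margin. This is a sketch under those assumptions, not the paper's exact recipe; the margin delta, the labeling procedure, and the function name are illustrative.

```python
import torch

def adjacency_contrastive_loss(d, labels, eps_k, delta=0.2):
    """Contrastive-style objective for an adjacency network.

    d: embedding distances for a batch of state pairs (shape [batch]).
    labels[i] = 1.0 if the i-th pair is judged k-step adjacent, else 0.0.
    Adjacent pairs are pulled inside the eps_k ball; non-adjacent pairs are
    pushed outside eps_k + delta, leaving a margin between the two cases.
    """
    pos = labels * torch.clamp(d - eps_k, min=0.0)
    neg = (1.0 - labels) * torch.clamp(eps_k + delta - d, min=0.0)
    return (pos + neg).mean()
```

Here d would be produced by the adjacency network for the sampled state pairs; minimizing this loss alongside the RL updates keeps the distance estimates current as the low-level policy changes.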
Tables
  • Table1: Hyper-parameters used in discrete control tasks. “K-C” in the table refers to “Key-Chest”
  • Table2: Table 2
  • Table3: Hyper-parameters used in adjacency network training
Related work
  • How to effectively learn policies with multiple hierarchies has been a long-standing topic in RL. Goal-conditioned HRL [3, 30, 14, 34, 20, 16] aims to answer this question with a framework that separates high-level planning from low-level control using subgoals. Recent advances in goal-conditioned HRL mainly focus on improving the learning efficiency of this framework. Nachum et al. [20, 19] propose an off-policy correction technique to stabilize training and address the problem of goal-space representation learning with a mutual-information-based objective. However, the subgoal generation process in their approaches is unconstrained and supervised only by the external reward, so they may still suffer from training inefficiency. Levy et al. [16] use hindsight techniques [1] to train multi-level policies in parallel and also penalize the high level for generating subgoals that the low level fails to reach; however, they obtain the reachability measure directly from the environment, relying on environmental information that is unavailable in many scenarios. Other works [17, 13, 15, 27, 25, 12] focus on the unsupervised discovery of subgoals based on potentially pivotal states, but these subgoals are not guaranteed to be well aligned with downstream tasks and are therefore often sub-optimal.
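To make the framework shared by these works concrete, the sketch below shows a generic goal-conditioned HRL rollout: the high-level policy emits a subgoal every c environment steps and is supervised by the external reward, while the low-level policy conditions on the subgoal and receives an intrinsic reward (here, the negative distance to the subgoal, a common choice). All names (env, high_policy, low_policy, c) are placeholders, and the gym-style step interface and vector-valued states are assumptions; this is not the interface of any specific method cited above.

```python
import numpy as np

def rollout(env, high_policy, low_policy, c=10, max_steps=500):
    """Generic goal-conditioned HRL rollout: the high level picks a subgoal
    every c steps; the low level acts to reach it and is rewarded by the
    negative distance to the subgoal."""
    s = env.reset()
    subgoal = high_policy(s)             # high-level action = subgoal
    high_return, transitions = 0.0, []
    for t in range(max_steps):
        a = low_policy(s, subgoal)       # low level conditions on the subgoal
        s_next, ext_reward, done, _ = env.step(a)
        intrinsic = -np.linalg.norm(s_next - subgoal)   # low-level reward
        transitions.append((s, subgoal, a, intrinsic, s_next))
        high_return += ext_reward        # external reward supervises the high level
        if (t + 1) % c == 0 or done:
            subgoal = high_policy(s_next)    # re-sample the subgoal every c steps
        s = s_next
        if done:
            break
    return transitions, high_return
```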
Funding
  • Acknowledgments and Disclosure of Funding: This work was supported in part by the National Natural Science Foundation of China under Grants 61671266 and 61836004, and in part by the Tsinghua-Guoqiang research program under Grant 2019GQG0006.
Study subjects and analysis
Obviously, if these policies are diverse enough, we can effectively approximate the shortest transition distance with a sufficiently large n. However, training a set of diverse policies separately is costly, while using a single policy to approximate the policy set (n = 1) [27, 28] often leads to non-optimality. To handle this difficulty, we exploit the fact that the low-level policy itself changes over time during training.
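As a concrete illustration of this idea, one simple scheme is to aggregate k-step adjacency information over all trajectories collected so far: two states are marked adjacent if any past trajectory visited them within k steps of each other, so the changing low-level policy effectively plays the role of a diverse policy set. The discrete, hashable-state representation below is an assumption made for clarity; continuous tasks would need a discretization or a learned adjacency network as sketched earlier.

```python
from collections import defaultdict

def update_adjacency(adj, trajectory, k):
    """Aggregate k-step adjacency over all trajectories collected so far.

    adj maps a state (hashable) to the set of states observed within k steps
    of it in any past trajectory. Pooling trajectories gathered while the
    low-level policy keeps changing approximates sampling from many policies.
    """
    for i, s_i in enumerate(trajectory):
        for s_j in trajectory[i + 1: i + 1 + k]:
            adj[s_i].add(s_j)
            adj[s_j].add(s_i)
    return adj

# usage: adj = defaultdict(set); adj = update_adjacency(adj, states, k=5)
```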

Reference
  • [1] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, 2017.
  • [2] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77, 2003.
  • [3] Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, 1993.
  • [4] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Go-Explore: A new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
  • [5] Ben Eysenbach, Russ R. Salakhutdinov, and Sergey Levine. Search on the replay buffer: Bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems, 2019.
  • [6] Carlos Florensa, Jonas Degrave, Nicolas Heess, Jost Tobias Springenberg, and Martin Riedmiller. Self-supervised learning of image embedding for continuous control. arXiv preprint arXiv:1901.00943, 2019.
  • [7] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. In ICLR, 2017.
  • [8] Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In ICML, 2018.
  • [9] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In ICML, 2018.
  • [10] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
  • [11] Kristian Hartikainen, Xinyang Geng, Tuomas Haarnoja, and Sergey Levine. Dynamical distance learning for semi-supervised and unsupervised skill discovery. In ICLR, 2020.
  • [12] Zhiao Huang, Fangchen Liu, and Hao Su. Mapping state space using landmarks for universal goal reaching. In Advances in Neural Information Processing Systems, 2019.
  • [13] Özgür Simsek, Alicia P. Wolfe, and Andrew G. Barto. Identifying useful subgoals in reinforcement learning by local graph partitioning. In ICML, 2005.
  • [14] Tejas D. Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016.
  • [15] Tejas D. Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J. Gershman. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016.
  • [16] Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. Learning multi-level hierarchies with hindsight. In ICLR, 2019.
  • [17] Amy McGovern and Andrew G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, 2001.
  • [18] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
  • [19] Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Near-optimal representation learning for hierarchical reinforcement learning. In ICLR, 2019.
  • [20] Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2018.
  • [21] Ashvin V. Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, 2018.
  • [22] Soroush Nasiriany, Vitchyr H. Pong, Steven Lin, and Sergey Levine. Planning with goal-conditioned policies. In Advances in Neural Information Processing Systems, 2019.
  • [23] Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep RL for model-based control. In ICLR, 2018.
  • [24] Doina Precup. Temporal abstraction in reinforcement learning. PhD thesis, University of Massachusetts, Amherst, 2000.
  • [25] Jacob Rafati and David C. Noelle. Unsupervised methods for subgoal discovery during intrinsic motivation in model-free hierarchical reinforcement learning. In AAAI, 2019.
  • [26] Zhizhou Ren, Kefan Dong, Yuan Zhou, Qiang Liu, and Jian Peng. Exploration via hindsight goal generation. In Advances in Neural Information Processing Systems, 2019.
  • [27] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In ICLR, 2018.
  • [28] Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. Episodic curiosity through reachability. In ICLR, 2019.
  • [29] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In ICML, 2015.
  • [30] Jürgen Schmidhuber and Reiner Wahnsiedler. Planning simple trajectories using neural subgoal generators. In From Animals to Animats 2: Proceedings of the Second International Conference on Simulation of Adaptive Behavior, volume 2, page 196. MIT Press, 1993.
  • [31] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
  • [32] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.
  • [33] Tom Van de Wiele, David Warde-Farley, Andriy Mnih, and Volodymyr Mnih. Q-learning in enormous action spaces via amortized approximate maximization. In ICLR, 2020.
  • [34] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. In ICML, 2017.
  • [35] Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J. Mankowitz, and Shie Mannor. Learn what not to learn: Action elimination with deep reinforcement learning. In Advances in Neural Information Processing Systems, 2018.