Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills

Víctor Campos
Alexander Trott
Xavier Giró-i-Nieto
Jordi Torres

ICML, pp. 1317-1327, 2020.


Abstract:

Acquiring abilities in the absence of a task-oriented reward function is at the frontier of reinforcement learning research. This problem has been studied through the lens of empowerment, which draws a connection between option discovery and information theory. Information-theoretic skill discovery methods have garnered much interest from…

Introduction
  • Reinforcement learning (RL) algorithms have recently achieved outstanding goals thanks to advances in simulation (Todorov et al., 2012; Bellemare et al., 2013), efficient and scalable learning algorithms (Mnih et al., 2015; Lillicrap et al., 2016; Schulman et al., 2015; 2017; Espeholt et al., 2018), and function approximation (LeCun et al., 2015; Goodfellow et al., 2016).
  • Training aims to solve a particular task, relying on task-specific reward functions to measure progress and drive learning.
  • This contrasts with how intelligent creatures learn in the absence of external supervisory signals, acquiring abilities in a task-agnostic manner by exploring the environment.
  • In RL, analogous “unsupervised” methods are often aimed at learning generically useful behaviors for interacting within some environment, behaviors that may naturally accelerate learning once one or more downstream tasks become available
Highlights
  • Since p(s) is fixed, Equation 7 can be maximized in a reinforcement learning setup with the reward function r(s, z) = log qφ(s|z), z ∼ p(z), where qφ(s|z) is given by the decoder of the variational autoencoder trained in the skill discovery stage (a code sketch of this reward follows the list)
  • We provide theoretical and empirical evidence that poor state space coverage is a predominant failure mode of existing skill discovery methods
  • The information-theoretic objective requires access to unknown distributions, which these methods approximate with those induced by the policy. These approximations lead to pathological training dynamics where the agent obtains larger rewards by visiting already discovered states rather than exploring the environment
  • We propose Explore, Discover and Learn (EDL), a novel option discovery approach that leverages a fixed distribution over states and variational inference techniques to break the dependency on the distributions induced by the policy
  • Our experiments suggest that EDL discovers a meaningful latent space for skills even when tasked with learning a discrete set of options, whose latent codes can be combined in order to produce a richer set of behaviors
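The skill-learning reward highlighted above depends only on the decoder fit during the skill discovery stage. The snippet below is a minimal sketch of how such a reward term could be computed; the Gaussian decoder output with fixed variance and all function names are illustrative assumptions, not the authors' implementation.

    import torch

    def skill_reward(decoder, state, z):
        """Sketch of r(s, z) = log q_phi(s|z) for EDL-style skill learning.

        Assumes `decoder` maps a skill code z to the mean of a Gaussian over
        states with fixed unit variance, as a simple VAE decoder might.
        Names and shapes are illustrative.
        """
        mu = decoder(z)                             # predicted state for skill z
        dist = torch.distributions.Normal(mu, 1.0)  # fixed-variance Gaussian (assumption)
        return dist.log_prob(state).sum(dim=-1)     # log q_phi(s|z), summed over state dims

    # Usage (schematic): sample z ~ p(z) at the start of each episode, roll out the
    # skill-conditioned policy, and use skill_reward(...) in place of an environment
    # reward when updating the policy.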
Methods
  • Modifying the prior over skills p(z) is not straightforward in methods other than EDL.
  • Previous methods perform exploration and skill discovery at the same time, so that modifying p(z) inevitably involves exploring the environment from scratch.
  • Skill learning.
  • The final stage consists of training a skill-conditioned policy πθ(a|s, z) that maximizes the mutual information between states and latent variables.
  • EDL adopts the forward form of the mutual information.
  • Since p(s) is fixed, Equation 7 can be maximized in a reinforcement learning setup with the reward function r(s, z) = log qφ(s|z), z ∼ p(z) (a sketch of the underlying variational bound follows this list)
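For reference, a sketch of the variational lower bound behind this reward, following Barber & Agakov (2003) and the notation of Table 1 (the exact statement of Equation 7 is given in the paper; this reconstruction assumes that form):

    \begin{align}
      I(S;Z) &= \mathcal{H}(S) - \mathcal{H}(S \mid Z) \\
             &\geq \mathcal{H}(S) + \mathbb{E}_{z \sim p(z),\, s \sim p(s \mid z)}\big[\log q_\phi(s \mid z)\big]
    \end{align}

Because p(s) is fixed by the exploration stage, H(S) is constant with respect to the policy, so maximizing the bound reduces to maximizing the expected log qφ(s|z), which is exactly the reward used in the skill learning stage.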
Conclusion
  • The authors provide theoretical and empirical evidence that poor state space coverage is a predominant failure mode of existing skill discovery methods.
  • The authors propose Explore, Discover and Learn (EDL), a novel option discovery approach that leverages a fixed distribution over states and variational inference techniques to break the dependency on the distributions induced by the policy.
  • This alternative approach optimizes the same information-theoretic objective used in previous methods.
  • The authors' experiments suggest that EDL discovers a meaningful latent space for skills even when tasked with learning a discrete set of options, whose latent codes can be combined in order to produce a richer set of behaviors
Tables
  • Table 1: Types of methods depending on the considered generative model and the version of the mutual information (MI) being maximized. Distributions denoted by ρ are induced by running the policy in the environment, whereas p denotes the true and potentially unknown distributions. The dependency of existing methods on ρπ(s|z) causes pathological training dynamics by letting the agent influence the states considered in the optimization process. EDL relies on a fixed distribution over states p(s) to break this dependency and makes use of variational inference (VI) techniques to model p(s|z) and p(z|s)
  • Table 2: Environment details
  • Table 3: Hyperparameters used in the experiments. Values in brackets were used in the grid search and tuned independently for each method
  • Table 4: Hyperparameters used for exploration with SMM. Values in brackets were used in the grid search and tuned independently for each environment. Training ends once the buffer is full
  • Table 5: Hyperparameters used for training the VQ-VAE in the skill discovery stage. Values in brackets were used in the grid search and tuned independently for each environment and exploration method
Funding
  • This work was partially supported by the Spanish Ministry of Science and Innovation and the European Regional Development Fund under contracts TEC2016-75976-R and TIN2015-65316-P, by the BSC-CNS Severo Ochoa program SEV-2015-0493, and grant 2017-SGR-1414 by Generalitat de Catalunya
  • Víctor Campos was supported by Obra Social “la Caixa” through the La Caixa-Severo Ochoa International Doctoral Fellowship program
References
  • Achiam, J., Edwards, H., Amodei, D., and Abbeel, P. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299, 2018.
  • Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
  • Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight experience replay. In NIPS, 2017.
  • Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In AAAI, 2017.
  • Barber, D. and Agakov, F. V. The IM algorithm: a variational approach to information maximization. In NIPS, 2003.
  • Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wainwright, M., Kuttler, H., Lefrancq, A., Green, S., Valdes, V., Sadik, A., et al. Deepmind lab. arXiv preprint arXiv:1612.03801, 2016.
  • Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In NIPS, 2016.
  • Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013.
  • Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
  • Capdevila, J., Cerquides, J., and Torres, J. Mining urban events from the tweet stream through a probabilistic mixture model. Data mining and knowledge discovery, 2018.
  • Chou, P.-W., Maturana, D., and Scherer, S. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In ICML, 2017.
  • Conti, E., Madhavan, V., Such, F. P., Lehman, J., Stanley, K. O., and Clune, J. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint arXiv:1712.06560, 2017.
  • Cully, A., Clune, J., Tarapore, D., and Mouret, J.-B. Robots that can adapt like animals. Nature, 2015.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. Go-explore: a new approach for hardexploration problems. arXiv preprint arXiv:1901.10995, 2019.
  • Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In ICML, 2018.
  • Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. In ICLR, 2019.
  • Florensa, C., Duan, Y., and Abbeel, P. Stochastic neural networks for hierarchical reinforcement learning. In ICLR, 2017.
  • Florensa, C., Degrave, J., Heess, N., Springenberg, J. T., and Riedmiller, M. Self-supervised learning of image embedding for continuous control. arXiv preprint arXiv:1901.00943, 2019.
  • Fox, R., Krishnan, S., Stoica, I., and Goldberg, K. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
  • Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. Meta learning shared hierarchies. In ICLR, 2018.
  • Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016.
  • Gregor, K., Rezende, D. J., and Wierstra, D. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
  • Guss, W. H., Codel, C., Hofmann, K., Houghton, B., Kuno, N., Milani, S., Mohanty, S., Liebana, D. P., Salakhutdinov, R., Topin, N., et al. The minerl competition on sample efficient reinforcement learning using human priors. arXiv preprint arXiv:1904.10079, 2019.
  • Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In ICML, 2017.
  • Hazan, E., Kakade, S. M., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. In ICML, 2019.
  • He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
  • Henaff, O. J., Razavi, A., Doersch, C., Eslami, S., and Oord, A. v. d. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
  • Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., DulacArnold, G., et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.
  • Ho, J. and Ermon, S. Generative adversarial imitation learning. In NIPS, 2016.
  • Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. Vime: Variational information maximizing exploration. In NIPS, 2016.
  • Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. In ICLR, 2017.
  • Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al. Human-level performance in 3d multiplayer games with populationbased reinforcement learning. Science, 2019.
  • Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2014.
  • Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In ICLR, 2014.
  • LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 2015.
  • Lee, L., Eysenbach, B., Parisotto, E., Xing, E., Levine, S., and Salakhutdinov, R. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274, 2019.
  • Lehman, J. and Stanley, K. O. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary computation, 2011a.
  • Lehman, J. and Stanley, K. O. Novelty search and the problem with objectives. In Genetic programming theory and practice IX. Springer, 2011b.
  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In ICLR, 2016.
  • Liu, H., Trott, A., Socher, R., and Xiong, C. Competitive experience replay. In ICLR, 2019.
  • Lynch, C., Khansari, M., Xiao, T., Kumar, V., Tompson, J., Levine, S., and Sermanet, P. Learning latent plans from play. arXiv preprint arXiv:1903.01973, 2019.
  • Meyerson, E., Lehman, J., and Miikkulainen, R. Learning behavior characterizations for novelty search. In GECCO, 2016.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 2015.
  • Mohamed, S. and Rezende, D. J. Variational information maximisation for intrinsically motivated reinforcement learning. In NIPS, 2015.
  • Mouret, J.-B. and Clune, J. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.
  • Nachum, O., Gu, S. S., Lee, H., and Levine, S. Data-efficient hierarchical reinforcement learning. In NeurIPS, 2018.
  • NVIDIA. Nvidia tesla v100 gpu architecture. 2017.
  • Oh, J., Guo, Y., Singh, S., and Lee, H. Self-imitation learning. In ICML, 2018.
  • Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. In ICML, 2016.
  • Parr, R. and Russell, S. J. Reinforcement learning with hierarchies of machines. In NIPS, 1998.
  • Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
  • Pong, V. H., Dalal, M., Lin, S., Nair, A., Bahl, S., and Levine, S. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
  • Precup, D. Temporal abstraction in reinforcement learning. 2001.
  • Pugh, J. K., Soros, L. B., and Stanley, K. O. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 2016.
  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog, 2019.
  • Salge, C., Glackin, C., and Polani, D. Empowerment – an introduction. In Guided Self-Organization: Inception. Springer, 2014.
  • Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In ICML, 2015.
  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In ICML, 2015.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sharma, A., Gu, S., Levine, S., Kumar, V., and Hausman, K. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019.
  • Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 2017.
  • Sukhbaatar, S., Lin, Z., Kostrikov, I., Synnaeve, G., Szlam, A., and Fergus, R. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.
  • Sutton, R. S., Precup, D., and Singh, S. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 1999.
  • Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, O. X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. #Exploration: A study of count-based exploration for deep reinforcement learning. In NIPS, 2017.
  • Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In IROS, 2012.
  • Trott, A., Zhen, S., Xiong, C., and Socher, R. Sibling rivalry. In NeurIPS, 2019.
  • van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. In NeurIPS, 2017.
  • Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothorl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
  • Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 2019.
  • Warde-Farley, D., Van de Wiele, T., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V. Unsupervised control through non-parametric discriminative rewards. In ICLR, 2019.
Appendix: reward for known and unseen states
  • The following analysis considers a discriminator-based reward of the form r(s, z) = log ρπ(z|s) − log p(z), with N skills and a uniform prior p(z) = 1/N.
  • Maximum reward for known states. As observed by Sharma et al. (2019), this reward function encourages skills to be predictable (i.e. ρπ(z|s) → 1) and diverse (i.e. ρπ(zi|s) → 0, ∀ zi ≠ z), so the reward for a known state is at most rmax = log 1 + log N = log N.
  • Reward for previously unseen states. Note that ρπ(z|s) is not defined for unseen states; assuming a uniform posterior over skills in this undefined scenario, ρπ(z|s) = 1/N, ∀z, gives rnew = −log N + log N = 0 (15).
  • Alternatively, one could add a background class to the model in order to assign null probability to unseen states (Capdevila et al., 2018). This differs from the setup in previous works, which is why it was not considered in the analysis. Note, however, that under this alternative the agent would receive an even larger penalization for visiting unseen states, discouraging exploration further.
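As a quick numeric check of the two values above (a toy computation under the stated uniform-prior assumption, not code from the paper):

    import math

    N = 10                                 # number of skills (arbitrary choice)
    log_prior = math.log(1.0 / N)          # uniform prior p(z) = 1/N

    # Known, perfectly predictable state: posterior on the executed skill -> 1
    r_max = math.log(1.0) - log_prior      # = log N (about 2.30 for N = 10)

    # Previously unseen state under the uniform-posterior assumption: 1/N
    r_new = math.log(1.0 / N) - log_prior  # = 0, i.e. no incentive to explore

    print(r_max, r_new)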