# Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills

ICML, pp. 1317-1327, 2020.

Abstract:

Acquiring abilities in the absence of a task-oriented reward function is at the frontier of reinforcement learning research. This problem has been studied through the lens of empowerment, which draws a connection between option discovery and information theory. Information-theoretic skill discovery methods have garnered much interest fr...

Introduction

- Reinforcement learning (RL) algorithms have recently achieved outstanding goals thanks to advances in simulation (Todorov et al, 2012; Bellemare et al, 2013), efficient and scalable learning algorithms (Mnih et al, 2015; Lillicrap et al, 2016; Schulman et al, 2015; 2017; Espeholt et al, 2018), function approximation (LeCun et al, 2015; Good-

- Training aims to solve a particular task, relying on task-specific reward functions to measure progress and drive learning.
- This contrasts with how intelligent creatures learn in the absence of external supervisory signals, acquiring abilities in a task-agnostic manner by exploring the environment.
- In RL, analogous “unsupervised” methods are often aimed at learning generically useful behaviors for interacting within some environment, behaviors that may naturally accelerate learning once one or more downstream tasks become available

Highlights

- Since p(s) is fixed, Equation 7 can be maximized in a reinforcement learning-style setup with the reward function r(s, z) = log qφ(s|z), z ∼ p(z), where qφ(s|z) is given by the decoder of the variational autoencoder trained in the skill discovery stage
- We provide theoretical and empirical evidence that poor state space coverage is a predominant failure mode of existing skill discovery methods
- The information-theoretic objective requires access to unknown distributions, which these methods approximate with those induced by the policy. These approximations lead to pathological training dynamics where the agent obtains larger rewards by visiting already discovered states rather than exploring the environment
- We propose Explore, Discover and Learn (EDL), a novel option discovery approach that leverages a fixed distribution over states and variational inference techniques to break the dependency on the distributions induced by the policy
- Our experiments suggest that EDL discovers a meaningful latent space for skills even when tasked with learning a discrete set of options, whose latent codes can be combined in order to produce a richer set of behaviors

Methods

- Previous methods perform exploration and skill discovery at the same time, so that modifying p(z) inevitably involves exploring the environment from scratch.
- Skill learning. The final stage consists of training a policy πθ(s, z) that maximizes the mutual information between states and latent variables.
- EDL adopts the forward form of the mutual information.
- Since p(s) is fixed, Equation 7 can be maximized in a reinforcement learning-style setup with the reward function r(s, z) = log qφ(s|z), z ∼ p(z)
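
As a rough illustration of the skill-learning reward r(s, z) = log qφ(s|z), the sketch below assumes that the decoder maps each skill to a mean state and that qφ(s|z) is an isotropic Gaussian with a fixed standard deviation. The names `skill_reward`, `decoder_means` and `sigma`, as well as the toy values, are hypothetical and not taken from the paper; only the form of the reward comes from the text.

```python
import numpy as np

def skill_reward(state, z, decoder_means, sigma=1.0):
    """Return log q(s|z) for an isotropic Gaussian centred on the decoded state of skill z.

    Assumption: the decoder output for skill z is a single mean state,
    and the variance sigma**2 is fixed and shared across dimensions.
    """
    mu = decoder_means[z]
    dim = state.shape[-1]
    sq_dist = np.sum((state - mu) ** 2)
    log_norm = -0.5 * dim * np.log(2.0 * np.pi * sigma ** 2)
    return log_norm - 0.5 * sq_dist / sigma ** 2

# Hypothetical decoded states for three skills in a 2-D state space.
decoder_means = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])

# Sample a skill z ~ p(z), here assumed uniform.
rng = np.random.default_rng(0)
z = rng.integers(len(decoder_means))

# The reward is highest at the decoded state and decays with distance from it.
r_near = skill_reward(decoder_means[z], z, decoder_means)
r_far = skill_reward(decoder_means[z] + 3.0, z, decoder_means)
```

Under these assumptions the reward reduces to a negative squared distance to the decoded state plus a constant, so it does not depend on which states the policy has visited so far.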

Conclusion

- The authors provide theoretical and empirical evidence that poor state space coverage is a predominant failure mode of existing skill discovery methods.
- The authors propose Explore, Discover and Learn (EDL), a novel option discovery approach that leverages a fixed distribution over states and variational inference techniques to break the dependency on the distributions induced by the policy.
- This alternative approach optimizes the same objective derived from information theory used in previous methods.
- The authors' experiments suggest that EDL discovers a meaningful latent space for skills even when tasked with learning a discrete set of options, whose latent codes can be combined in order to produce a richer set of behaviors

- Table 1: Types of methods depending on the considered generative model and the version of the mutual information (MI) being maximized. Distributions denoted by ρ are induced by running the policy in the environment, whereas p is used for the true and potentially unknown ones. The dependency of existing methods on ρπ(s|z) causes pathological training dynamics by letting the agent influence the states considered in the optimization process. EDL relies on a fixed distribution over states p(s) to break this dependency and makes use of variational inference (VI) techniques to model p(s|z) and p(z|s)
- Table 2: Environment details
- Table 3: Hyperparameters used in the experiments. Values between brackets were used in the grid search, and tuned independently for each method
- Table 4: Hyperparameters used for exploration using SMM. Values between brackets were used in the grid search, and tuned independently for each environment. Training ends once the buffer is full
- Table 5: Hyperparameters used for training the VQ-VAE in the skill discovery stage. Values between brackets were used in the grid search, and tuned independently for each environment and exploration method
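
The Table 1 caption refers to two versions of the mutual information. As a small sanity check, the sketch below computes the forward form H(S) − H(S|Z) and the reverse form H(Z) − H(Z|S) on a toy joint distribution and shows that they coincide. The joint table and all variable names are made up for illustration and do not come from the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical joint distribution p(s, z): rows are 3 states, columns are 2 skills.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.25],
                  [0.10, 0.20]])
p_s = joint.sum(axis=1)  # marginal over states
p_z = joint.sum(axis=0)  # marginal over skills

# Conditional entropies computed from the conditional distributions.
h_s_given_z = sum(p_z[j] * entropy(joint[:, j] / p_z[j]) for j in range(joint.shape[1]))
h_z_given_s = sum(p_s[i] * entropy(joint[i, :] / p_s[i]) for i in range(joint.shape[0]))

forward_mi = entropy(p_s) - h_s_given_z  # I(S; Z) = H(S) - H(S|Z)
reverse_mi = entropy(p_z) - h_z_given_s  # I(S; Z) = H(Z) - H(Z|S)
```

Both forms give the same value when the true distributions are available; the methods in Table 1 differ in which form they optimize and in which distributions they approximate.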

Related work

- Option discovery. Temporally-extended high-level primitives, also known as options, are an important resource in the RL toolbox (Parr & Russell, 1998; Sutton et al, 1999; Precup, 2001). The process of defining options involves task-specific knowledge, which might be difficult to acquire and has motivated research towards methods that automatically discover such options. These include learning options while solving the desired task (Bacon et al, 2017), leveraging demonstrations (Fox et al, 2017), training goal-oriented low-level policies (Nachum et al, 2018), and meta-learning primitives from a distribution of related tasks (Frans et al, 2018). Skills discovered by information-theoretic methods such as the ones considered in this work have also been used as primitives for Hierarchical RL (Florensa et al, 2017; Eysenbach et al, 2019; Sharma et al, 2019).

- Intrinsic rewards. Agents need to encounter a reward before they can start learning, but this process might become highly inefficient in sparse reward setups when relying on standard exploration techniques (Osband et al, 2016). This issue can be alleviated by introducing intrinsic rewards, i.e. denser reward signals that can be automatically computed. These rewards are generally task-agnostic and might come from state visitation pseudo-counts (Bellemare et al, 2016; Tang et al, 2017), unsupervised control tasks (Jaderberg et al, 2017), learning to predict environment dynamics (Houthooft et al, 2016; Pathak et al, 2017; Burda et al, 2018), self-imitation (Oh et al, 2018), and self-play (Sukhbaatar et al, 2017; Liu et al, 2019).

Funding

- This work was partially supported by the Spanish Ministry of Science and Innovation and the European Regional Development Fund under contracts TEC2016-75976-R and TIN2015-65316-P, by the BSC-CNS Severo Ochoa program SEV-2015-0493, and grant 2017-SGR-1414 by Generalitat de Catalunya
- Víctor Campos was supported by Obra Social “la Caixa” through La Caixa-Severo Ochoa International Doctoral Fellowship program

Reference

- Achiam, J., Edwards, H., Amodei, D., and Abbeel, P. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299, 2018.
- Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., et al. Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
- Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight experience replay. In NIPS, 2017.
- Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In AAAI, 2017.
- Barber, D. and Agakov, F. V. The IM algorithm: a variational approach to information maximization. In NIPS, 2003.
- Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wainwright, M., Küttler, H., Lefrancq, A., Green, S., Valdes, V., Sadik, A., et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.
- Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In NIPS, 2016.
- Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013.
- Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
- Capdevila, J., Cerquides, J., and Torres, J. Mining urban events from the tweet stream through a probabilistic mixture model. Data mining and knowledge discovery, 2018.
- Chou, P.-W., Maturana, D., and Scherer, S. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In ICML, 2017.
- Conti, E., Madhavan, V., Such, F. P., Lehman, J., Stanley, K. O., and Clune, J. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint arXiv:1712.06560, 2017.
- Cully, A., Clune, J., Tarapore, D., and Mouret, J.-B. Robots that can adapt like animals. Nature, 2015.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
- Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. Go-Explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
- Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In ICML, 2018.
- Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. In ICLR, 2019.
- Florensa, C., Duan, Y., and Abbeel, P. Stochastic neural networks for hierarchical reinforcement learning. In ICLR, 2017.
- Florensa, C., Degrave, J., Heess, N., Springenberg, J. T., and Riedmiller, M. Self-supervised learning of image embedding for continuous control. arXiv preprint arXiv:1901.00943, 2019.
- Fox, R., Krishnan, S., Stoica, I., and Goldberg, K. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
- Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. Meta learning shared hierarchies. In ICLR, 2018.
- Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016.
- Gregor, K., Rezende, D. J., and Wierstra, D. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
- Guss, W. H., Codel, C., Hofmann, K., Houghton, B., Kuno, N., Milani, S., Mohanty, S., Liebana, D. P., Salakhutdinov, R., Topin, N., et al. The MineRL competition on sample efficient reinforcement learning using human priors. arXiv preprint arXiv:1904.10079, 2019.
- Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In ICML, 2017.
- Hazan, E., Kakade, S. M., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. In ICML, 2019.
- He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
- Henaff, O. J., Razavi, A., Doersch, C., Eslami, S., and Oord, A. v. d. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
- Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Dulac-Arnold, G., et al. Deep Q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.
- Ho, J. and Ermon, S. Generative adversarial imitation learning. In NIPS, 2016.
- Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. VIME: Variational information maximizing exploration. In NIPS, 2016.
- Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. In ICLR, 2017.
- Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 2019.
- Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
- Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2014.
- Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In ICLR, 2014.
- LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 2015.
- Lee, L., Eysenbach, B., Parisotto, E., Xing, E., Levine, S., and Salakhutdinov, R. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274, 2019.
- Lehman, J. and Stanley, K. O. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary computation, 2011a.
- Lehman, J. and Stanley, K. O. Novelty search and the problem with objectives. In Genetic programming theory and practice IX. Springer, 2011b.
- Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In ICLR, 2016.
- Liu, H., Trott, A., Socher, R., and Xiong, C. Competitive experience replay. In ICLR, 2019.
- Lynch, C., Khansari, M., Xiao, T., Kumar, V., Tompson, J., Levine, S., and Sermanet, P. Learning latent plans from play. arXiv preprint arXiv:1903.01973, 2019.
- Meyerson, E., Lehman, J., and Miikkulainen, R. Learning behavior characterizations for novelty search. In GECCO, 2016.
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 2015.
- Mohamed, S. and Rezende, D. J. Variational information maximisation for intrinsically motivated reinforcement learning. In NIPS, 2015.
- Mouret, J.-B. and Clune, J. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.
- Nachum, O., Gu, S. S., Lee, H., and Levine, S. Data-efficient hierarchical reinforcement learning. In NeurIPS, 2018.
- NVIDIA. NVIDIA Tesla V100 GPU architecture. 2017.
- Oh, J., Guo, Y., Singh, S., and Lee, H. Self-imitation learning. In ICML, 2018.
- Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. In ICML, 2016.
- Parr, R. and Russell, S. J. Reinforcement learning with hierarchies of machines. In NIPS, 1998.
- Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
- Pong, V. H., Dalal, M., Lin, S., Nair, A., Bahl, S., and Levine, S. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
- Precup, D. Temporal abstraction in reinforcement learning. 2001.
- Pugh, J. K., Soros, L. B., and Stanley, K. O. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 2016.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog, 2019.
- Salge, C., Glackin, C., and Polani, D. Empowerment – an introduction. In Guided Self-Organization: Inception. Springer, 2014.
- Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In ICML, 2015.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In ICML, 2015.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Sharma, A., Gu, S., Levine, S., Kumar, V., and Hausman, K. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019.
- Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 2017.
- Sukhbaatar, S., Lin, Z., Kostrikov, I., Synnaeve, G., Szlam, A., and Fergus, R. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.
- Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 1999.
- Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, O. X., Duan, Y., Schulman, J., DeTurck, F., and Abbeel, P. #Exploration: A study of count-based exploration for deep reinforcement learning. In NIPS, 2017.
- Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IROS, 2012.
- Trott, A., Zhen, S., Xiong, C., and Socher, R. Sibling rivalry. In NeurIPS, 2019.
- van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. In NeurIPS, 2017.
- Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
- Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019.
- Warde-Farley, D., Van de Wiele, T., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V. Unsupervised control through non-parametric discriminative rewards. In ICLR, 2019.
- Maximum reward for known states. As observed by Sharma et al. (2019), this reward function encourages skills to be predictable (i.e. ρπ(s|z) → 1) and diverse (i.e. ρπ(s|zi) → 0, ∀zi ≠ z): rmax = log 1 + log N = log N
- Reward for previously unseen states. Note that ρπ(z|s) is not defined for unseen states, and we will assume a uniform prior over skills in this undefined scenario, ρπ(z|s) = 1/N, ∀z: rnew = −log N + log N = 0 (Equation 15)
- Alternatively, one could add a background class to the model in order to assign null probability to unseen states (Capdevila et al., 2018). This differs from the setup in previous works, which is why it was considered in the analysis. However, note that the agent gets a larger penalization for visiting ...
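
The two reward values quoted above, log N and 0, can be reproduced numerically. The sketch below assumes a reward of the form r(s, z) = log ρπ(z|s) − log p(z) with a uniform prior p(z) = 1/N, which is consistent with those values but is stated here as an assumption. The function name `reverse_reward` and the choice N = 4 are hypothetical.

```python
import numpy as np

def reverse_reward(posterior_z_given_s, z, num_skills):
    """Assumed reward: log rho(z|s) - log p(z), with uniform p(z) = 1/num_skills."""
    return np.log(posterior_z_given_s[z]) + np.log(num_skills)

N = 4  # hypothetical number of skills

# Known state: the posterior is certain about skill 0.
known_posterior = np.array([1.0, 0.0, 0.0, 0.0])
r_max = reverse_reward(known_posterior, 0, N)   # log 1 + log N

# Unseen state: uniform posterior over skills, as assumed in the text.
unseen_posterior = np.full(N, 1.0 / N)
r_new = reverse_reward(unseen_posterior, 0, N)  # -log N + log N
```

With these assumptions a state that is already known and confidently attributed to a skill pays log N, while an unseen state pays 0, which illustrates why the agent is rewarded more for revisiting discovered states than for exploring.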
