# Novelty Search in representational space for sample efficient exploration

NeurIPS 2020.

Abstract:

We present a new approach for efficient exploration that leverages a low-dimensional encoding of the environment, learned with a combination of model-based and model-free objectives. Our approach uses intrinsic rewards based on the distance to nearest neighbors in the low-dimensional representational space to gauge novelty.

Introduction

- In order to solve a task efficiently in Reinforcement Learning (RL), one of the main challenges is to gather informative experiences via an efficient exploration of the state space.
- An issue occurs when measuring novelty directly from the raw observations, as some information in pixel space may be irrelevant.
- In this case, if an agent wants to explore its state space efficiently, it should focus only on meaningful and novel information.
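The nearest-neighbor novelty score described in the abstract can be sketched as follows. The function name `novelty_reward`, the Euclidean metric, and the default `k = 5` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def novelty_reward(z, memory, k=5):
    """Intrinsic reward for an encoded state z: the mean Euclidean distance
    to its k nearest neighbors among previously visited encoded states.
    States far from everything seen so far score as more novel."""
    if len(memory) == 0:
        return 0.0  # nothing to compare against yet
    dists = np.linalg.norm(np.asarray(memory) - np.asarray(z), axis=1)
    k = min(k, len(dists))
    return float(np.mean(np.sort(dists)[:k]))
```

A state near previously visited encodings receives a smaller bonus than one far from all of them, which is what drives the agent toward unexplored regions.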

Highlights

- In order to solve a task efficiently in Reinforcement Learning (RL), one of the main challenges is to gather informative experiences via an efficient exploration of the state space.
- We formulate the task of dynamics learning in Model-based RL (MBRL) through the Information Bottleneck principle.
- We present methods to optimize the Information Bottleneck (IB) equation through low-dimensional abstract representations of state.
- We further develop a novelty score based on these learnt representations, which we leverage as an intrinsic reward to enable efficient exploration.
- Using this novelty score with a combination of model-based and model-free approaches for planning, we show more efficient exploration across multiple environments with our learnt representations and novelty rewards.
- The model can over-generalize, with the consequence that the low-dimensional representation loses information that is crucial for exploring the entire state space.
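For reference, the Information Bottleneck principle the highlights invoke can be stated in the generic form of Tishby et al. (2000): $S$ denotes the raw observation, $Z$ its compressed encoding, $Y$ the prediction target (for dynamics learning, e.g. the next state), and $\beta$ trades off compression against prediction. The specific mapping of variables here is a reading of the highlights above, not the paper's exact notation:

```latex
\min_{p(z \mid s)} \; I(S; Z) \;-\; \beta \, I(Z; Y)
```

Minimizing $I(S;Z)$ compresses away irrelevant pixel-level detail, while the $\beta\,I(Z;Y)$ term preserves exactly the information needed to predict the dynamics.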

Methods

- The authors conduct experiments on environments of varying difficulty. All experiments use a training scheme where the authors first train parameters to converge on an accurate representation of the already experienced transitions before taking an environment step.
- The authors consider two 21 × 21 versions of the grid-world environment (Figure 6 in the Appendix).
- The first is an open labyrinth; the second is a similarly sized grid-world split into four connected rooms.
- In these environments the action space A is the set of four cardinal directions.
- The authors use two metrics to gauge exploration in these environments: the ratio of states visited only once, and the proportion of total states visited.
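The two exploration metrics in the last bullet can be computed directly from per-state visit counts. Whether the "ratio of states visited only once" is taken over visited states or over all states is not specified here, so the denominator used below (visited states) is an assumption:

```python
import numpy as np

def exploration_metrics(visit_counts):
    """Given an array of per-state visit counts, return
    (ratio of visited states that were visited exactly once,
     proportion of all states visited at least once)."""
    counts = np.asarray(visit_counts)
    n_visited = int((counts > 0).sum())
    once_ratio = float((counts == 1).sum()) / n_visited if n_visited else 0.0
    coverage = n_visited / counts.size
    return once_ratio, coverage
```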

Conclusion

- The authors formulate the task of dynamics learning in MBRL through the Information Bottleneck principle.
- Computing the novelty heuristic over every stored observation becomes expensive as the buffer grows; a possible solution would be a sampling scheme that draws a fixed number of observations for the calculation.
- Another issue that has arisen from using very low-dimensional space to represent state is generalization.
- With the theory and methods developed in this paper, the authors hope to see future work on larger tasks with more complex environment dynamics.

Summary

## Objectives:

The authors' goal is to learn an optimally predictive model of the environment. While Bellemare et al. (2016) address this issue with density estimation using pseudo-counts directly from the high-dimensional observations, the authors aim to estimate a function of novelty in the learnt lower-dimensional representation space.

- Table 1: Number of environment steps necessary to reach the goal state in the Acrobot and the multi-step maze environments (lower is better). Results are averaged over 5 trials for both experiments. Best results are in bold.

Related work

- The proposed exploration strategy falls under the category of directed exploration (Thrun, 1992), which makes use of past interactions with the environment to guide the discovery of new states. This work is inspired by the Novelty Search algorithm (Lehman and Stanley, 2011), which uses a nearest-neighbor scoring approach to gauge novelty in policy space; our approach leverages this scoring to traverse dynamics space, which we motivate theoretically. Exploration strategies have been investigated with both model-free and model-based approaches. In Bellemare et al. (2016) and Ostrovski et al. (2017), a model-free algorithm provides the notion of novelty through a pseudo-count from an arbitrary density model, which estimates how many times an action has been taken in similar states. Recently, Taiga et al. (2020) performed a thorough comparison of bonus-based exploration methods in model-free RL and showed that architectural changes may matter more to agent performance (based on extrinsic rewards) than differing exploration strategies. Instead of focusing solely on reward-based metrics (as most other work on exploration in RL does), we measure an agent's ability to explore through the number of steps taken to reach a given hard-to-reach state in the model-based setting, using orders of magnitude fewer steps than model-free methods.

Study subjects and analysis


- Figure: Illustrations of the Acrobot and multi-step goal maze environments. (a) Left: the Acrobot environment in one configuration of its start state. (a) Right: one configuration of the ending state; the environment finishes when the second arm passes the solid black line. (b) Left: the passageway to the west portion of the environment is blocked before the key (black) is collected. (b) Right: the passageway is traversable after collecting the key, and the reward (red) is then available; the environment terminates after collecting the reward.
- Figure: (a) Visualization of 100 observations (4 frames per observation) of Montezuma's Revenge game play. The representation learnt with nX = 5 is visualized with t-SNE (van der Maaten and Hinton, 2008) in 2 dimensions. Labels on the top-left of game frames correspond to labels of states in the lower-dimensional space; transitions are shown by shaded lines. (b) Original resized game frames visualized using t-SNE with the same parameters; transitions are not shown for raw observations. These are preliminary results for learning abstract representations on a more complex task, Montezuma's Revenge: comparing the two panels, temporally close states lie closer together in the learnt low-dimensional representational space than in pixel space.
- Figure: An example of the state counts of the agent in the open labyrinth with d = 5 step planning. The title of each subplot denotes the number of steps taken, and the brightness of each point is proportional to its state-visitation count. The bright spots that appear after 200 counts correspond to the agent requiring a few trials to learn the dynamics of the labyrinth walls.

Reference

- Abel, D., Salvatier, J., Stuhlmuller, A., and Evans, O. (2017). Agent-agnostic human-in-the-loop reinforcement learning. arXiv preprint arXiv:1701.04079.
- Achiam, J. and Sastry, S. (2017). Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning. arXiv e-prints, page arXiv:1703.01732.
- Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. arXiv preprint arXiv:1606.01868.
- Borodachov, S., Hardin, D., and Saff, E. (2019). Discrete Energy on Rectifiable Sets.
- Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym.
- Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. (2018a). Large-Scale Study of Curiosity-Driven Learning. arXiv e-prints, page arXiv:1808.04355.
- Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2018b). Exploration by Random Network Distillation. arXiv e-prints, page arXiv:1810.12894.
- Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., and Levine, S. (2017). Combining model-based and model-free updates for trajectory-centric reinforcement learning.
- Chentanez, N., Barto, A. G., and Singh, S. P. (2005). Intrinsically motivated reinforcement learning. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17, pages 1281–1288. MIT Press.
- Chiappa, S., Racaniere, S., Wierstra, D., and Mohamed, S. (2017). Recurrent environment simulators. arXiv preprint arXiv:1704.02254.
- Cover, T. and Thomas, J. (2012). Elements of Information Theory. Wiley.
- Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624.
- de Bruin, T., Kober, J., Tuyls, K., and Babuska, R. (2018). Integrating state representation learning into deep reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1394–1401.
- Francois-Lavet, V., Bengio, Y., Precup, D., and Pineau, J. (2018). Combined reinforcement learning via abstract representations. CoRR, abs/1809.04506.
- Garcia, C. E., Prett, D. M., and Morari, M. (1989). Model predictive control: Theory and practice a survey. Autom., 25:335–348.
- García, J. and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 16(1):1437–1480.
- Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M. G. (2019). Deepmdp: Learning continuous latent space models for representation learning. arXiv preprint arXiv:1906.02736.
- Gregor, K., Rezende, D. J., and Wierstra, D. (2016). Variational intrinsic control. arXiv preprint arXiv:1611.07507.
- Ha, D. and Schmidhuber, J. (2018). Recurrent World Models Facilitate Policy Evolution. arXiv e-prints, page arXiv:1809.01999.
- Haber, N., Mrowca, D., Fei-Fei, L., and Yamins, D. L. (2018). Learning to play with intrinsicallymotivated self-aware agents. arXiv preprint arXiv:1802.07442.
- Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. (2018). Learning Latent Dynamics for Planning from Pixels. arXiv e-prints, page arXiv:1811.04551.
- Hester, T. and Stone, P. (2012). Intrinsically motivated model learning for a developing curious agent. 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL), pages 1–6.
- Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117.
- Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99 – 134.
- Lehman, J. and Stanley, K. O. (2011). Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223. PMID: 20868264.
- Loftsgaarden, D. O. and Quesenberry, C. P. (1965). A nonparametric estimate of a multivariate density function. Ann. Math. Statist., 36(3):1049–1051.
- Mandel, T., Liu, Y.-E., Brunskill, E., and Popovic, Z. (2017). Where to add actions in human-inthe-loop reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence.
- Mohamed, S. and Rezende, D. J. (2015). Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 2125–2133.
- Oh, J., Singh, S., and Lee, H. (2017). Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128.
- Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. (2016). Deep Exploration via Bootstrapped DQN. arXiv e-prints, page arXiv:1602.04621.
- Osband, I., Van Roy, B., Russo, D., and Wen, Z. (2017). Deep Exploration via Randomized Value Functions. arXiv e-prints, page arXiv:1703.07608.
- Ostrovski, G., Bellemare, M. G., Oord, A. v. d., and Munos, R. (2017). Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310.
- Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017). Curiosity-driven exploration by selfsupervised prediction. In International Conference on Machine Learning (ICML), volume 2017.
- Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition.
- Salge, C., Glackin, C., and Polani, D. (2014). Changing the environment based on empowerment as intrinsic motivation. Entropy, 16(5):2789–2819.
- Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., and Gelly, S. (2018). Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274.
- Schmidhuber, J. (1990). A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the First International Conference on Simulation of Adaptive Behavior on From Animals to Animats, pages 222–227, Cambridge, MA, USA. MIT Press.
- Schmidhuber, J. (2010). Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247.
- Shyam, P., Jaskowski, W., and Gomez, F. (2018). Model-Based Active Exploration. arXiv e-prints, page arXiv:1810.12162.
- Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. (2016). The Predictron: End-To-End Learning and Planning. arXiv e-prints, page arXiv:1612.08810.
- Stadie, B. C., Levine, S., and Abbeel, P. (2015). Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814.
- Stepleton, T. (2017). The pycolab game engine. https://github.com/deepmind/pycolab.
- Still, S. (2009). Information-theoretic approach to interactive learning. EPL (Europhysics Letters), 85:28005.
- Sutton, R. S., Precup, D., and Singh, S. (1999). Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 112(1–2):181–211.
- Taiga, A. A., Fedus, W., Machado, M. C., Courville, A., and Bellemare, M. G. (2020). On bonus based exploration methods in the arcade learning environment. In International Conference on Learning Representations.
- Tamar, A., Levine, S., Abbeel, P., Wu, Y., and Thomas, G. (2016). Value iteration networks. In Advances in Neural Information Processing Systems, pages 2146–2154.
- Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning. arXiv e-prints, page arXiv:1611.04717.
- Thrun, S. B. (1992). Efficient exploration in reinforcement learning.
- Tishby, N., Pereira, F. C., and Bialek, W. (2000). The information bottleneck method. arXiv e-prints, page physics/0004057.
- van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
- van Hasselt, H., Guez, A., and Silver, D. (2015). Deep Reinforcement Learning with Double Q-learning. arXiv e-prints, page arXiv:1509.06461.
- Wang, T. and Isola, P. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere.
- Similarly to previous works (e.g. Oh et al., 2017; Chebotar et al., 2017), we use a combination of model-based planning and model-free Q-learning to obtain a good policy. We calculate rollout estimates of next states based on our transition model τ and sum up the corresponding rewards, denoted r : X × A → [0, Rmax], which can be a combination of both intrinsic and extrinsic rewards. We calculate expected returns based on the discounted rewards of our d-depth rollouts:
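The truncated return estimate above can be sketched as a recursive d-depth expansion of the learnt transition model, bootstrapped by the model-free Q-function at the leaves. Here `transition`, `reward`, and `q_value` are placeholder callables for the learnt models, and the deterministic max-over-actions backup is an assumption; the paper's exact estimator may differ:

```python
def rollout_return(z, transition, reward, q_value, actions, depth, gamma=0.95):
    """Expected-return estimate for encoded state z: expand the learnt
    transition model for `depth` steps, summing discounted (intrinsic +
    extrinsic) rewards, and bootstrap with the Q-function at the leaves."""
    if depth == 0:
        return max(q_value(z, a) for a in actions)
    return max(
        reward(z, a)
        + gamma * rollout_return(transition(z, a), transition, reward,
                                 q_value, actions, depth - 1, gamma)
        for a in actions
    )
```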
