RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments

ICLR, 2020.

TL;DR
We propose Rewarding Impact-Driven Exploration, an intrinsic reward bonus that encourages agents to explore actions that substantially change the state of the environment, as measured in a learned latent space

Abstract

Exploration in sparse reward environments remains one of the key challenges of model-free reinforcement learning. Instead of solely relying on extrinsic rewards provided by the environment, many state-of-the-art methods use intrinsic rewards to encourage exploration. However, we show that existing methods fall short in procedurally-generated environments...
Introduction
Highlights
  • Deep reinforcement learning (RL) is one of the most popular frameworks for developing agents that can solve a wide range of complex tasks (Mnih et al, 2016; Silver et al, 2016; 2017)
  • We propose Rewarding Impact-Driven Exploration (RIDE), a novel intrinsic reward for exploration in reinforcement learning that encourages the agent to take actions which result in impactful changes to its representation of the environment state
  • We present the results of Rewarding Impact-Driven Exploration in comparison to popular exploration methods, as well as an analysis of the learned policies and properties of the intrinsic reward generated by different methods. Figure 3 summarizes our results on various hard MiniGrid tasks
  • While other exploration bonuses seem effective on easier tasks and are able to learn optimal policies where IMPALA fails, the gap between our approach and the others widens as task difficulty increases
  • We propose Rewarding Impact-Driven Exploration (RIDE), an intrinsic reward bonus that encourages agents to explore actions that substantially change the state of the environment, as measured in a learned latent space (a minimal code sketch of this bonus follows this list)
  • Rewarding Impact-Driven Exploration has a number of desirable properties: it attracts agents to states where they can affect the environment, it provides a signal to agents even after training for a long time, and it is conceptually simple as well as compatible with other intrinsic or extrinsic rewards and any deep reinforcement learning algorithm
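The following is a minimal, illustrative sketch of how such an impact-driven bonus can be computed from a learned state embedding: the reward is the L2 distance between the embeddings of consecutive observations. The embedding network below is an untrained placeholder (the full method trains it with forward and inverse dynamics losses and further discounts the bonus by episodic state-visit counts, both omitted here), and the observation size is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class StateEmbedding(nn.Module):
    """Placeholder embedding network phi mapping observations to a latent space."""
    def __init__(self, obs_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256),
            nn.ELU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def impact_driven_bonus(phi, obs_t, obs_tp1):
    """Intrinsic reward proportional to how much the last action changed the
    agent's learned state representation (L2 distance in the latent space)."""
    with torch.no_grad():
        delta = phi(obs_tp1) - phi(obs_t)
    return torch.norm(delta, p=2, dim=-1)

# Example: a batch of 4 flattened observations of dimension 147
# (e.g., a 7x7x3 MiniGrid partial view; the size is an assumption).
phi = StateEmbedding(obs_dim=147)
obs_t, obs_tp1 = torch.rand(4, 147), torch.rand(4, 147)
r_int = impact_driven_bonus(phi, obs_t, obs_tp1)  # shape: (4,)
```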
Methods
  • The authors evaluate RIDE on procedurally-generated environments from MiniGrid, as well as on two existing singleton environments with high-dimensional observations used in prior work, and compare it against both standard RL and three commonly used intrinsic reward methods for exploration.
  • The first set of environments are procedurally-generated gridworlds in MiniGrid (Chevalier-Boisvert et al, 2018).
  • The authors consider three types of hard exploration tasks: MultiRoomNXSY, KeyCorridorS3R3, and ObstructedMaze2Dlh. In MiniGrid, the world is a partially observable grid of size N × N.
  • More details about the MiniGrid environment and tasks can be found in Appendix A.3; a short usage sketch follows this list.
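For context, the snippet below shows how one of these procedurally-generated MiniGrid tasks can be instantiated and stepped through via the standard gym interface. The environment ID and observation handling follow the public gym-minigrid package and are assumptions about the setup, not the authors' training code.

```python
import gym
import gym_minigrid  # registers the MiniGrid-* environment IDs

# KeyCorridorS3R3 is one of the hard-exploration tasks considered in the paper;
# the exact ID below follows gym-minigrid's naming convention (an assumption).
env = gym.make("MiniGrid-KeyCorridorS3R3-v0")

obs = env.reset()  # a new layout is procedurally generated for each episode
for _ in range(100):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    # obs["image"] is the agent's partial, egocentric view of the grid
    if done:
        obs = env.reset()  # the next episode uses a different random layout
env.close()
```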
Results
  • The authors present the results of RIDE in comparison to popular exploration methods, as well as an analysis of the learned policies and properties of the intrinsic reward generated by different methods. Figure 3 summarizes the results on various hard MiniGrid tasks.
  • The authors' results reveal that RIDE is more sample efficient than all the other exploration methods across all MiniGrid tasks considered here.
  • While other exploration bonuses seem effective on easier tasks and are able to learn optimal policies where IMPALA fails, the gap between RIDE and the other methods widens as task difficulty increases.
  • RIDE manages to solve some very challenging tasks on which the other methods fail to get any reward even after training on over 100M frames (Figure 3).
Conclusion
  • In this work, the authors propose Rewarding Impact-Driven Exploration (RIDE), an intrinsic reward bonus that encourages agents to explore actions that substantially change the state of the environment, as measured in a learned latent space.
  • One can make use of symbolic information to measure or characterize the agent's impact, consider longer-term effects of the agent's actions, or promote diversity among the kinds of changes the agent makes to the environment.
  • Another interesting avenue for future research is to develop algorithms that can distinguish between desirable and undesirable types of impact the agent can have in the environment, constraining the agent to act safely and avoid distractions.
  • The different kinds of impact might correspond to distinctive skills or low-level policies that a hierarchical controller could use to learn more complex policies or better exploration strategies
Summary
  • Introduction:

    Deep reinforcement learning (RL) is one of the most popular frameworks for developing agents that can solve a wide range of complex tasks (Mnih et al, 2016; Silver et al, 2016; 2017).
  • The use of intrinsic motivation has been proposed to encourage agents to learn about their environments even when extrinsic feedback is rarely provided (Schmidhuber, 1991b; 2010; Oudeyer et al, 2007; Oudeyer & Kaplan, 2009)
  • This type of exploration bonus emboldens the agent to visit new states (Bellemare et al, 2016; Burda et al, 2019b; Ecoffet et al, 2019) or to improve its knowledge and forward prediction of the world dynamics (Pathak et al, 2017; Burda et al, 2019a), and can be highly effective for learning in hard exploration games such as Montezuma's Revenge (Mnih et al, 2016).
Tables
  • Table 1: Mean intrinsic reward per action over 100 episodes on a random maze in MultiRoomN7S4
  • Table 2: Hyperparameters common to all experiments
  • Table 3: Mean intrinsic reward per action computed over 100 episodes on a random map from MultiRoomN12S10
  • Table 4: Mean intrinsic reward per action computed over 100 episodes on a random map from ObstructedMaze2Dlh
  • Table 5: Average return over 100 episodes on a version of MultiRoomN7S4 in which the colors of the walls and goals change with each episode. The models were trained until convergence on a set of 4 colors and tested on a held-out set of 2 colors.

A.9 Other practical insights
  • While developing this work, we also experimented with a few other variations of RIDE that did not work. First, we tried to use observations instead of learned state embeddings for computing the RIDE reward, but this was not able to solve any of the tasks. Using a common state representation for both the policy and the embeddings also proved to be ineffective.
Contributions
  • Proposes a novel type of intrinsic reward which encourages the agent to take actions that lead to significant changes in its learned state representation
  • Evaluates our method on multiple challenging procedurally-generated tasks in MiniGrid, as well as on tasks with high-dimensional observations used in prior work
  • Demonstrates that current exploration methods fall short in such environments, as they make strong assumptions about the environment or about the state space, or provide intrinsic rewards that can diminish quickly during training
  • Proposes Rewarding Impact-Driven Exploration, a novel intrinsic reward for exploration in RL that encourages the agent to take actions which result in impactful changes to its representation of the environment state
  • Our experiments show that RIDE outperforms state-of-the-art exploration methods in procedurally-generated environments
References
  • Joshua Achiam and Shankar Sastry. Surprise-based intrinsic motivation for deep reinforcement learning. CoRR, abs/1703.01732, 2017. URL http://arxiv.org/abs/1703.01732.
  • Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, and Nando de Freitas. Playing hard exploration games by watching youtube. In Advances in Neural Information Processing Systems, pp. 2930–2941, 2018.
  • Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. Deepmind lab. CoRR, abs/1612.03801, 2016. URL http://arxiv.org/abs/1612.03801.
  • Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
  • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Yuri Burda, Harrison Edwards, Deepak Pathak, Amos J. Storkey, Trevor Darrell, and Alexei A. Efros. Large-scale study of curiosity-driven learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019a. URL https://openreview.net/forum?id=rJNwDjAqYX.
  • Yuri Burda, Harrison Edwards, Amos J. Storkey, and Oleg Klimov. Exploration by random network distillation. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019b. URL https://openreview.net/forum?id=H1lJJnR5Ym.
  • Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018.
  • Jongwook Choi, Yijie Guo, Marcin Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi, and Honglak Lee. Contingency-aware exploration in reinforcement learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. URL https://openreview.net/forum?id=HyxGB2AcY7.
  • Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.07289.
  • Karl Cobbe, Oleg Klimov, Christopher Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 1282–1289, 2019. URL http://proceedings.mlr.press/v97/cobbe19a.html.
  • Nat Dilokthanakul, Christos Kaplanis, Nick Pawlowski, and Murray Shanahan. Feature control as intrinsic motivation for hierarchical reinforcement learning. IEEE transactions on neural networks and learning systems, 2019.
  • Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
  • Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1406–1415, 2018. URL http://proceedings.mlr.press/v80/espeholt18a.html.
  • Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. URL https://openreview.net/forum?id=SJx63jRqFm.
  • John Foley, Emma Tosch, Kaleigh Clary, and David Jensen. Toybox: Better atari environments for testing reinforcement learning agents. CoRR, abs/1812.02850, 2018. URL http://arxiv.org/abs/1812.02850.
  • Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=rywHCPkAW.
  • Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Hugo Larochelle, Matthew Botvinick, Yoshua Bengio, and Sergey Levine. Infobot: Transfer and exploration via the information bottleneck. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. URL https://openreview.net/forum?id=rJg8yhAqKm.
  • Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings, 2017. URL https://openreview.net/forum?id=Skc-Fo4Yg.
  • Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117, 2016.
  • Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=SJ6yPD5xg.
  • Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
  • Yacine Jernite, Kavya Srinet, Jonathan Gray, and Arthur Szlam. Craftassist instruction parsing: Semantic parsing for a minecraft assistant. CoRR, abs/1905.01978, 2019. URL http://arxiv.org/abs/1905.01978.
  • Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pp. 4246–4247, 2016. URL http://www.ijcai.org/Abstract/16/643.
  • Arthur Juliani, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Ervin Teng, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. Obstacle tower: A generalization challenge in vision, control, and planning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp. 2684–2691, 2019. doi: 10.24963/ijcai.2019/373. URL https://doi.org/10.24963/ijcai.2019/373.
  • Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. arXiv preprint arXiv:1806.10729, 2018.
  • Christian Kauten. Super Mario Bros for OpenAI Gym. GitHub, 2018. URL https://github.com/Kautenja/gym-super-mario-bros.
  • Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaskowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. IEEE, 2016.
  • Heinrich Küttler, Nantas Nardelli, Thibaut Lavril, Marco Selvatici, Viswanath Sivakumar, Tim Rocktäschel, and Edward Grefenstette. TorchBeast: A PyTorch Platform for Distributed RL. arXiv preprint arXiv:1910.03552, 2019. URL https://github.com/facebookresearch/torchbeast.
  • Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. Ai safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
  • Timothée Lesort, Natalia Díaz Rodríguez, Jean-François Goudou, and David Filliat. State representation learning for control: An overview. Neural Networks, 108:379–392, 2018. doi: 10.1016/j.neunet.2018.07.006. URL https://doi.org/10.1016/j.neunet.2018.07.006.
  • Daniel Ying-Jeh Little and Friedrich Tobias Sommer. Learning and exploration in action-perception loops. Frontiers in neural circuits, 7:37, 2013.
  • Marlos C Machado, Marc G Bellemare, and Michael Bowling. Count-based exploration with the successor representation. arXiv preprint arXiv:1807.11622, 2018a.
  • Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018b.
  • Kenneth Marino, Abhinav Gupta, Rob Fergus, and Arthur Szlam. Hierarchical RL using an ensemble of proprioceptive periodic policies. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. URL https://openreview.net/forum?id=SJz1x20cFQ.
  • Jarryd Martin, Suraj Narayanan Sasikumar, Tom Everitt, and Marcus Hutter. Count-based exploration in feature space for reinforcement learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 2471–2478, 2017. doi: 10.24963/ijcai.2017/344. URL https://doi.org/10.24963/ijcai.2017/344.
  • Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937, 2016.
  • Nirbhay Modhe, Prithvijit Chattopadhyay, Mohit Sharma, Abhishek Das, Devi Parikh, Dhruv Batra, and Ramakrishna Vedantam. Unsupervised discovery of decision states for transfer in reinforcement learning. CoRR, abs/1907.10580, 2019. URL http://arxiv.org/abs/1907.10580.
  • Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, and John Schulman. Gotta learn fast: A new benchmark for generalization in rl. arXiv preprint arXiv:1804.03720, 2018.
  • Brendan O’Donoghue, Ian Osband, Rémi Munos, and Volodymyr Mnih. The uncertainty bellman equation and exploration. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 3836–3845, 2018. URL http://proceedings.mlr.press/v80/o-donoghue18a.html.
  • Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pp. 4026–4034, 2016.
  • Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2721–2730. JMLR. org, 2017.
  • Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:6, 2009.
  • Pierre-Yves Oudeyer, Frédéric Kaplan, and Verena V Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
  • Pierre-Yves Oudeyer, Frederic Kaplan, et al. How can we define intrinsic motivation. In Proc. of the 8th Conf. on Epigenetic Robotics, volume 5, pp. 29–31, 2008.
  • Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Song. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282, 2018.
  • Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017.
  • Sébastien Racanière, Theophane Weber, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter W. Battaglia, Demis Hassabis, David Silver, and Daan Wierstra. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 5690–5701, 2017. URL http://papers.nips.cc/paper/7152-imagination-augmented-agents-for-deep-reinforcement-learning.
  • Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pp. 6550–6561, 2017.
  • Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1530–1538, 2015. URL http://proceedings.mlr.press/v37/rezende15.html.
  • Jürgen Schmidhuber. Curious model-building control systems. In Proc. international joint conference on neural networks, pp. 1458–1463, 1991a.
  • Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pp. 222–227, 1991b.
  • Jürgen Schmidhuber. Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2):173–187, 2006.
  • Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
  • David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
  • David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
  • Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
  • Christopher Stanton and Jeff Clune. Deep curiosity search: Intra-life exploration can improve performance on challenging deep reinforcement learning problems. arXiv preprint arXiv:1806.00553, 2018.
  • Susanne Still and Doina Precup. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148, 2012.
  • Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pp. 2753– 2762, 2017.
  • T. Tieleman and G. Hinton. RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, Technical Report, 2012.
  • Amy Zhang, Nicolas Ballas, and Joelle Pineau. A dissection of overfitting and generalization in continuous reinforcement learning. arXiv preprint arXiv:1806.07937, 2018a.
  • Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018b.
  • Jingwei Zhang, Niklas Wetzel, Nicolai Dorka, Joschka Boedecker, and Wolfram Burgard. Scheduled intrinsic drive: A hierarchical take on intrinsically motivated exploration. arXiv preprint arXiv:1903.07400, 2019.
Network architecture and preprocessing
  • All our models use the same network architecture for the policy and value networks. The input is passed through a sequence of three (for MiniGrid) or four (for the environments used by Pathak et al. (2017)) convolutional layers with 32 filters each, a kernel size of 3x3, a stride of 2, and padding of 1. An exponential linear unit (ELU; Clevert et al. (2016)) is used after each convolutional layer. The output of the last convolutional layer is fed into an LSTM with 256 units. Two separate fully connected layers are used to predict the value function and the action from the LSTM feature representation. A minimal PyTorch sketch of this architecture follows.
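The sketch below matches this description but is illustrative only: the input shape (a small channels-first observation), the number of actions, and all names are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class RecurrentActorCritic(nn.Module):
    """Sketch of the described architecture: 3 conv layers (32 filters, 3x3 kernel,
    stride 2, padding 1) with ELU activations, an LSTM with 256 units, and separate
    policy and value heads."""

    def __init__(self, in_channels=3, obs_size=7, num_actions=7, hidden=256):
        super().__init__()
        convs = []
        c = in_channels
        for _ in range(3):  # the description uses 4 conv layers for the pixel-based envs
            convs += [nn.Conv2d(c, 32, kernel_size=3, stride=2, padding=1), nn.ELU()]
            c = 32
        self.convs = nn.Sequential(*convs)
        with torch.no_grad():  # infer the flattened conv output size
            n_flat = self.convs(torch.zeros(1, in_channels, obs_size, obs_size)).numel()
        self.lstm = nn.LSTM(n_flat, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, num_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs, lstm_state=None):
        # obs: (batch, channels, height, width); a sequence dim of 1 is added for the LSTM
        feat = self.convs(obs).flatten(start_dim=1).unsqueeze(1)
        out, lstm_state = self.lstm(feat, lstm_state)
        out = out.squeeze(1)
        return self.policy_head(out), self.value_head(out), lstm_state

# Example forward pass on a batch of 8 random 7x7x3 observations (channels first).
model = RecurrentActorCritic()
logits, value, state = model(torch.rand(8, 3, 7, 7))
```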
  • For the singleton environments used in prior work, the agents are trained using visual inputs that are pre-processed similarly to Mnih et al. (2016). The RGB images are converted into gray-scale and resized to 42 × 42. The input given to both the policy and the state representation networks consists of the current frame concatenated with the previous three frames. In order to reduce overfitting, during training we use an action repeat of four. At inference time, we sample from the policy without any action repeats. A rough sketch of this preprocessing follows.
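The wrapper below is a rough sketch of this preprocessing, assuming OpenCV for grayscale conversion and resizing and a simple deque-based frame stack; the wrapper name and details are illustrative rather than the authors' implementation.

```python
from collections import deque

import cv2
import gym
import numpy as np

class GrayscaleResizeStack(gym.Wrapper):
    """Convert RGB frames to grayscale, resize them to 42x42, stack the last 4
    frames, and repeat each action 4 times during training (illustrative)."""

    def __init__(self, env, size=42, num_stack=4, action_repeat=4):
        super().__init__(env)
        self.size = size
        self.action_repeat = action_repeat
        self.frames = deque(maxlen=num_stack)

    def _process(self, frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        return cv2.resize(gray, (self.size, self.size), interpolation=cv2.INTER_AREA)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        frame = self._process(obs)
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(self.frames, axis=0)  # shape: (4, 42, 42)

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.action_repeat):  # set action_repeat=1 at inference time
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        self.frames.append(self._process(obs))
        return np.stack(self.frames, axis=0), total_reward, done, info
```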