Leveraging Procedural Generation to Benchmark Reinforcement Learning

Karl Cobbe
Christopher Hesse
Jacob Hilton
John Schulman

ICML 2020, pp. 2048-2056.


Abstract:

In this report, we introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We believe that the community will benefit from increased access to high quality training environments, and we provide detailed experimental protocols for using these environments.

Introduction
  • Generalization remains one of the most fundamental challenges in deep reinforcement learning.
  • In several recent studies (Zhang et al., 2018c; Cobbe et al., 2019; Justesen et al., 2018; Juliani et al., 2019), agents exhibit the capacity to overfit to remarkably large training sets.
  • This evidence raises the possibility that overfitting pervades classic benchmarks like the Arcade Learning Environment (ALE) (Bellemare et al., 2013), which has long served as a gold standard in RL.
  • For each game, the question must be asked: are agents robustly learning a relevant skill, or are they approximately memorizing specific trajectories? (A sketch of the train/test evaluation protocol that probes this follows below.)
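To make this question concrete: procedurally generated levels allow agents to be trained on a fixed set of level seeds and then evaluated on held-out levels. The snippet below is a minimal sketch of that protocol, assuming the procgen package's gym registration and the classic gym API (reset returns an observation, step returns a 4-tuple); the 500-level training set, the random policy, and the mean_return helper are illustrative placeholders, not the paper's exact setup.

    import gym
    import numpy as np

    def make_env(num_levels, start_level=0):
        # num_levels=0 samples from the full (effectively unbounded) level distribution;
        # a finite num_levels restricts the environment to that many fixed level seeds.
        return gym.make(
            "procgen:procgen-coinrun-v0",
            num_levels=num_levels,
            start_level=start_level,
            distribution_mode="hard",
        )

    def mean_return(env, policy, episodes=10):
        returns = []
        for _ in range(episodes):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                obs, reward, done, info = env.step(policy(obs))
                total += reward
            returns.append(total)
        return float(np.mean(returns))

    def random_policy(obs):
        return np.random.randint(15)  # every Procgen game uses a 15-action space

    train_env = make_env(num_levels=500)  # illustrative fixed training set of 500 level seeds
    test_env = make_env(num_levels=0)     # held-out evaluation: the full level distribution
    print(mean_return(train_env, random_policy), mean_return(test_env, random_policy))

A large gap between performance on the training levels and on the full distribution is exactly the memorization symptom the question above asks about.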
Highlights
  • Generalization remains one of the most fundamental challenges in deep reinforcement learning
  • In several recent studies (Zhang et al., 2018c; Cobbe et al., 2019; Justesen et al., 2018; Juliani et al., 2019), agents exhibit the capacity to overfit to remarkably large training sets. This evidence raises the possibility that overfitting pervades classic benchmarks like the Arcade Learning Environment (ALE) (Bellemare et al., 2013), which has long served as a gold standard in RL
  • To provide a point of comparison, we evaluate our Procgen-tuned implementation of PPO on the Arcade Learning Environment, and we achieve competitive performance
  • We investigate how scaling model size impacts both sample efficiency and generalization in RL
  • We find that larger architectures significantly improve both sample efficiency and generalization (a sketch of such a width-scaled model follows this list)
  • We've designed Procgen Benchmark to help the community contend with this challenge
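The scaling results above refer to widening an IMPALA-style convolutional encoder (Espeholt et al., 2018). The following PyTorch sketch shows one way such a width-scaled encoder can be written; it is not the authors' code, and the width_scale argument, base channel counts, and 64x64 input assumption are illustrative.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv0 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):
            out = self.conv0(torch.relu(x))
            out = self.conv1(torch.relu(out))
            return x + out

    class ImpalaEncoder(nn.Module):
        """IMPALA-style encoder whose channel counts are multiplied by width_scale."""

        def __init__(self, width_scale=1, base_channels=(16, 32, 32)):
            super().__init__()
            blocks, in_ch = [], 3
            for ch in (c * width_scale for c in base_channels):
                blocks += [
                    nn.Conv2d(in_ch, ch, kernel_size=3, padding=1),
                    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
                    ResidualBlock(ch),
                    ResidualBlock(ch),
                ]
                in_ch = ch
            self.stack = nn.Sequential(*blocks)
            # 64x64 inputs are downsampled to 8x8 by the three stride-2 pooling layers.
            self.fc = nn.Linear(in_ch * 8 * 8, 256)

        def forward(self, x):  # x: (batch, 3, 64, 64), values in [0, 1]
            h = torch.relu(self.stack(x))
            return torch.relu(self.fc(h.flatten(start_dim=1)))

    small, large = ImpalaEncoder(width_scale=1), ImpalaEncoder(width_scale=4)
    print(sum(p.numel() for p in small.parameters()), sum(p.numel() for p in large.parameters()))

Comparing parameter counts at width_scale=1 and width_scale=4 makes the size difference between the small and large models explicit.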
Results
  • It is notable that the small Nature-CNN model (the convolutional architecture from Mnih et al., 2015) almost completely fails to train.
  • These results align with those of Cobbe et al. (2019), and the authors establish that this trend holds across many diverse environments.
  • PPO performs much more consistently across the benchmark, though Rainbow offers a significant improvement in several environments.
  • The authors are not presently able to diagnose the instability that leads to Rainbow's low performance in some environments, though they consider this an interesting avenue for further research.
Conclusion
  • Training agents capable of generalizing across environments remains one of the greatest challenges in reinforcement learning.
  • The authors have designed Procgen Benchmark to help the community contend with this challenge.
  • The intrinsic diversity within level distributions makes this benchmark ideal for evaluating both generalization and sample efficiency in RL.
  • The authors expect many insights gleaned from this benchmark to apply in more complex settings, and they look forward to leveraging these environments to design more capable and efficient algorithms.
Related work
  • Many recent RL benchmarks grapple with generalization in different ways. The Sonic benchmark (Nichol et al., 2018) was designed to measure generalization in RL by separating levels of the Sonic the Hedgehog™ video game into training and test sets. However, RL agents struggled to generalize from the few available training levels, and progress was hard to measure. The CoinRun environment (Cobbe et al., 2019) addressed this concern by procedurally generating large training and test sets to better measure generalization. CoinRun serves as the inaugural environment in Procgen Benchmark.

    The General Video Game AI (GVG-AI) framework (Perez-Liebana et al., 2018) has also encouraged the use of procedural generation in deep RL. Using 4 procedurally generated environments based on classic video games, Justesen et al. (2018) measured generalization across different level distributions, finding that agents strongly overfit to their particular training set. Environments in Procgen Benchmark are designed in a similar spirit, with two of the environments (Miner and Leaper) drawing direct inspiration from this work.
References
  • J. Achiam, A. Ray, and D. Amodei. Safety Gym. https://openai.com/blog/safety-gym/, 2019.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013.
  • B. Beyret, J. Hernández-Orallo, L. Cheke, M. Halina, M. Shanahan, and M. Crosby. The Animal-AI environment: Training and testing animal-like artificial cognition. 2019.
  • K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman. Quantifying generalization in reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, California, USA, pages 1282–1289, 2019. URL http://proceedings.mlr.press/v97/cobbe19a.html.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. CoRR, abs/1802.01561, 2018.
  • J. Farebrother, M. C. Machado, and M. Bowling. Generalization and regularization in DQN. CoRR, abs/1810.00123, 2018. URL http://arxiv.org/abs/1810.00123.
  • X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
  • M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • L. Johnson, G. N. Yannakakis, and J. Togelius. Cellular automata for real-time generation of infinite cave levels. In Proceedings of the 2010 Workshop on Procedural Content Generation in Games. ACM, 2010.
  • A. Juliani, A. Khalifa, V.-P. Berges, J. Harper, H. Henry, A. Crespi, J. Togelius, and D. Lange. Obstacle Tower: A generalization challenge in vision, control, and planning. arXiv preprint arXiv:1902.01378, 2019.
  • N. Justesen, R. R. Torrado, P. Bontrager, A. Khalifa, J. Togelius, and S. Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. CoRR, abs/1806.10729, 2018.
  • D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7:48–50, 1956.
  • K. Lee, K. Lee, J. Shin, and H. Lee. A simple randomization technique for generalization in deep reinforcement learning. arXiv preprint arXiv:1910.05396, 2019.
  • M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. J. Hausknecht, and M. Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • A. Nichol, V. Pfau, C. Hesse, O. Klimov, and J. Schulman. Gotta learn fast: A new benchmark for generalization in RL. CoRR, abs/1804.03720, 2018. URL http://arxiv.org/abs/1804.03720.
  • I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepesvari, S. Singh, B. Van Roy, R. Sutton, D. Silver, and H. van Hasselt. Behaviour suite for reinforcement learning. 2019.
  • C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song. Assessing generalization in deep reinforcement learning. CoRR, abs/1810.12282, 2018.
  • D. Perez-Liebana, J. Liu, A. Khalifa, R. D. Gaina, J. Togelius, and S. M. Lucas. General video game AI: A multi-track framework for evaluating agents, games and content generation algorithms. arXiv preprint arXiv:1802.10363, 2018.
  • V. Pfau, A. Nichol, C. Hesse, L. Schiavo, J. Schulman, and O. Klimov. Gym Retro. https://openai.com/blog/gym-retro/, 2018.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
  • T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning, 2019. URL https://github.com/rlworkgroup/metaworld.
  • A. Zhang, N. Ballas, and J. Pineau. A dissection of overfitting and generalization in continuous reinforcement learning. CoRR, abs/1806.07937, 2018a.
  • A. Zhang, Y. Wu, and J. Pineau. Natural environment benchmarks for reinforcement learning. arXiv preprint arXiv:1811.06032, 2018b.
  • C. Zhang, O. Vinyals, R. Munos, and S. Bengio. A study on overfitting in deep reinforcement learning. CoRR, abs/1804.06893, 2018c. URL http://arxiv.org/abs/1804.06893.
Environment and protocol details
  • In all environments, procedural generation controls the selection of game assets and backgrounds, though some environments include a more diverse pool of assets and backgrounds than others. When procedural generation must place entities, it generally samples from the uniform distribution over valid locations, occasionally subject to game-specific constraints. Several environments use cellular automata (Johnson et al., 2010) to generate diverse level layouts (a minimal cellular-automaton sketch appears after these notes).
  • Procedural generation controls the level layout by generating mazes using Kruskal's algorithm (Kruskal, 1956) and then removing walls until no dead ends remain (a maze-generation sketch appears after these notes). The large stars are constrained to spawn in different quadrants. Initial enemy spawn locations are randomly selected.
  • Procedural generation controls the level layout by generating mazes using Kruskal's algorithm (Kruskal, 1956), uniformly ranging in size from 3x3 to 25x25.
  • Procedural generation controls the level layout by generating mazes using Kruskal's algorithm (Kruskal, 1956). Locks and keys are randomly placed, subject to solvability constraints.
  • To better understand the strengths and limitations of current RL algorithms, it is valuable to have environments which isolate critical axes of performance. Osband et al. (2019) recently proposed seven core RL capabilities to profile with environments in bsuite. We focus our attention on three of these core capabilities: generalization, exploration, and memory. Among these, Procgen Benchmark contributes most directly to the evaluation of generalization, as we have already discussed at length. In this section, we describe how Procgen environments can also shed light on the core capabilities of exploration and memory.
  • The 8 environments that specifically support the evaluation of exploration are CoinRun, CaveFlyer, Leaper, Jumper, Maze, Heist, Climber, and Ninja. For each environment, we handpick a level seed that presents a significant exploration challenge. Instructions for training on these specific seeds can be found at https://github.com/openai/train-procgen (see the environment-creation sketch after these notes). On these levels, a random agent is extraordinarily unlikely to encounter any reward. For this reason, our baseline PPO implementation completely fails to train, achieving a mean return of 0 in all environments after 200M timesteps of training.
  • The 6 environments that specifically support the evaluation of memory are CaveFlyer, Dodgeball, Miner, Jumper, Maze, and Heist. In this setting, we modify the environments as follows. In all environments we increase the world size. In CaveFlyer and Jumper, we remove logic in level generation that prunes away paths which do not lead to the goal. In Dodgeball, Miner, Maze, and Heist, we make the environments partially observable by restricting observations to a small patch of space surrounding the agent. We note that CaveFlyer and Jumper were already partially observable. With these changes, agents can reliably solve levels only by utilizing memory. Instructions for training environments in memory mode can be found at https://github.com/openai/train-procgen.
  • We use the Adam optimizer (Kingma and Ba, 2014) in all experiments.
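Several of the environment notes above mention cellular automata for level layout. The sketch below is a minimal, illustrative cellular-automaton cave generator in the spirit of Johnson et al. (2010); the fill probability, the 5-neighbour smoothing rule, and the iteration count are assumptions, not parameters taken from any particular Procgen environment.

    import numpy as np

    def generate_cave(height=25, width=25, fill_prob=0.45, steps=4, seed=0):
        rng = np.random.default_rng(seed)
        grid = rng.random((height, width)) < fill_prob        # True = wall, False = open floor
        for _ in range(steps):
            padded = np.pad(grid, 1, constant_values=True)    # treat the border as solid rock
            # Count wall neighbours in the 3x3 Moore neighbourhood, excluding the cell itself.
            neighbours = sum(
                padded[1 + dy : 1 + dy + height, 1 + dx : 1 + dx + width].astype(int)
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            )
            grid = neighbours >= 5                            # smoothing rule: crowded cells become walls
        return grid

    cave = generate_cave()
    print("\n".join("".join("#" if wall else "." for wall in row) for row in cave))

A few smoothing iterations turn the initial random noise into connected, organic-looking caverns, which is what makes this technique attractive for diverse layouts.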
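Several environments above build their layouts from mazes generated with randomized Kruskal's algorithm (Kruskal, 1956). The sketch below grows such a spanning-tree maze over a cell grid using a union-find structure; the grid size and return format are illustrative, and the extra wall removal that eliminates dead ends is only noted in a comment.

    import random

    def kruskal_maze(cells_w=12, cells_h=12, seed=0):
        """Return the set of walls removed to form a spanning tree over the cell grid."""
        rng = random.Random(seed)
        parent = list(range(cells_w * cells_h))

        def find(i):  # union-find with path halving
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # Every wall between two orthogonally adjacent cells is a candidate for removal.
        walls = [((x, y), (x + 1, y)) for y in range(cells_h) for x in range(cells_w - 1)]
        walls += [((x, y), (x, y + 1)) for y in range(cells_h - 1) for x in range(cells_w)]
        rng.shuffle(walls)

        removed = set()
        for a, b in walls:
            ra, rb = find(a[1] * cells_w + a[0]), find(b[1] * cells_w + b[0])
            if ra != rb:          # the wall separates two components: knock it down
                parent[ra] = rb
                removed.add((a, b))
        # Some environments then remove additional walls until no dead ends remain,
        # turning the perfect maze into a layout with loops.
        return removed

    passages = kruskal_maze()
    print(len(passages))  # cells_w * cells_h - 1 passages for a perfect maze (here 143)

Because Kruskal's algorithm builds a spanning tree over the cells, every cell is reachable and exactly one path connects any two cells before the optional wall-removal pass.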
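Finally, the exploration and memory protocols described above can be reached through environment creation options. The snippet below assumes that the released procgen package exposes these settings through its distribution_mode keyword argument when using the gym registration; the chosen games are examples only, and the handpicked exploration seeds and full training instructions are those in the linked train-procgen repository.

    import gym

    # Exploration setting: levels that present a significant exploration challenge.
    # The handpicked seeds and exact training instructions live in the linked repository.
    explore_env = gym.make("procgen:procgen-coinrun-v0", distribution_mode="exploration")

    # Memory setting: larger, partially observable worlds where agents must
    # remember where they have been to solve levels reliably.
    memory_env = gym.make("procgen:procgen-maze-v0", distribution_mode="memory")

    obs = explore_env.reset()
    print(obs.shape)  # (64, 64, 3) RGB observation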