
Behaviour Suite for Reinforcement Learning

ICLR (2020)

Cited by: 53

Abstract

This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives. First, to collect clear, informative and scalable problems that capture key issues in the design of ...
Introduction
  • The reinforcement learning (RL) problem describes an agent interacting with an environment with the goal of maximizing cumulative reward through time (Sutton & Barto, 2017).
  • Open questions include the scalability of RL algorithms, the environments where the authors expect them to perform well, and the key issues outstanding in the design of a general intelligence system.
  • In this paper the authors introduce the Behaviour Suite for Reinforcement Learning: a collection of experiments designed to highlight key aspects of agent scalability.
Highlights
  • The reinforcement learning (RL) problem describes an agent interacting with an environment with the goal of maximizing cumulative reward through time (Sutton & Barto, 2017)
  • Unlike other branches of statistics and machine learning, a reinforcement learning agent must consider the effects of its actions upon future experience
  • In this paper we introduce the Behaviour Suite for Reinforcement Learning: a collection of experiments designed to highlight key aspects of agent scalability
  • Open source code, reproducible research: as part of this project we open-source github.com/deepmind/bsuite, which instantiates all experiments in code and automates the evaluation and analysis of any reinforcement learning agent on bsuite; a short sketch of enumerating these experiments follows this list
  • This section outlines the experiments included in the Behaviour Suite for Reinforcement Learning 2019 release
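The open-source package organises each release as a fixed set of experiments, each expanded into a sweep of environment configurations named by a bsuite_id string. The snippet below is a minimal sketch of enumerating that sweep; it assumes the bsuite.sweep module and its SWEEP constant as exposed in the public repository, and should be read as an illustration rather than a definitive API reference.

    # Sketch (assumed public API of github.com/deepmind/bsuite): list the
    # environment configurations in a release by their bsuite_id strings.
    from bsuite import sweep

    # sweep.SWEEP is expected to be a tuple of ids such as 'bandit/0', where the
    # prefix names the experiment and the suffix indexes one of its settings.
    experiments = {}
    for bsuite_id in sweep.SWEEP:
        name, _, setting = bsuite_id.partition('/')
        experiments.setdefault(name, []).append(setting)

    for name, settings in sorted(experiments.items()):
        print(f'{name}: {len(settings)} environment settings')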
Results
  • Rather than a single benchmark domain, bsuite is a collection of diagnostic experiments designed to provide insight into key aspects of agent behaviour.
  • As part of this project the authors open source github.com/deepmind/bsuite, which instantiates all experiments in code and automates the evaluation and analysis of any RL agent on bsuite.
  • The authors hope the Behaviour Suite for Reinforcement Learning, and its open source code, will provide significant value to the RL research community, and help to make key conceptual issues concrete and precise.
  • Research into general learning algorithms has been grounded by performance on specific environments (Sutton & Barto, 2017).
  • This section outlines the experiments included in the Behaviour Suite for Reinforcement Learning 2019 release.
  • Researchers may still find it useful to investigate internal aspects of their agents on bsuite environments, but it is not part of the standard analysis.
  • To accompany the experiment descriptions, the authors present results and analysis comparing three baseline algorithms on bsuite: DQN (Mnih et al, 2015a), A2C (Mnih et al, 2016) and Bootstrapped DQN (Osband et al, 2016).
  • For the bsuite experiment the authors run the agent on sizes N = 1, ..., 100 (exponentially spaced) and look at the average regret compared to optimal after 10k episodes.
  • For the bsuite experiment the authors run the agent on sizes N = 10, 12, ..., 50 and look at the average regret compared to optimal after 10k episodes.
  • Since loading the environment via bsuite handles the logging automatically, any agent interacting with that environment will generate the data required for analysis through the Jupyter notebook the authors provide (Perez & Granger, 2007); a minimal interaction-loop sketch follows this list.
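To make the automatic logging concrete, the following minimal sketch loads a single bsuite environment and runs an arbitrary agent on it; loading through bsuite wraps the environment so results are recorded for the analysis notebook. The load_and_record entry point and the dm_env-style interaction loop follow the open-source repository, while the save path and the uniformly random policy are placeholder assumptions standing in for the DQN, A2C and Bootstrapped DQN baselines.

    # Minimal sketch, not the authors' baseline code: run any agent on one
    # bsuite environment; loading via bsuite records results automatically.
    import numpy as np
    import bsuite

    env = bsuite.load_and_record('bandit/0', save_path='/tmp/bsuite', overwrite=True)
    num_actions = env.action_spec().num_values  # environments follow the dm_env interface

    for _ in range(env.bsuite_num_episodes):  # each experiment fixes its own episode budget
        timestep = env.reset()
        while not timestep.last():
            action = np.random.randint(num_actions)  # placeholder for a learned policy
            timestep = env.step(action)
    # The recorded logs are then summarised by the provided Jupyter notebook.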
Conclusion
  • If you write a conference paper targeting some improvement to hierarchical reinforcement learning, you will likely provide some justification for your results in terms of theorems or experiments targeted to this setting. However, it is typically a large amount of work to evaluate your algorithm according to alternative metrics, such as exploration.
  • The authors are reaching out to researchers and practitioners to help collate the most informative, targeted, scalable and clear experiments possible for reinforcement learning.
  • By collating clear, informative and scalable experiments, and by providing accessible tools for reproducible evaluation, the authors hope to facilitate progress in reinforcement learning research.
Related work
  • The Behaviour Suite for Reinforcement Learning fits into a long history of RL benchmarks. From the beginning, research into general learning algorithms has been grounded by performance on specific environments (Sutton & Barto, 2017). At first, these environments were typically motivated by small MDPs that instantiate the general learning problem. ‘CartPole’ (Barto et al, 1983) and ‘MountainCar’ (Moore, 1990) are examples of classic benchmarks that have provided a testing ground for RL algorithm development. Similarly, when studying specific capabilities of learning algorithms, it has often been helpful to design diagnostic environments with that capability in mind. Examples of this include ‘RiverSwim’ for exploration (Strehl & Littman, 2008) or ‘Taxi’ for temporal abstraction (Dietterich, 2000). Performance in these environments provides a targeted signal for particular aspects of algorithm development.
Contributions
  • Introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning agents with two objectives
  • Introduces the Behaviour Suite for Reinforcement Learning: a collection of experiments designed to highlight key aspects of agent scalability
  • Provides a description of the current suite of experiments and the key issues they identify in Section 2
  • Provides more details on what makes an ‘excellent’ experiment in Section 2, and on how to engage in their construction for future iterations in Section 5
  • Provides guidelines for how researchers can use bsuite effectively in Section 3
Reference
  • Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
  • Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proc. of ICML, 2017.
  • Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems 30, pp. 6241–6250, 2017.
  • Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 1983.
  • Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.
  • Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
  • Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186.
  • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.
  • Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. 2018. URL http://arxiv.org/abs/1812.06110.
  • Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.
  • Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
  • Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
  • Richard Evans and Jim Gao. DeepMind AI reduces Google data centre cooling bill by 40%. https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/, 2016.
  • Kunihiko Fukushima. Neural network model for a mechanism of pattern recognition unaffected by shift in position-Neocognitron. IEICE Technical Report, A, 62(10):658–665, 1979.
  • John C Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):148–164, 1979.
  • Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. CoRR, abs/1709.06560, 2017. URL http://arxiv.org/abs/1709.06560.
  • Alexey Grigorevich Ivakhnenko. The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control, 13:43–55, 1968.
  • Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.
  • M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, 2002.
  • Jeannette Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. 1952.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105, 2012.
  • Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • Shane Legg, Marcus Hutter, et al. A collection of definitions of intelligence. Frontiers in Artificial Intelligence and Applications, 157:17, 2007.
  • Kurt Lewin. Psychology and the process of group living. The Journal of Social Psychology, 17(1):113–131, 1943.
  • Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling. In Advances in Neural Information Processing Systems, pp. 3260–3268, 2017.
  • Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009, 2017.
  • Brenda Milner, Larry R Squire, and Eric R Kandel. Cognitive neuroscience and the study of memory. Neuron, 20(3):445–468, 1998.
  • Marvin Minsky. Steps towards artificial intelligence. Proceedings of the IRE, 1961.
  • miplib2017. MIPLIB 2017, 2018. http://miplib.zib.de.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015a.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015b.
  • Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proc. of ICML, 2016.
  • Andrew William Moore. Efficient memory-based learning for robot control. 1990.
  • Alistair Muldal, Yotam Doron, and John Aslanides. dm_env. https://github.com/deepmind/dm_env, 2017.
  • Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, et al. Massively parallel methods for deep reinforcement learning. In ICML Workshop on Deep Learning, 2015.
  • John O’Keefe and Jonathan Dostrovsky. The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain Research, 1971.
  • Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems 29, pp. 4026–4034, 2016.
  • Ian Osband, Daniel Russo, Zheng Wen, and Benjamin Van Roy. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.
  • Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems 31, pp. 8617–8629. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8080-randomized-prior-functions-for-deep-reinforcement-learning.pdf.
  • Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, and Hado Van Hasselt. Behaviour suite for reinforcement learning. 2019.
  • Jakub Pachocki, David Farhi, Szymon Sidor, Greg Brockman, Filip Wolski, Henrique Pondé, Jie Tang, Jonathan Raiman, Michael Petrov, Christy Dennison, Brooke Chan, Susan Zhang, Rafał Józefowicz, and Przemysław Dębiak. OpenAI Five. https://openai.com/five, 2019.
  • Fernando Pérez and Brian E. Granger. IPython: A system for interactive scientific computing. Computing in Science and Engineering, 9(3):21–29, May 2007. ISSN 1521-9615. doi: 10.1109/MCSE.2007.53. URL https://ipython.org.
  • Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
  • Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, and Ian Osband. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038, 2017.
  • A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3:211–229, 1959.
  • David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018. ISSN 0036-8075. doi: 10.1126/science.aar6404. URL https://science.sciencemag.org/content/362/6419/1140.
  • Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 2017.
  • R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3, 1988.
  • Brian Tanner and Adam White. RL-Glue: Language-independent software for reinforcement-learning experiments. Journal of Machine Learning Research, 10(Sep):2133–2136, 2009.
  • Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David ...
Appendix: bsuite 2019 experiments
  • This appendix outlines the experiments that make up the bsuite 2019 release. In the interests of brevity, only an outline of each experiment is given here. Full documentation for the environments, interaction and analysis is kept with the code at github.com/deepmind/bsuite.
  • A.1.1 Simple bandit. Environment: finite-armed bandit with deterministic rewards [0, 0.1, ..., 1] (Gittins, 1979); 20 seeds. Interaction: 10k episodes, record regret vs. optimal. Score: regret normalized [random, optimal] → [0, 1]; see the sketch after this list. Issues: basic.
  • MNIST contextual bandit. Environment: contextual-bandit classification of MNIST with ±1 rewards (LeCun et al., 1998); 20 seeds. Interaction: 10k episodes, record average regret. Score: regret normalized [random, optimal] → [0, 1]. Issues: basic, generalization.
  • Cartpole. Environment: the agent can move a cart left/right on a plane to keep a balanced pole upright (Barto et al., 1983); 20 seeds.
  • A.1.5 Mountain car. Environment: the agent drives an underpowered car up a hill (Moore, 1990); 20 seeds. Interaction: 10k episodes, record average regret. Score: regret normalized [random, optimal] → [0, 1]. Issues: basic, credit assignment, generalization.
  • Cartpole swing-up. Environment: cartpole ‘swing up’ problem with sparse reward (Barto et al., 1983), height limit x = [0, 0.5, ..., 0.95].
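The "regret normalized [random, optimal] → [0, 1]" scores above can be read as a linear rescaling of average regret so that a uniformly random policy maps to 0 and the optimal policy to 1. The helper below is a hypothetical illustration of that reading with placeholder numbers; it is not the scoring code shipped with bsuite.

    # Hypothetical illustration of "regret normalized [random, optimal] -> [0, 1]".
    def normalized_regret_score(avg_regret, regret_random, regret_optimal=0.0):
        """Rescale average regret so random play scores 0 and optimal play scores 1."""
        score = (regret_random - avg_regret) / (regret_random - regret_optimal)
        return max(0.0, min(1.0, score))  # clip, since an agent can do worse than random

    # Example with placeholder numbers: average regret 0.2 against a random-policy
    # regret of 0.5 gives a score of 0.6.
    print(normalized_regret_score(avg_regret=0.2, regret_random=0.5))  # 0.6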
Author
Ian Osband
Yotam Doron
John Aslanides
Eren Sezener
Andre Saraiva
Katrina McKinney