Massively Parallel Methods for Deep Reinforcement Learning

Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman

CoRR, abs/1507.04296, 2015.


Abstract:

We present the first massively distributed architecture for deep reinforcement learning. This architecture uses four main components: parallel actors that generate new behaviour; parallel learners that are trained from stored experience; a distributed neural network to represent the value function or behaviour policy; and a distributed experience replay memory.

Introduction
  • Deep learning methods have recently achieved state-of-the-art results in vision and speech domains (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2014; Graves et al., 2013; Dahl et al., 2012), mainly due to their ability to automatically learn high-level features from a supervised signal.
  • A new method for training such deep Q-networks, known as DQN, has enabled RL to learn control policies in complex environments with high-dimensional images as inputs (Mnih et al., 2015).
  • This method outperformed a human professional in many Atari 2600 games.
  • The parameter server can be sharded across many machines, with each shard applying gradients independently of the other shards.
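
To make the sharding idea concrete, here is a minimal single-process sketch of a sharded parameter server, assuming a flat parameter vector, plain SGD updates, and in-memory shards. The class and function names, the learning rate, and the update rule are illustrative assumptions, not details of the paper's implementation, which places shards on separate machines.

```python
import numpy as np

class ParameterShard:
    """One shard of the parameter server: owns a disjoint slice of the global
    parameter vector and applies gradients to that slice without coordinating
    with the other shards (asynchronous SGD)."""

    def __init__(self, params_slice: np.ndarray, lr: float = 1e-4):
        self.params = np.array(params_slice, dtype=np.float64)
        self.lr = lr

    def apply_gradient(self, grad_slice: np.ndarray) -> None:
        # Called whenever a learner sends a gradient for this shard's slice.
        self.params -= self.lr * grad_slice

    def get_params(self) -> np.ndarray:
        # Actors and learners periodically pull the latest parameters.
        return self.params.copy()

def shard_parameters(theta: np.ndarray, n_shards: int) -> list:
    """Split a flat parameter vector into independently updated shards."""
    return [ParameterShard(chunk) for chunk in np.array_split(theta, n_shards)]

# Example: ten shards, each applying its own slice of an incoming gradient.
shards = shard_parameters(np.zeros(1_000_000), n_shards=10)
gradient = np.random.randn(1_000_000)
for shard, grad_slice in zip(shards, np.array_split(gradient, 10)):
    shard.apply_gradient(grad_slice)
```
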
Highlights
  • Deep learning methods have recently achieved state-of-the-art results in vision and speech domains (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2014; Graves et al., 2013; Dahl et al., 2012), mainly due to their ability to automatically learn high-level features from a supervised signal
  • This architecture consists of four main components: parallel actors that generate new behaviour; parallel learners that are trained from stored experience; a distributed neural network to represent the value function or behaviour policy; and a distributed experience replay memory (a sketch of how these components interact follows this list)
  • The first of the two evaluation protocols follows the procedure established by the DQN algorithm
  • In this paper we have introduced the first massively distributed architecture for deep reinforcement learning
  • A single machine had previously achieved state-of-the-art results in the challenging suite of Atari 2600 games, but it was not previously known whether the good performance of the DQN algorithm would continue to scale with additional computation
  • Gorila takes a further step towards fulfilling the promise of deep learning in reinforcement learning: a scalable architecture that performs better and better with increased computation and memory
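
The sketch below shows, in one process, how these four components fit together: actors generate behaviour, transitions go into a replay memory, learners compute DQN gradients from replayed experience, and a parameter server applies the gradients. A toy random-walk environment and a linear Q-function stand in for the Atari emulator and the convolutional network so the example is self-contained; every name and hyperparameter is an assumption made for illustration, and the parameter server is left unsharded for brevity.

```python
import random
from collections import deque

import numpy as np

N_STATES, N_ACTIONS, GAMMA = 8, 2, 0.99

class ToyEnv:
    """Stand-in for an Atari emulator: reward 1 for reaching the right end."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + 1, N_STATES - 1) if a == 1 else max(self.s - 1, 0)
        done = self.s == N_STATES - 1
        return self.s, float(done), done

def phi(s):
    """One-hot state features (the paper instead learns convolutional features)."""
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def q_values(theta, s):
    return theta @ phi(s)                        # theta has shape (N_ACTIONS, N_STATES)

class ParameterServer:
    """Holds the global parameters; in Gorila this is sharded across machines."""
    def __init__(self, theta, lr=0.5):
        self.theta, self.lr = theta.copy(), lr
    def apply_gradient(self, grad):
        self.theta -= self.lr * grad             # asynchronous SGD update
    def get_params(self):
        return self.theta.copy()

def actor_step(env, state, server, replay, epsilon):
    """Actor: act epsilon-greedily with freshly pulled parameters and add the
    resulting transition to the replay memory."""
    theta = server.get_params()
    a = random.randrange(N_ACTIONS) if random.random() < epsilon \
        else int(np.argmax(q_values(theta, state)))
    s2, r, done = env.step(a)
    replay.append((state, a, r, s2, done))
    return env.reset() if done else s2

def learner_step(server, replay, target_theta, batch_size=8):
    """Learner: sample stored experience, form DQN targets with a periodically
    refreshed target network, and ship the gradient to the parameter server."""
    theta = server.get_params()
    grad = np.zeros_like(theta)
    for s, a, r, s2, done in random.sample(list(replay), batch_size):
        target = r + (0.0 if done else GAMMA * np.max(q_values(target_theta, s2)))
        grad[a] += (q_values(theta, s)[a] - target) * phi(s) / batch_size
    server.apply_gradient(grad)

env, replay = ToyEnv(), deque(maxlen=10_000)
server = ParameterServer(np.zeros((N_ACTIONS, N_STATES)))
target_theta, state = server.get_params(), env.reset()
for t in range(2000):
    eps = max(0.1, 1.0 - t / 1000)               # annealed exploration
    state = actor_step(env, state, server, replay, eps)
    if len(replay) >= 8:
        learner_step(server, replay, target_theta)
    if t % 100 == 0:
        target_theta = server.get_params()       # refresh the target network
print(np.argmax(server.theta, axis=0))           # greedy action per state
```
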
Methods
  • The authors closely followed the experimental setup of DQN (Mnih et al, 2015) using the same preprocessing and network architecture.
  • The authors preprocessed the 210 × 160 RGB images by downsampling them to 84 × 84 and extracting the luminance channel.
  • The Q-network Q(s, a; θ) had 3 convolutional layers followed by a fully-connected hidden layer.
  • The 84 × 84 × 4 input to the network was obtained by stacking the four most recent preprocessed frames.
  • Each hidden layer was followed by a rectifier nonlinearity (see the sketch after this list).
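
Below is a sketch of this preprocessing and network shape in PyTorch. The luminance weights and bilinear resizing are implementation choices, and the filter counts, kernel sizes, strides, and 512-unit hidden layer follow the single-machine DQN setup of Mnih et al. (2015); treat all of these specifics as assumptions rather than details stated in this summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def preprocess(frame_rgb: torch.Tensor) -> torch.Tensor:
    """Convert one 210 x 160 x 3 RGB frame (floats in [0, 1]) to an 84 x 84
    luminance image."""
    y = (frame_rgb * torch.tensor([0.299, 0.587, 0.114])).sum(dim=-1)  # luminance channel
    y = F.interpolate(y[None, None], size=(84, 84), mode="bilinear",
                      align_corners=False)                             # downsample to 84 x 84
    return y[0, 0]

class QNetwork(nn.Module):
    """Q(s, a; theta): three convolutional layers and one fully-connected
    hidden layer, each followed by a rectifier nonlinearity."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)   # input: 4 stacked 84 x 84 frames
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc = nn.Linear(64 * 7 * 7, 512)
        self.out = nn.Linear(512, n_actions)                      # one Q-value per action

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        return self.out(x)

# Example: stack four preprocessed frames and compute Q-values for 18 actions.
frames = torch.stack([preprocess(torch.rand(210, 160, 3)) for _ in range(4)])
q = QNetwork(n_actions=18)(frames.unsqueeze(0))                    # shape (1, 18)
```
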
Results
  • The agents were allowed to play until the end of the game or for up to 18,000 frames (5 minutes), whichever came first, and the scores were averaged over all 30 episodes.
  • The authors refer to this evaluation procedure as "null op starts" (a sketch follows this list).
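
A minimal sketch of this evaluation loop follows, assuming a Gym-style environment whose step() returns (observation, reward, done) and a greedy policy function. The cap of 30 initial null ops is the usual DQN convention and, like the interface itself, is an assumption rather than a detail stated above.

```python
import random

def evaluate_null_op_starts(env, policy, n_episodes=30, max_frames=18_000,
                            max_null_ops=30, null_op=0):
    """Average score over n_episodes; each episode begins with a random number
    of "null op" (do-nothing) actions and ends at game over or after
    max_frames frames (about 5 minutes), whichever comes first."""
    scores = []
    for _ in range(n_episodes):
        obs, score, done = env.reset(), 0.0, False
        for _ in range(random.randint(1, max_null_ops)):  # randomized start
            if done:
                break
            obs, reward, done = env.step(null_op)
            score += reward
        frames = 0
        while not done and frames < max_frames:
            obs, reward, done = env.step(policy(obs))
            score += reward
            frames += 1
        scores.append(score)
    return sum(scores) / n_episodes
```
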
Conclusion
  • In this paper the authors have introduced the first massively distributed architecture for deep reinforcement learning.
  • A single machine had previously achieved state-of-the-art results in the challenging suite of Atari 2600 games, but it was not previously known whether the good performance of DQN would continue to scale with additional computation.
  • Gorila DQN significantly outperformed single-GPU DQN on 41 out of 49 games, achieving by far the best results in this domain to date.
  • Gorila takes a further step towards fulfilling the promise of deep learning in RL: a scalable architecture that performs better and better with increased computation and memory
Tables
  • Table 1: Normalized scores, null op starts
  • Table 2: Normalized scores, human starts
  • Table 3: Raw data, human starts
  • Table 4: Raw data, null op starts
Related work
  • There have been several previous approaches to parallel or distributed RL. A significant part of this work has focused on distributed multi-agent systems (Weiss, 1995; Lauer & Riedmiller, 2000). In this approach, there are many agents taking actions within a single shared environment, working cooperatively to achieve a common objective. While computation is distributed in the sense of decentralized control, these algorithms focus on effective teamwork and emergent group behaviors. Another paradigm which has been explored is concurrent reinforcement learning (Silver et al, 2013), in which an agent can interact in parallel with an inherently distributed environment, e.g. to optimize interactions with multiple users on the internet. Our goal is quite different to both these distributed and concurrent RL paradigms: we simply seek to solve a single-agent problem more efficiently by exploiting parallel computation.
References
  • Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. arXiv preprint arXiv:1207.4708, 2012.
  • Coates, Adam, Huval, Brody, Wang, Tao, Wu, David, Catanzaro, Bryan, and Ng, Andrew. Deep learning with COTS HPC systems. In Proceedings of The 30th International Conference on Machine Learning, pp. 1337–1345, 2013.
  • Dahl, George E., Yu, Dong, Deng, Li, and Acero, Alex. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.
  • Dean, Jeffrey, Corrado, Greg, Monga, Rajat, Chen, Kai, Devin, Matthieu, Mao, Mark, Senior, Andrew, Tucker, Paul, Yang, Ke, Le, Quoc V., et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.
  • Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
  • Graves, Alex, Mohamed, A.-R., and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649. IEEE, 2013.
  • Grounds, Matthew and Kudenko, Daniel. Parallel reinforcement learning with linear function approximation. In Proceedings of the 5th, 6th and 7th European Conference on Adaptive and Learning Agents and Multi-agent Systems: Adaptation and Multi-agent Learning, pp. 60–74. Springer-Verlag, 2008.
  • Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoff. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106–1114, 2012.
  • Lauer, Martin and Riedmiller, Martin. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 535–542. Morgan Kaufmann, 2000.
  • Li, Yuxi and Schuurmans, Dale. MapReduce for parallel reinforcement learning. In Recent Advances in Reinforcement Learning - 9th European Workshop, EWRL 2011, Athens, Greece, September 9-11, 2011, Revised Selected Papers, pp. 309–320, 2011.
  • Lin, Long-Ji. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.
  • Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
  • Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. URL http://dx.doi.org/10.1038/nature14236.
  • Silver, David, Newnham, Leonard, Barker, David, Weller, Suzanne, and McFall, Jason. Concurrent reinforcement learning from customer interactions. In Proceedings of the 30th International Conference on Machine Learning, pp. 924–932, 2013.
  • Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
  • Tsitsiklis, J. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
  • Weiss, Gerhard. Distributed reinforcement learning. 15:135–142, 1995.