Human-level control through deep reinforcement learning

Nature 518, 529–533 (2015).

Cited: 13,805 | Views: 1,297

Abstract

The theory of reinforcement learning provides a normative account[1], deeply rooted in psychological[2] and neuroscientific[3] perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations.

Introduction
  • The theory of reinforcement learning provides a normative account[1], deeply rooted in psychological[2] and neuroscientific[3] perspectives on animal behaviour, of how agents may optimize their control of an environment.
  • The authors use a deep convolutional neural network to approximate the optimal action-value function Q*(s, a).
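To make the approximation concrete, below is a minimal PyTorch sketch of the convolutional Q-network architecture reported in the paper's Methods: three convolutional layers over an 84 × 84 × 4 stack of preprocessed frames, a fully connected layer of 512 rectifier units, and one linear output per valid action. The class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Convolutional Q-network sketch: maps a stack of 4 preprocessed 84x84
    frames to one estimated action value per valid action (layer sizes follow
    the paper's Methods; names here are illustrative)."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4x84x84 -> 32x20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 64x9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 64x7x7
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q-value per action, no nonlinearity
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84), pixel intensities scaled to [0, 1]
        return self.head(self.features(frames))

# Usage example: estimate action values for two states in an 18-action game
# and act greedily with respect to them.
q_net = DQNNetwork(num_actions=18)
q_values = q_net(torch.rand(2, 4, 84, 84))   # shape (2, 18)
greedy_actions = q_values.argmax(dim=1)
```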
Highlights
  • The theory of reinforcement learning provides a normative account[1], deeply rooted in psychological[2] and neuroscientific[3] perspectives on animal behaviour, of how agents may optimize their control of an environment
  • We developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural network[16] known as deep neural networks
  • In additional simulations, we demonstrate the importance of the individual core components of the deep Q-network agent—the replay memory, separate target Q-network and deep convolutional network architecture—by disabling them and demonstrating the detrimental effects on performance (a minimal sketch of the replay memory and target network follows this list)
  • We examined the representations learned by deep Q-network that underpinned the successful performance of the agent in the context of the game Space Invaders, by using a technique developed for the visualization of high-dimensional data called ‘t-SNE’[25] (Fig. 4)
  • We show that the representations learned by deep Q-network are able to generalize to data generated from policies other than its own—in simulations where we presented as input to the network game states experienced during human and agent play, recorded the representations of the last hidden layer, and visualized the embeddings generated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary Discussion)
  • Extended Data Fig. 2 provides an additional illustration of how the representations learned by deep Q-network allow it to accurately predict state and action values
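The two stabilizing components named above can be sketched in a few lines of Python. The buffer capacity and synchronization interval are indicative values; ReplayMemory, online_net and maybe_sync_target are illustrative names, and the sketch reuses the DQNNetwork class (and torch import) from the previous example.

```python
import copy
import random
from collections import deque

class ReplayMemory:
    """Bounded buffer of (state, action, reward, next_state, done) transitions;
    minibatches are drawn uniformly at random, which breaks the correlations
    present in the raw observation sequence."""

    def __init__(self, capacity: int = 1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int = 32):
        return random.sample(self.buffer, batch_size)

# Separate target network: a periodically refreshed copy of the online
# Q-network used to compute the Q-learning targets (DQNNetwork is the sketch
# defined earlier in this summary).
online_net = DQNNetwork(num_actions=18)
target_net = copy.deepcopy(online_net)

def maybe_sync_target(step: int, sync_every: int = 10_000) -> None:
    # Copy the online parameters into the target network every `sync_every`
    # updates; between copies the targets stay fixed, reducing oscillations.
    if step % sync_every == 0:
        target_net.load_state_dict(online_net.state_dict())
```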
Results
  • The authors used the same network architecture, hyperparameter values and learning procedure throughout—taking high-dimensional data (210 × 160 colour video at 60 Hz) as input—to demonstrate that the approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge.
  • The authors compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available[12,15].
  • In addition to the learned agents, the authors report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% and 0% on y axis; see Methods).
  • The authors examined the representations learned by DQN that underpinned the successful performance of the agent in the context of the game Space Invaders, by using a technique developed for the visualization of high-dimensional data called ‘t-SNE’[25] (Fig. 4; a sketch of this analysis follows this list).
  • The authors found instances in which the t-SNE algorithm generated similar embeddings for DQN representations of states that are close in terms of expected reward but perceptually dissimilar (Fig. 4, bottom right, top left and middle), consistent with the notion that the network is able to learn representations that support adaptive behaviour from high-dimensional sensory inputs.
  • The authors show that the representations learned by DQN are able to generalize to data generated from policies other than its own—in simulations where the authors presented as input to the network game states experienced during human and agent play, recorded the representations of the last hidden layer, and visualized the embeddings generated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary Discussion).
  • Extended Data Fig. 2 provides an additional illustration of how the representations learned by DQN allow it to accurately predict state and action values.
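As a hedged illustration of the t-SNE analysis described above, the sketch below records the 512-unit last-hidden-layer activations of the network for a batch of game states, embeds them in two dimensions, and colours each point by the predicted state value max_a Q(s, a). It reuses the DQNNetwork sketch from earlier, substitutes scikit-learn's TSNE for the t-SNE implementation of ref. [25], and feeds in random placeholder frames where recorded human or agent game states would go.

```python
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def last_hidden_and_value(net, frames):
    """Return the 512-unit last-hidden-layer activations and the predicted
    state value max_a Q(s, a) for a batch of preprocessed frame stacks
    (assumes the DQNNetwork sketch shown earlier)."""
    hidden = net.head[:3](net.features(frames))   # Flatten -> Linear -> ReLU
    q_values = net.head[3](hidden)                # final linear layer
    return hidden, q_values.max(dim=1).values

# `states` stands in for a tensor of shape (N, 4, 84, 84) collected during play.
states = torch.rand(500, 4, 84, 84)               # placeholder data
net = DQNNetwork(num_actions=18)
hidden, values = last_hidden_and_value(net, states)

embedding = TSNE(n_components=2, perplexity=30).fit_transform(hidden.numpy())
plt.scatter(embedding[:, 0], embedding[:, 1], c=values.numpy(), s=5)
plt.colorbar(label="predicted state value")
plt.title("t-SNE of DQN last-hidden-layer representations (sketch)")
plt.show()
```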
Conclusion
  • The authors demonstrate that a single architecture can successfully learn control policies in a range of different environments with only very minimal prior knowledge, receiving only the pixels and the game score as inputs, and using the same algorithm, network architecture and hyperparameters on each game, privy only to the inputs a human player would have.
  • In contrast to previous work[24,26], the approach incorporates ‘end-to-end’ reinforcement learning that uses reward to continuously shape representations within the convolutional network towards salient features of the environment that facilitate value estimation.
Tables
  • Table 1: List of hyperparameters and their values
  • Table 2: Comparison of game scores obtained by DQN agents with methods from the literature[12,15] and a professional human games tester
  • Table 3: The effects of replay and separating the target Q-network
  • Table 4: Comparison of DQN performance with a linear function approximator
Contributions
  • Demonstrates that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters
  • Developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural network[16] known as deep neural networks
  • Addresses the instabilities that arise when a nonlinear function approximator such as a neural network is used to represent the action-value function, with a novel variant of Q-learning that uses two key ideas: experience replay and a separate, periodically updated target network (the corresponding loss is given after this list)
  • Reports scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random
  • Demonstrates the importance of the individual core components of the DQN agent—the replay memory, separate target Q-network and deep convolutional network architecture—by disabling them and demonstrating the detrimental effects on performance
  • Our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games; see Fig. 3)
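For reference, the two ideas listed above enter the method through the loss minimized at iteration i, where transitions (s, a, r, s′) are drawn uniformly from the replay memory D and the target is computed with the older parameters θᵢ⁻ of the separate target network:

```latex
L_i(\theta_i) \;=\;
  \mathbb{E}_{(s,a,r,s')\,\sim\,U(D)}
  \left[ \Big( r + \gamma \max_{a'} Q\big(s', a'; \theta_i^{-}\big)
               - Q\big(s, a; \theta_i\big) \Big)^{2} \right]
```

The "75% of the human score" figure refers to human-normalized scores, computed in the Methods as 100 × (agent score − random score) / (human score − random score), so that random play maps to 0% and the professional human tester to 100%.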
References
  • [1] Sutton, R. & Barto, A. Reinforcement Learning: An Introduction (MIT Press, 1998).
  • [2] Thorndike, E. L. Animal Intelligence: Experimental Studies (Macmillan, 1911).
  • [3] Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
  • [4] Serre, T., Wolf, L. & Poggio, T. Object recognition with features inspired by visual cortex. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 994–1000 (2005).
  • [5] Fukushima, K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980).
  • [6] Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).
  • [7] Riedmiller, M., Gabel, T., Hafner, R. & Lange, S. Reinforcement learning for robot soccer. Auton. Robots 27, 55–73 (2009).
  • [8] Diuk, C., Cohen, A. & Littman, M. L. An object-oriented representation for efficient reinforcement learning. Proc. Int. Conf. Mach. Learn. 240–247 (2008).
  • [9] Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1–127 (2009).
  • [10] Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1106–1114 (2012).
  • [11] Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
  • [12] Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
  • [13] Legg, S. & Hutter, M. Universal intelligence: a definition of machine intelligence. Minds Mach. 17, 391–444 (2007).
  • [14] Genesereth, M., Love, N. & Pell, B. General game playing: overview of the AAAI competition. AI Mag. 26, 62–72 (2005).
  • [15] Bellemare, M. G., Veness, J. & Bowling, M. Investigating contingency awareness using Atari 2600 games. Proc. Conf. AAAI Artif. Intell. 864–871 (2012).
  • [16] McClelland, J. L., Rumelhart, D. E. & the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, 1986).
  • [17] LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
  • [18] Hubel, D. H. & Wiesel, T. N. Shape and arrangement of columns in cat’s striate cortex. J. Physiol. 165, 559–568 (1963).
  • [19] Watkins, C. J. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992).
  • [20] Tsitsiklis, J. & Roy, B. V. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr. 42, 674–690 (1997).
  • [21] McClelland, J. L., McNaughton, B. L. & O’Reilly, R. C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457 (1995).
  • [22] O’Neill, J., Pleydell-Bouverie, B., Dupret, D. & Csicsvari, J. Play it again: reactivation of waking experience and memory. Trends Neurosci. 33, 220–229 (2010).
  • [23] Lin, L.-J. Reinforcement Learning for Robots Using Neural Networks. Technical Report, DTIC Document (1993).
  • [24] Riedmiller, M. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. Mach. Learn.: ECML 3720, 317–328 (Springer, 2005).
  • [25] Van der Maaten, L. J. P. & Hinton, G. E. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
  • [26] Lange, S. & Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. Proc. Int. Jt. Conf. Neural Netw. 1–8 (2010).
  • [27] Law, C.-T. & Gold, J. I. Reinforcement learning can account for associative and perceptual learning on a visual decision task. Nature Neurosci. 12, 655 (2009).
  • [28] Sigala, N. & Logothetis, N. K. Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415, 318–320 (2002).
  • [29] Bendor, D. & Wilson, M. A. Biasing the content of hippocampal replay during sleep. Nature Neurosci. 15, 1439–1444 (2012).
  • [30] Moore, A. & Atkeson, C. Prioritized sweeping: reinforcement learning with less data and less real time. Mach. Learn. 13, 103–130 (1993).
  • [31] Jarrett, K., Kavukcuoglu, K., Ranzato, M. A. & LeCun, Y. What is the best multi-stage architecture for object recognition? Proc. IEEE Int. Conf. Comput. Vis. 2146–2153 (2009).
  • [32] Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. Proc. Int. Conf. Mach. Learn. 807–814 (2010).
  • [33] Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134 (1998).