
Value Prediction Network.

Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017

Cited by 220 | Viewed 189

Abstract

This paper proposes a novel deep reinforcement learning (RL) architecture, called Value Prediction Network (VPN), which integrates model-free and model-based RL methods into a single neural network. In contrast to typical model-based RL methods, VPN learns a dynamics model whose abstract states are trained to make option-conditional predictions of future rewards, discounts, and values rather than of future observations.

Introduction
  • Model-based reinforcement learning (RL) approaches attempt to learn a model that predicts future observations conditioned on actions; such a model can be used to simulate the real environment and perform multi-step lookaheads for planning (a minimal sketch follows this list).
  • The authors call such a model an observation-prediction model, to distinguish it from the other form of model introduced in this paper.
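To make the contrast concrete, here is a minimal Python sketch of depth-limited lookahead planning with an observation-prediction model: the model simulates the environment step by step, and the planner scores whole action sequences by their predicted discounted return. The `ObservationModel` class, its `predict` method, and the toy dynamics and rewards are illustrative placeholders, not part of the paper.

```python
# Minimal sketch: exhaustive multi-step lookahead through a learned
# observation-prediction model (placeholder dynamics, 2 discrete actions).
from itertools import product

class ObservationModel:
    """Hypothetical learned model: the observation is a single float."""
    def predict(self, obs, action):
        next_obs = obs + (1.0 if action == 1 else -0.5)  # fake dynamics
        reward = -abs(next_obs)                           # fake reward
        return next_obs, reward

def plan(model, obs, depth=3, gamma=0.99, actions=(0, 1)):
    """Roll every action sequence of length `depth` through the model
    and return the first action of the best-scoring sequence."""
    best_first, best_return = None, float("-inf")
    for seq in product(actions, repeat=depth):
        o, ret, discount = obs, 0.0, 1.0
        for a in seq:
            o, r = model.predict(o, a)
            ret += discount * r
            discount *= gamma
        if ret > best_return:
            best_first, best_return = seq[0], ret
    return best_first

print(plan(ObservationModel(), obs=2.0))
```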
Highlights
  • Model-based reinforcement learning (RL) approaches attempt to learn a model that predicts future observations conditioned on actions and can be used to simulate the real environment and perform multi-step lookaheads for planning.
  • We show that the Value Prediction Network is more robust to stochasticity in the environment than an approach based on an observation-prediction model.
  • A CNN was used as the encoding module of the Value Prediction Network, and the transition module consists of one option-conditional convolution layer, which uses different weights depending on the option, followed by a few more convolution layers.
  • To investigate how the Value Prediction Network deals with complex visual observations, we evaluated it on several Atari games [2].
  • We introduced value prediction networks (VPNs) as a new deep reinforcement learning approach that integrates planning and learning while simultaneously learning the dynamics of abstract states that make option-conditional predictions of future rewards, discounts, and values rather than future observations.
  • Our empirical evaluations showed that the Value Prediction Network outperforms model-free Deep Q-Network baselines in multiple domains, and outperforms traditional observation-based planning in a stochastic domain.
Methods
  • The authors' experiments investigated the following questions: 1) Does VPN outperform model-free baselines (e.g., DQN)? 2) What is the advantage of planning with a VPN over observation-based planning? 3) Is VPN useful for complex domains with high-dimensional sensory inputs, such as Atari games?

    4.1 Experimental Setting

    Network Architecture.
  • A CNN was used as the encoding module of VPN, and the transition module consists of one option-conditional convolution layer, which uses different weights depending on the option, followed by a few more convolution layers.
  • The value module consists of two fully-connected layers.
  • The number of layers and hidden units varies depending on the domain.
  • These details are described in the supplementary material; a minimal sketch of these modules follows this list.
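For illustration only, a minimal PyTorch sketch of the modules listed above, assuming a small image-like observation (3x10x10), four options, and made-up layer sizes; the exact configuration is in the paper's supplementary material and will differ from this guess. The reward and discount heads reflect the option-conditional reward/discount predictions described elsewhere on this page, but their pooled-feature form here is an assumption.

```python
import torch
import torch.nn as nn

class VPNSketch(nn.Module):
    """Illustrative stand-in for the VPN modules, not the paper's exact network."""
    def __init__(self, in_channels=3, num_options=4, hidden=64, spatial=10):
        super().__init__()
        # Encoding module: CNN mapping an observation to an abstract state.
        self.encode = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        # Transition module: one option-conditional conv layer (separate
        # weights per option) followed by a shared conv layer.
        self.option_conv = nn.ModuleList(
            [nn.Conv2d(hidden, hidden, 3, padding=1) for _ in range(num_options)]
        )
        self.shared_conv = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU())
        # Outcome heads: per-option reward and discount predictions from pooled features.
        self.reward = nn.Linear(hidden, num_options)
        self.discount = nn.Linear(hidden, num_options)
        # Value module: two fully-connected layers on the abstract state.
        self.value = nn.Sequential(
            nn.Flatten(),
            nn.Linear(hidden * spatial * spatial, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, option):
        s = self.encode(obs)                         # abstract state
        pooled = s.mean(dim=(2, 3))                  # (batch, hidden)
        r = self.reward(pooled)[:, option]           # predicted option reward
        g = torch.sigmoid(self.discount(pooled))[:, option]  # predicted discount
        s_next = self.shared_conv(torch.relu(self.option_conv[option](s)))
        v_next = self.value(s_next)                  # value of the next abstract state
        return r, g, v_next

# Usage with a dummy observation of the assumed shape.
r, g, v = VPNSketch()(torch.zeros(1, 3, 10, 10), option=2)
```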
Conclusion
  • The authors introduced value prediction networks (VPNs) as a new deep RL approach that integrates planning and learning while simultaneously learning the dynamics of abstract states that make option-conditional predictions of future rewards, discounts, and values rather than future observations (a planning sketch follows this list).
  • The authors' empirical evaluations showed that VPNs outperform model-free DQN baselines in multiple domains, and outperform traditional observation-based planning in a stochastic domain.
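As a rough illustration of planning over abstract states, the following Python sketch computes option values by rolling the learned dynamics forward for a few steps and backing up predicted reward, discount, and value. The `ToyCore` class, its `transition`/`value` interface, and the simplified backup (the paper additionally mixes rollout values with direct value estimates at intermediate depths) are assumptions for this sketch, not the authors' implementation.

```python
class ToyCore:
    """Placeholder abstract-state model: states are floats, 2 options."""
    def transition(self, s, o):
        s_next = s + (1.0 if o == 1 else -1.0)
        return s_next, -abs(s_next), 0.99   # (next state, reward, discount)
    def value(self, s):
        return -abs(s)

def q_plan(core, s, o, depth):
    """Q(s, o) from a depth-limited rollout: predicted reward plus the
    predicted discount times the best continuation value."""
    s_next, r, g = core.transition(s, o)
    if depth <= 1:
        return r + g * core.value(s_next)
    best_next = max(q_plan(core, s_next, o2, depth - 1) for o2 in (0, 1))
    return r + g * best_next

core = ToyCore()
# Pick the option with the best 3-step lookahead value from state 2.0.
print(max((0, 1), key=lambda o: q_plan(core, 2.0, o, depth=3)))
```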
Tables
  • Table 1
  • Table 2: Performance on Atari games. Each number is the average score over the top 5 agents.
Related Work
  • Model-based Reinforcement Learning. Dyna-Q [32, 34, 39] integrates model-free and model-based RL by learning an observation-prediction model and using it to generate samples for Q-learning, in addition to the model-free samples obtained by acting in the real environment. Gu et al. [7] extended these ideas to continuous control problems. Our work is similar to Dyna-Q in the sense that planning and learning are integrated into one architecture. However, VPNs perform a lookahead tree search to choose actions and compute bootstrapped targets, whereas Dyna-Q uses a learned model to generate imaginary samples. In addition, Dyna-Q learns a model of the environment separately from a value function approximator. In contrast, the dynamics model in VPN is combined with the value function approximator in a single neural network and is indirectly learned from reward and value predictions through backpropagation.
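To make the contrast with Dyna-Q concrete, here is a minimal tabular Dyna-Q sketch in Python: the learned model is used only to generate imaginary transitions for extra Q-learning updates, rather than to drive a lookahead search at decision time as in VPN. The toy transition, step sizes, and two-action setup are placeholders, not tied to any experiment in the paper.

```python
# Minimal tabular Dyna-Q: real experience updates Q and the model;
# imagined replays from the model provide extra Q-learning updates.
import random
from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)]
model = {}                  # model[(state, action)] = (reward, next_state)
alpha, gamma, n_planning = 0.1, 0.95, 10

def q_update(s, a, r, s2, actions=(0, 1)):
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def dyna_q_step(s, a, r, s2):
    q_update(s, a, r, s2)              # model-free update from real experience
    model[(s, a)] = (r, s2)            # learn the (deterministic) model
    for _ in range(n_planning):        # planning: replay imagined transitions
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps2)

dyna_q_step(s=0, a=1, r=1.0, s2=1)     # one toy real transition
print(Q[(0, 1)])
```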
Funding
  • This work was supported by NSF grant IIS-1526059
References
  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • [2] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. arXiv preprint arXiv:1207.4708, 2012.
  • [3] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
  • [4] S. Chiappa, S. Racaniere, D. Wierstra, and S. Mohamed. Recurrent environment simulators. In ICLR, 2017.
  • [5] C. Finn, I. J. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
  • [6] C. Finn and S. Levine. Deep visual foresight for planning robot motion. In ICRA, 2017.
  • [7] S. Gu, T. P. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In ICML, 2016.
  • [8] X. Guo, S. P. Singh, R. L. Lewis, and H. Lee. Deep learning for reward design to improve Monte Carlo tree search in Atari games. In IJCAI, 2016.
  • [9] M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [11] N. Heess, G. Wayne, D. Silver, T. P. Lillicrap, Y. Tassa, and T. Erez. Learning continuous control policies by stochastic value gradients. In NIPS, 2015.
  • [12] M. Jaderberg, V. Mnih, W. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In ICLR, 2017.
  • [13] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
  • [14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [15] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In ECML, 2006.
  • [16] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. Gershman. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016.
  • [17] A. S. Lakshminarayanan, S. Sharma, and B. Ravindran. Dynamic action repetition for deep reinforcement learning. In AAAI, 2017.
  • [18] I. Lenz, R. A. Knepper, and A. Saxena. DeepMPC: Learning deep latent features for model predictive control. In RSS, 2015.
  • [19] N. Mishra, P. Abbeel, and I. Mordatch. Prediction and control with temporal segment models. In ICML, 2017.
  • [20] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
  • [21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [22] J. Oh, V. Chockalingam, S. Singh, and H. Lee. Control of memory, active perception, and action in Minecraft. In ICML, 2016.
  • [23] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2015.
  • [24] E. Parisotto and R. Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. arXiv preprint arXiv:1702.08360, 2017.
  • [25] D. Precup. Temporal abstraction in reinforcement learning. PhD thesis, University of Massachusetts, Amherst, 2000.
  • [26] T. Raiko and M. Tornio. Variational Bayesian learning of nonlinear hidden state-space models for model predictive control. Neurocomputing, 72(16):3704–3712, 2009.
  • [27] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [28] D. Silver, R. S. Sutton, and M. Müller. Temporal-difference search in computer Go. Machine Learning, 87:183–219, 2012.
  • [29] D. Silver, H. van Hasselt, M. Hessel, T. Schaul, A. Guez, T. Harley, G. Dulac-Arnold, D. Reichert, N. Rabinowitz, A. Barreto, and T. Degris. The predictron: End-to-end learning and planning. In ICML, 2017.
  • [30] B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
  • [31] M. Stolle and D. Precup. Learning options in reinforcement learning. In SARA, 2002.
  • [32] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990.
  • [33] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.
  • [34] R. S. Sutton, C. Szepesvári, A. Geramifard, and M. H. Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In UAI, 2008.
  • [35] A. Tamar, S. Levine, P. Abbeel, Y. Wu, and G. Thomas. Value iteration networks. In NIPS, 2016.
  • [36] A. Vezhnevets, V. Mnih, S. Osindero, A. Graves, O. Vinyals, J. Agapiou, and K. Kavukcuoglu. Strategic attentive writer for learning macro-actions. In NIPS, 2016.
  • [37] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. Dueling network architectures for deep reinforcement learning. In ICML, 2016.
  • [38] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
  • [39] H. Yao, S. Bhatnagar, D. Diao, R. S. Sutton, and C. Szepesvári. Multi-step Dyna planning for policy evaluation and control. In NIPS, 2009.