# Action-Conditional Video Prediction using Deep Networks in Atari Games

Annual Conference on Neural Information Processing Systems, (2015): 2863-2871

EI

关键词

摘要

Motivated by vision-based reinforcement learning (RL) problems, in particular Atari games from the recent benchmark Aracade Learning Environment (ALE), we consider spatio-temporal prediction problems where future image-frames depend on control variables or actions as well as previous frames. While not composed of natural scenes, frames in...更多

代码：

数据：

简介

- Deep learning approaches have shown great success in many visual perception problems (e.g., [16, 7, 32, 9]).
- The authors focus on Atari games from the Arcade Learning Environment (ALE) [1] as a source of challenging action-conditional video modeling problems.
- While not composed of natural scenes, frames in Atari games are high-dimensional, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability.
- To the best of the knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional images conditioned by control inputs

重点内容

- Over the years, deep learning approaches have shown great success in many visual perception problems (e.g., [16, 7, 32, 9])
- In vision-based reinforcement learning (RL) problems, learning to predict future images conditioned on actions amounts to learning a model of the dynamics of the agent-environment interaction, an essential component of model-based approaches to reinforcement learning
- We focus on Atari games from the Arcade Learning Environment (ALE) [1] as a source of challenging action-conditional video modeling problems
- While not composed of natural scenes, frames in Atari games are high-dimensional, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability
- An example of long-term predictions is illustrated in Figure 2
- This paper introduced two different novel deep architectures that predict future frames that are dependent on actions and showed qualitatively and quantitatively that they are able to predict visuallyrealistic and useful-for-control frames over 100-step futures on several Atari game domains

方法

- In the experiments that follow, the authors have the following goals for the two architectures. 1) To evaluate the predicted frames in two ways: qualitatively evaluating the generated video, and quantitatively evaluating the pixel-based squared error, 2) To evaluate the usefulness of predicted frames for control in two ways: by replacing the emulator’s frames with predicted frames for use by DQN, and by using the predictions to improve exploration in DQN, and 3) To analyze the representations learned by the architectures.
- Data and Preprocessing.
- The authors used the replication of DQN to generate game-play video datasets using an -greedy policy with = 0.3, i.e. DQN is forced to choose a random action with 30% probability.
- The dataset consists of about 500, 000 training frames and 50, 000 test frames with actions chosen by DQN.
- Following DQN, actions are chosen once every 4 frames which reduces the video from 60fps to 15fps.
- The authors used full-resolution RGB images (210 × 160) and preprocessed the images by subtracting mean pixel values and dividing each pixel value by 255

结果

**Evaluation of Predicted Frames**

Qualitative Evaluation: Prediction video. The prediction videos of the models and baselines are available in the supplementary material and at the following website: https://sites.google. com/a/umich.edu/junhyuk-oh/action-conditional-video-prediction.- As seen in the videos, the proposed models make qualitatively reasonable predictions over 30–500 steps depending on the game.
- An example of long-term predictions is illustrated in Figure 2.
- The authors observed that both of the models predict complex local translations well such as the movement of vehicles and the controlled object.
- In Figure 2, the model predicts the sudden change of the location of the controlled object at 257-step

结论

- This paper introduced two different novel deep architectures that predict future frames that are dependent on actions and showed qualitatively and quantitatively that they are able to predict visuallyrealistic and useful-for-control frames over 100-step futures on several Atari game domains.
- Since the architectures were domain independent the authors expect that they will generalize to many vision-based RL problems.

- Table1: Average game score of DQN over 100 plays with standard error. The first row and the second row show the performance of our DQN replication with different exploration strategies

相关工作

- Video Prediction using Deep Networks. The problem of video prediction has led to a variety of architectures in deep learning. A recurrent temporal restricted Boltzmann machine (RTRBM) [29] was proposed to learn temporal correlations from sequential data by introducing recurrent connections in RBM. A structured RTRBM (sRTRBM) [20] scaled up RTRBM by learning dependency structures between observations and hidden variables from data. More recently, Michalski et al [19] proposed a higher-order gated autoencoder that defines multiplicative interactions between consecutive frames and mapping units, and showed that temporal prediction problem can be viewed as learning and inferring higher-order interactions between consecutive images. Srivastava et al [28] applied a sequence-to-sequence learning framework [31] to a video domain, and showed that long short-term memory (LSTM) [12] networks are capable of generating video of bouncing handwritten digits. In contrast to these previous studies, this paper tackles problems where control variables affect temporal dynamics, and in addition scales up spatio-temporal prediction to larger-size images.

基金

- This work was supported by NSF grant IIS-1526059, Bosch Research, and ONR grant N00014-13-1-0762

引用论文

- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- M. G. Bellemare, J. Veness, and M. Bowling. Investigating contingency awareness using Atari 2600 games. In AAAI, 2012.
- M. G. Bellemare, J. Veness, and M. Bowling. Bayesian learning of recursively factored environments. In ICML, 2013.
- M. G. Bellemare, J. Veness, and E. Talvitie. Skip context tree switching. In ICML, 2014.
- Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
- Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
- D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, 2012.
- A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In NIPS, 2014.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- L. Kocsis and C. Szepesvari. Bandit based Monte-Carlo planning. In ECML. 2006.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- I. Lenz, R. Knepper, and A. Saxena. DeepMPC: Learning deep latent features for model predictive control. In RSS, 2015.
- R. Memisevic. Learning to relate images. IEEE TPAMI, 35(8):1829–1846, 2013.
- V. Michalski, R. Memisevic, and K. Konda. Modeling deep temporal dependencies with recurrent grammar cells. In NIPS, 2014.
- R. Mittelman, B. Kuipers, S. Savarese, and H. Lee. Structured recurrent temporal restricted Boltzmann machines. In ICML, 2014.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
- S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In ICML, 2014.
- S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza. Disentangling factors of variation for facial expression recognition. In ECCV. 2012.
- J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
- J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2:125–134, 1991.
- N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
- I. Sutskever, G. E. Hinton, and G. W. Taylor. The recurrent temporal restricted Boltzmann machine. In NIPS, 2009.
- I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
- I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
- G. W. Taylor and G. E. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In ICML, 2009.
- T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divde the gradient by a running average of its recent magnitude. Coursera, 2012.
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
- C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- J. Yang, S. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In NIPS, 2015.

标签

评论

数据免责声明

页面数据均来自互联网公开来源、合作出版商和通过AI技术自动分析结果，我们不对页面数据的有效性、准确性、正确性、可靠性、完整性和及时性做出任何承诺和保证。若有疑问，可以通过电子邮件方式联系我们：report@aminer.cn