Action-Conditional Video Prediction using Deep Networks in Atari Games
Advances in Neural Information Processing Systems (NIPS), 2015: 2863–2871
Motivated by vision-based reinforcement learning (RL) problems, in particular Atari games from the recent benchmark Arcade Learning Environment (ALE), we consider spatio-temporal prediction problems where future image-frames depend on control variables or actions as well as previous frames. While not composed of natural scenes, frames in Atari games are high-dimensional, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability.
- Deep learning approaches have shown great success in many visual perception problems (e.g., [16, 7, 32, 9]).
- The authors focus on Atari games from the Arcade Learning Environment (ALE) as a source of challenging action-conditional video modeling problems.
- While not composed of natural scenes, frames in Atari games are high-dimensional, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability.
- To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional images conditioned on control inputs.
- In vision-based reinforcement learning (RL) problems, learning to predict future images conditioned on actions amounts to learning a model of the dynamics of the agent-environment interaction, an essential component of model-based approaches to reinforcement learning
- In the experiments that follow, the authors have three goals for the two architectures: 1) to evaluate the predicted frames both qualitatively (by inspecting the generated video) and quantitatively (by pixel-based squared error); 2) to evaluate the usefulness of predicted frames for control, both by replacing the emulator’s frames with predicted frames for use by DQN and by using the predictions to improve exploration in DQN; and 3) to analyze the representations learned by the architectures.
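The quantitative metric named in the goals, pixel-based squared error, can be sketched as below. This is a minimal illustration, not the authors' evaluation code; the array shapes are assumptions (a k-step rollout of H×W×C frames).

```python
import numpy as np

def pixel_mse(predicted, target):
    """Mean squared error between predicted and ground-truth frames.

    predicted, target: float arrays of shape (k, H, W, C) holding a
    k-step rollout and the corresponding emulator frames.
    Returns one scalar error per prediction step.
    """
    diff = predicted.astype(np.float64) - target.astype(np.float64)
    # average squared differences over pixels and channels, keeping the
    # time axis so that error growth over prediction steps is visible
    return (diff ** 2).mean(axis=(1, 2, 3))
```

Keeping the per-step axis makes it easy to plot how prediction error accumulates over long horizons.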
- Data and Preprocessing.
- The authors used their replication of DQN to generate game-play video datasets using an ε-greedy policy with ε = 0.3, i.e., DQN is forced to choose a random action with 30% probability.
- The dataset consists of about 500,000 training frames and 50,000 test frames with actions chosen by DQN.
- Following DQN, actions are chosen once every 4 frames, which reduces the video from 60 fps to 15 fps.
- The authors used full-resolution RGB images (210 × 160) and preprocessed the images by subtracting mean pixel values and dividing each pixel value by 255
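The data-collection and preprocessing steps above can be sketched as follows. The `epsilon_greedy` and `preprocess` helpers are illustrative placeholders, not the authors' code; only the constants (ε = 0.3, an action every 4 frames, mean subtraction and division by 255) come from the text.

```python
import numpy as np

EPSILON = 0.3     # probability of a random action (ε-greedy with ε = 0.3)
FRAME_SKIP = 4    # one action every 4 frames: 60 fps -> 15 fps

def epsilon_greedy(q_values, num_actions, rng):
    """Pick a random action with probability EPSILON, else the greedy one."""
    if rng.random() < EPSILON:
        return int(rng.integers(num_actions))
    return int(np.argmax(q_values))

def preprocess(frame, mean_pixel):
    """Full-resolution RGB frame (210 x 160 x 3): subtract the mean pixel
    values, then divide by 255, yielding small zero-centred floats."""
    return (frame.astype(np.float32) - mean_pixel) / 255.0
```

Each action chosen by `epsilon_greedy` would then be repeated for `FRAME_SKIP` emulator frames before the next Q-value query.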
- Evaluation of Predicted Frames
Qualitative Evaluation: Prediction video. The prediction videos of the models and baselines are available in the supplementary material and at the following website: https://sites.google.com/a/umich.edu/junhyuk-oh/action-conditional-video-prediction.
- As seen in the videos, the proposed models make qualitatively reasonable predictions over 30–500 steps depending on the game.
- An example of long-term predictions is illustrated in Figure 2.
- The authors observed that both of the models predict complex local translations well, such as the movement of vehicles and the controlled object.
- In Figure 2, the model predicts the sudden change of the location of the controlled object at the 257th step.
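Long-term predictions like those above come from repeatedly feeding each predicted frame back in as input. A minimal sketch of such an action-conditional rollout loop, with `model` as a placeholder one-step predictor (not the paper's actual architectures):

```python
def rollout(model, initial_frames, actions):
    """Roll a one-step predictor forward for len(actions) steps.

    model(frames, action) -> next_frame is assumed to take the history
    of frames so far plus an action and return the next frame.
    Each prediction is appended to the history and fed back in, which is
    why errors can compound over horizons of hundreds of steps.
    """
    frames = list(initial_frames)
    predictions = []
    for action in actions:
        next_frame = model(frames, action)
        predictions.append(next_frame)
        frames.append(next_frame)
    return predictions
```

Because the model never sees a ground-truth frame after the initial conditioning window, the 30–500 step horizons reported above measure how well it resists this compounding error.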
- This paper introduced two different novel deep architectures that predict future frames conditioned on actions and showed qualitatively and quantitatively that they are able to predict visually realistic and useful-for-control frames over 100-step futures on several Atari game domains.
- Since the architectures are domain-independent, the authors expect that they will generalize to many vision-based RL problems.
- Table 1: Average game score of DQN over 100 plays with standard error. The first row and the second row show the performance of our DQN replication with different exploration strategies.
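"Average score with standard error" over 100 plays is the usual mean ± σ/√n. A minimal sketch of that computation (the input scores are made-up, not the table's values):

```python
import math

def mean_and_stderr(scores):
    """Return (mean, standard error of the mean) of per-play game scores."""
    n = len(scores)
    mean = sum(scores) / n
    # sample variance with Bessel's correction, then σ / sqrt(n)
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(var / n)
```

With n = 100 plays, the standard error shrinks the per-play score spread by a factor of 10, which is what makes the table's comparisons between exploration strategies meaningful.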
- Video Prediction using Deep Networks. The problem of video prediction has led to a variety of architectures in deep learning. A recurrent temporal restricted Boltzmann machine (RTRBM) was proposed to learn temporal correlations from sequential data by introducing recurrent connections in the RBM. A structured RTRBM (sRTRBM) scaled up the RTRBM by learning dependency structures between observations and hidden variables from data. More recently, Michalski et al. proposed a higher-order gated autoencoder that defines multiplicative interactions between consecutive frames and mapping units, and showed that the temporal prediction problem can be viewed as learning and inferring higher-order interactions between consecutive images. Srivastava et al. applied a sequence-to-sequence learning framework to the video domain, and showed that long short-term memory (LSTM) networks are capable of generating videos of bouncing handwritten digits. In contrast to these previous studies, this paper tackles problems where control variables affect temporal dynamics, and in addition scales up spatio-temporal prediction to larger images.
- This work was supported by NSF grant IIS-1526059, Bosch Research, and ONR grant N00014-13-1-0762.
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- M. G. Bellemare, J. Veness, and M. Bowling. Investigating contingency awareness using Atari 2600 games. In AAAI, 2012.
- M. G. Bellemare, J. Veness, and M. Bowling. Bayesian learning of recursively factored environments. In ICML, 2013.
- M. G. Bellemare, J. Veness, and E. Talvitie. Skip context tree switching. In ICML, 2014.
- Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
- Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
- D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, 2012.
- A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In NIPS, 2014.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- L. Kocsis and C. Szepesvari. Bandit based Monte-Carlo planning. In ECML. 2006.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- I. Lenz, R. Knepper, and A. Saxena. DeepMPC: Learning deep latent features for model predictive control. In RSS, 2015.
- R. Memisevic. Learning to relate images. IEEE TPAMI, 35(8):1829–1846, 2013.
- V. Michalski, R. Memisevic, and K. Konda. Modeling deep temporal dependencies with recurrent grammar cells. In NIPS, 2014.
- R. Mittelman, B. Kuipers, S. Savarese, and H. Lee. Structured recurrent temporal restricted Boltzmann machines. In ICML, 2014.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
- S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In ICML, 2014.
- S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza. Disentangling factors of variation for facial expression recognition. In ECCV. 2012.
- J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
- J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2:125–134, 1991.
- N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
- I. Sutskever, G. E. Hinton, and G. W. Taylor. The recurrent temporal restricted Boltzmann machine. In NIPS, 2009.
- I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
- I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
- G. W. Taylor and G. E. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In ICML, 2009.
- T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. Coursera, 2012.
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
- C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- J. Yang, S. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In NIPS, 2015.