# Gamma-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction

NeurIPS 2020

Abstract

We introduce the γ-model, a predictive model of environment dynamics with an infinite probabilistic horizon. Replacing standard single-step models with γ-models leads to generalizations of the procedures that form the foundation of model-based control, including the model rollout and model-based value estimation. The γ-model, trained with a g...

Introduction

- The common ingredient in all of model-based reinforcement learning is the dynamics model: a function used for predicting future states.
- Generalized rollouts and value estimation: probabilistic prediction horizons lead to generalizations of the core procedures of model-based reinforcement learning.
- Both the γ-model and the successor representation circumvent the compounding prediction errors that occur with single-step models during long model-based rollouts, as illustrated by the tabular sketch below.
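
The tabular analogue of this infinite-horizon predictive distribution is the normalized successor representation. As a minimal illustration, the sketch below computes it for a small hypothetical Markov chain and checks that it satisfies the TD-style fixed point the γ-model generalizes; the transition matrix and the convention of starting the occupancy at t = 0 are assumptions for illustration only, not the paper's setup.

```python
import numpy as np

# Hypothetical 3-state Markov chain under a fixed policy (illustrative only).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.3, 0.0, 0.7]])
gamma = 0.95

# Normalized successor representation: row s is a probability distribution over
# future states, mu(. | s) = (1 - gamma) * sum_t gamma^t P^t = (1 - gamma)(I - gamma P)^{-1}.
mu = (1 - gamma) * np.linalg.inv(np.eye(3) - gamma * P)

# The same matrix is the fixed point of a TD-style recursion,
#   mu = (1 - gamma) * I + gamma * P @ mu,
# the tabular counterpart of the gamma-model's bootstrapped target.
assert np.allclose(mu, (1 - gamma) * np.eye(3) + gamma * P @ mu)
print(mu.sum(axis=1))  # each row sums to 1, i.e. a valid predictive distribution
```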

Highlights

- The common ingredient in all of model-based reinforcement learning is the dynamics model: a function used for predicting future states
- Can we avoid choosing a prediction horizon altogether? Value functions already do so by modeling the cumulative return over a discounted long-term future instead of an immediate reward, circumventing the need to commit to any single finite horizon
- Generalized rollouts and value estimation: probabilistic prediction horizons lead to generalizations of the core procedures of model-based reinforcement learning
- Converting the tabular successor representation into a continuous generative model is non-trivial because the successor representation implicitly assumes the ability to normalize over a finite state space for interpretation as a predictive model. Both the γ-model and the successor representation circumvent the compounding prediction errors that occur with single-step models during long model-based rollouts
- In the case of bootstrapped maximum likelihood problems, our target distribution is induced by the model itself, meaning that we only need sample access to our γ-model in order to train μ_θ as a generative adversarial network (GAN); a sketch of this bootstrapped target sampling appears after this list
- Our experimental evaluation shows that, on tasks with low to moderate dimensionality, our method learns accurate long-horizon predictive distributions without sequential rollouts and can be incorporated into standard model-based reinforcement learning methods to produce results that are competitive with state-of-the-art algorithms
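
To make the sample-access point above concrete: the bootstrapped target distribution (1 − γ) p(s_e | s_t, a_t) + γ μ_θ(s_e | s_{t+1}) can be sampled by flipping a γ-weighted coin between the observed next state and a sample drawn from a frozen copy of the γ-model. The following is a minimal sketch of that sampling step; `target_model`, the array shapes, and the batching convention are assumptions, not the authors' implementation.

```python
import numpy as np

def sample_bootstrapped_targets(next_states, gamma, target_model, rng=np.random):
    """Sample from the bootstrapped target distribution
        (1 - gamma) * p(s_e | s_t, a_t) + gamma * mu_theta(s_e | s_{t+1}),
    given only observed next states and sample access to a frozen gamma-model.
    `target_model(states)` is assumed to return one predicted future state per row.
    """
    batch = next_states.shape[0]
    # With probability (1 - gamma) the target is the observed next state itself;
    # with probability gamma we bootstrap a sample from the model at that next state.
    bootstrap = rng.random(batch) < gamma
    targets = next_states.copy()
    if bootstrap.any():
        targets[bootstrap] = target_model(next_states[bootstrap])
    return targets

# These targets then play the role of "real" samples for a GAN discriminator,
# while mu_theta(. | s_t, a_t) supplies the generated samples.
```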

Results

- Figure 2 visually depicts the reweighting scheme and the number of steps required for truncated model rollouts to approximate the distribution induced by a larger discount.
- The γ-MVE estimator allows the authors to perform γ-model-based rollouts with horizon H, reweight the samples from this rollout by solving for weights α_n given a desired discount, and correct for the truncation error stemming from the finite rollout length using a terminal value function with the desired discount.
- In the case of bootstrapped maximum likelihood problems, the target distribution is induced by the model itself, meaning that the authors only need sample access to the γ-model in order to train μ_θ as a generative adversarial network (GAN).
- The bootstrapped target, T(s_t, a_t, s_{t+1}, s_e) = log[(1 − γ) p(s_e | s_t, a_t) + γ μ_θ(s_e | s_{t+1})], requires density evaluation of both the γ-model and the single-step transition distribution.
- The authors' experimental evaluation is designed to study the viability of γ-models as a replacement for conventional single-step models for long-horizon state prediction and model-based control.
- Figure 3 shows the predictions of a γ-model trained as a normalizing flow according to Algorithm 2 for five different discounts, ranging from γ = 0 to γ = 0.95.
- The authors visualize this relation in Figure 4, which depicts γ-model predictions on the pendulum environment for a discount of γ = 0.99 and the resulting value map estimated by taking expectations over these predicted state distributions (see the value-estimation sketch after this list).
- The γ-model is policy-conditioned and infinite-horizon, like a value function, but independent of reward, like a standard single-step model.
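
The value-map construction mentioned above has a simple Monte Carlo form: because μ(· | s) is a normalized discounted occupancy over future states, a value estimate is an expectation of reward over a single batch of γ-model samples, scaled by 1/(1 − γ). The sketch below illustrates this under assumed interfaces (`gamma_model` and `reward_fn` are hypothetical); the exact scaling convention depends on whether the occupancy starts at the current or the next time step.

```python
import numpy as np

def value_from_gamma_model(state, reward_fn, gamma_model, gamma, n_samples=1024):
    """Monte Carlo value estimate from a gamma-model, without sequential rollouts:
        V(s) ~= 1 / (1 - gamma) * E_{s_e ~ mu(. | s)}[ r(s_e) ].
    `gamma_model(state, n)` is assumed to return n sampled future states, and
    `reward_fn` to map a batch of states to a vector of rewards.
    """
    future_states = gamma_model(state, n_samples)  # one generative step, no rollout
    return reward_fn(future_states).mean() / (1.0 - gamma)
```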

Conclusion

- This new formulation of infinite-horizon prediction allows the authors to generalize the procedures integral to model-based control, yielding new variants of model rollouts and model-based value estimation.
- The authors' experimental evaluation shows that, on tasks with low to moderate dimensionality, the method learns accurate long-horizon predictive distributions without sequential rollouts and can be incorporated into standard model-based reinforcement learning methods to produce results that are competitive with state-of-the-art algorithms.
- The authors are optimistic about the long-term viability of temporal difference learning as an algorithm for training long-horizon dynamics models, given its empirical success in long-horizon model-free control.

Related work

- The complementary strengths and weaknesses of model-based and model-free reinforcement learning have led to a number of works that attempt to combine these approaches. Common strategies include initializing a model-free algorithm with the solution found by a model-based planner (Levine & Koltun, 2013; Farshidian et al, 2014; Nagabandi et al, 2018), feeding model-generated data into an otherwise model-free optimizer (Sutton, 1990; Silver et al, 2008; Lampe & Riedmiller, 2014; Kalweit & Boedecker, 2017; Luo et al, 2019), using model predictions to improve the quality of target values for temporal difference learning (Buckman et al, 2018; Feinberg et al, 2018), leveraging model gradients for backpropagation (Nguyen & Widrow, 1990; Jordan & Rumelhart, 1992; Heess et al, 2015), and incorporating model-based planning without explicitly predicting future observations (Tamar et al, 2016; Silver et al, 2017; Oh et al, 2017; Kahn et al, 2018; Amos et al, 2018; Schrittwieser et al, 2019). In contrast to combining independent model-free and model-based components, we describe a framework for training a new class of predictive model with a generative, model-based reinterpretation of model-free tools.

Temporal difference models (TDMs; Pong et al, 2018) provide an alternative method of training models with what are normally considered to be model-free algorithms. TDMs interpret models as a special case of goal-conditioned value functions (Kaelbling, 1993; Foster & Dayan, 2002; Schaul et al, 2015; Andrychowicz et al, 2017), though the TDM is constrained to predict at a fixed horizon and is limited to tasks for which the reward depends only on the last state. In contrast, the γ-model predicts over a discounted infinite-horizon future and accommodates arbitrary rewards.

Funding

- This work was partially supported by computational resource donations from Amazon
- M.J. is supported by fellowships from the National Science Foundation and the Open Philanthropy Project

References

- Amos, B., Rodriguez, I. D. J., Sacks, J., Boots, B., and Kolter, J. Z. Differentiable mpc for end-to-end planning and control. In Advances in Neural Information Processing Systems, 2018.
- Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight experience replay. In Advances in Neural Information Processing Systems. 2017.
- Asadi, K., Misra, D., Kim, S., and Littman, M. L. Combating the compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320, 2019.
- Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems 30. 2017.
- Barreto, A., Borsa, D., Quan, J., Schaul, T., Silver, D., Hessel, M., Mankowitz, D., Zidek, A., and Munos, R. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In Proceedings of the International Conference on Machine Learning, 2018.
- Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, 2015.
- Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, 2018.
- Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems. 2018.
- Dayan, P. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5:613, 1993.
- Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. In Advances in Neural Information Processing Systems. 2019.
- Farshidian, F., Neunert, M., and Buchli, J. Learning of closed-loop motion control. In International Conference on Intelligent Robots and Systems, 2014.
- Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. Model-based value estimation for efficient model-free reinforcement learning. In International Conference on Machine Learning, 2018.
- Foster, D. and Dayan, P. Structure in the space of value functions. Machine Learning, 49:325, 2002.
- Gershman, S. J. The successor representation: Its computational logic and neural substrates. Journal of Neuroscience, 2018.
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2014.
- Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.
- Hansen, S., Dabney, W., Barreto, A., Warde-Farley, D., de Wiele, T. V., and Mnih, V. Fast task inference with variational intrinsic successor features. In International Conference on Learning Representations, 2020.
- Heess, N., Wayne, G., Silver, D., Lillicrap, T., Tassa, Y., and Erez, T. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2015.
- Janner, M., Fu, J., Zhang, M., and Levine, S. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, 2019.
- Jordan, M. I. and Rumelhart, D. E. Forward models: Supervised learning with a distal teacher. Cognitive Science, 16:307, 1992.
- Kaelbling, L. P. Learning to achieve goals. In Proceedings of the International Joint Conference on Artificial Intelligence, 1993.
- Kahn, G., Villaflor, A., Ding, B., Abbeel, P., and Levine, S. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In International Conference on Robotics and Automation, 2018.
- Kalweit, G. and Boedecker, J. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, 2017.
- Kulkarni, T. D., Saeedi, A., Gautam, S., and Gershman, S. J. Deep successor reinforcement learning, 2016.
- Kumar, A., Fu, J., Tucker, G., and Levine, S. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, 2019.
- Lampe, T. and Riedmiller, M. Approximate model-assisted neural fitted Q-iteration. In International Joint Conference on Neural Networks, 2014.
- Levine, S. and Koltun, V. Guided policy search. In International Conference on Machine Learning, 2013.
- Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations, 2019.
- Ma, C., Wen, J., and Bengio, Y. Universal successor representations for transfer reinforcement learning. arXiv preprint arXiv:1804.03758, 2018.
- Mao, X., Li, Q., Xie, H., Lau, R. Y. K., and Wang, Z. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2016.
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 2015.
- Momennejad, I., Russek, E. M., Cheong, J. H., Botvinick, M. M., Daw, N. D., and Gershman, S. J. The successor representation in human reinforcement learning. Nature Human Behaviour, 1(9): 680–692, 2017.
- Moore, A. W. Efficient Memory-based Learning for Robot Control. PhD thesis, University of Cambridge, 1990.
- Nagabandi, A., Kahn, G., S. Fearing, R., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In International Conference on Robotics and Automation, 2018.
- Nguyen, D. H. and Widrow, B. Neural networks for self-learning control systems. IEEE Control Systems Magazine, 1990.
- Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems. 2016.
- Oh, J., Singh, S., and Lee, H. Value prediction network. In Advances in Neural Information Processing Systems, 2017.
- Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal difference models: Model-free deep RL for model-based control. In International Conference on Learning Representations, 2018.
- Rezende, D. and Mohamed, S. Variational inference with normalizing flows. Proceedings of Machine Learning Research, 2015.
- Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In Proceedings of the International Conference on Machine Learning, 2015.
- Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., and Silver, D. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Silver, D., Sutton, R. S., and Müller, M. Sample-based learning and search with permanent and transient memories. In Proceedings of the International Conference on Machine Learning, 2008.
- Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. The predictron: End-to-end learning and planning. In International Conference on Machine Learning, 2017.
- Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning, 1990.
- Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems. 1996.
- Talvitie, E. Model regularization for stable sample rollouts. In Conference on Uncertainty in Artificial Intelligence, 2014.
- Talvitie, E. Self-correcting models for model-based reinforcement learning. In AAAI Conference on Artificial Intelligence, 2016.
- Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In Advances in Neural Information Processing Systems. 2016.
- Venkatraman, A., Capobianco, R., Pinto, L., Hebert, M., Nardi, D., and Bagnell, J. A. Improved learning of dynamics for control. In Proceedings of International Symposium on Experimental Robotics. 2016.
- Whitney, W. and Fergus, R. Understanding the asymptotic performance of model-based RL methods. 2018.
