Gamma-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction

NeurIPS 2020

Abstract

We introduce the γ-model, a predictive model of environment dynamics with an infinite probabilistic horizon. Replacing standard single-step models with γ-models leads to generalizations of the procedures that form the foundation of model-based control, including the model rollout and model-based value estimation. The γ-model, trained with a generative reinterpretation of temporal difference learning, is a natural continuous analogue of the successor representation and a hybrid between model-free and model-based mechanisms.

Introduction
  • The common ingredient in all of model-based reinforcement learning is the dynamics model: a function used for predicting future states.
  • Generalized rollouts and value estimation: probabilistic prediction horizons lead to generalizations of the core procedures of model-based reinforcement learning.
  • Both the γ-model and the successor representation circumvent the compounding prediction errors that occur with single-step models during long model-based rollouts.
Highlights
  • The common ingredient in all of model-based reinforcement learning is the dynamics model: a function used for predicting future states
  • Can we avoid choosing a prediction horizon altogether? Value functions already do so by modeling the cumulative return over a discounted long-term future instead of an immediate reward, circumventing the need to commit to any single finite horizon
  • Generalized rollouts and value estimation: probabilistic prediction horizons lead to generalizations of the core procedures of model-based reinforcement learning
  • Converting the tabular successor representation into a continuous generative model is non-trivial because the successor representation implicitly assumes the ability to normalize over a finite state space for interpretation as a predictive model. Both the γ-model and the successor representation circumvent the compounding prediction errors that occur with single-step models during long model-based rollouts
  • In the case of bootstrapped maximum likelihood problems, our target distribution is induced by the model itself, meaning that we only need sample access to our γ-model in order to train μ_θ as a generative adversarial network (GAN) (see the sketch after this list)
  • Our experimental evaluation shows that, on tasks with low to moderate dimensionality, our method learns accurate long-horizon predictive distributions without sequential rollouts and can be incorporated into standard model-based reinforcement learning methods to produce results that are competitive with state-of-the-art algorithms
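
The bootstrapped GAN training mentioned in the highlight above can be made concrete with a short sketch. This is a minimal illustration rather than the authors' code: `gamma_model_target` (a frozen copy of the γ-model) and `policy` are assumed callables, and the (1 − γ)/γ mixing follows the generative temporal-difference idea described in the bullet.

```python
import torch

def bootstrapped_target_sample(s_next, gamma_model_target, policy, gamma=0.99):
    """Draw one 'real' sample for the discriminator from the TD target distribution
    (1 - gamma) * p(. | s_t, a_t) + gamma * mu_theta(. | s_{t+1}).

    With probability 1 - gamma the target is simply the observed next state;
    with probability gamma it is a sample from a frozen copy of the gamma-model,
    conditioned on that next state (the bootstrap term).
    """
    if torch.rand(()).item() < (1.0 - gamma):
        return s_next                          # single-step term: observed next state
    a_next = policy(s_next)                    # illustrative policy call
    with torch.no_grad():                      # bootstrap through the frozen target copy
        return gamma_model_target(s_next, a_next)
```

These target samples play the role of "real" data for the discriminator, while samples from the current model μ_θ(· | s_t, a_t) play the role of "fake" data, so only sample access to the γ-model is ever required.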
Results
  • Figure 2 visually depicts the reweighting scheme and the number of steps required for truncated model rollouts to approximate the distribution induced by a larger discount.
  • The γ-MVE estimator allows the authors to perform γ-model-based rollouts with horizon H, reweight the samples from this rollout by solving for weights α_n given a desired larger discount γ̃, and correct for the truncation error stemming from the finite rollout length using a terminal value function (see the weight sketch after this list).
  • In the case of bootstrapped maximum likelihood problems, the target distribution is induced by the model itself, meaning that the authors only need sample access to the γ-model in order to train μ_θ as a generative adversarial network (GAN).
  • The bootstrapped target log-density T(s_t, a_t, s_{t+1}, s_e) = log[(1 − γ) p(s_e | s_t, a_t) + γ μ_θ(s_e | s_{t+1})] requires density evaluation of both the γ-model and the single-step transition distribution (a log-density sketch follows this list).
  • The authors' experimental evaluation is designed to study the viability of γ-models as a replacement for conventional single-step models for long-horizon state prediction and model-based control.
  • Figure 3 shows the predictions of a γ-model trained as a normalizing flow according to Algorithm 2 for five different discounts, ranging from γ = 0 to γ = 0.95.
  • The authors visualize this relation in Figure 4, which depicts γ-model predictions on the pendulum environment for a discount of γ = 0.99 and the resulting value map estimated by taking expectations over these predicted state distributions (a Monte Carlo sketch follows this list).
  • It is policy-conditioned and infinite-horizon, like a value function, but independent of reward, like a standard single-step model.
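
The reweighting in the γ-MVE bullet above can be sketched as follows. The closed form for α_n below is what falls out of matching the geometric step counts of a discount-γ model to a larger target discount γ̃; it is an illustration of the mechanism under that assumption, not a verbatim excerpt of the paper, and `rollout_weights` is a hypothetical helper.

```python
def rollout_weights(gamma, gamma_tilde, H):
    """Weights alpha_n (n = 1..H) that mix n-step gamma-model rollouts so the mixture
    approximates the occupancy induced by a larger discount gamma_tilde; the leftover
    mass is what a terminal value function must account for."""
    assert 0.0 <= gamma < gamma_tilde < 1.0
    alphas = [
        (1.0 - gamma_tilde) * (gamma_tilde - gamma) ** (n - 1) / (1.0 - gamma) ** n
        for n in range(1, H + 1)
    ]
    truncation_mass = 1.0 - sum(alphas)
    return alphas, truncation_mass

# Example: a 5-step rollout of a gamma = 0.95 model reweighted towards gamma_tilde = 0.99.
alphas, leftover = rollout_weights(gamma=0.95, gamma_tilde=0.99, H=5)
# alphas = [0.2, 0.16, 0.128, 0.1024, 0.08192]; roughly a third of the probability mass
# remains for the terminal value correction at this rollout length.
```

Under this geometric form the weights sum to one as H grows, so the finite-H remainder is exactly the truncation error that the terminal value function corrects.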
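
For the normalizing-flow instantiation, the reconstructed target above is a mixture whose log can be evaluated with a stable log-sum-exp. A minimal sketch, assuming illustrative log-density callables `log_p_single_step` and `log_mu_target` (the latter a frozen copy of the γ-model) that return scalar torch tensors:

```python
import math
import torch

def target_log_density(s_e, s_t, a_t, s_next,
                       log_p_single_step, log_mu_target, gamma=0.99):
    """log[(1 - gamma) p(s_e | s_t, a_t) + gamma mu_theta(s_e | s_{t+1})],
    evaluated as a numerically stable log-sum-exp of the two mixture terms."""
    log_single = math.log(1.0 - gamma) + log_p_single_step(s_e, s_t, a_t)
    log_boot = math.log(gamma) + log_mu_target(s_e, s_next)
    return torch.logsumexp(torch.stack([log_single, log_boot]), dim=0)
```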
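
The value map in Figure 4 follows from the occupancy-based value identity: the value of a state-action pair is the expected reward under the γ-model's predicted distribution, scaled by 1/(1 − γ) (up to the exact time-indexing convention). A minimal Monte Carlo sketch with illustrative `gamma_model` and `reward_fn` callables:

```python
def value_estimate(s_t, a_t, gamma_model, reward_fn, gamma=0.99, n_samples=256):
    """Monte Carlo estimate of Q(s_t, a_t) = E_{s_e ~ mu(.|s_t, a_t)}[r(s_e)] / (1 - gamma)."""
    samples = [gamma_model(s_t, a_t) for _ in range(n_samples)]  # s_e ~ gamma-model
    return sum(reward_fn(s_e) for s_e in samples) / n_samples / (1.0 - gamma)
```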
Conclusion
  • This new formulation of infinite-horizon prediction allows the authors to generalize the procedures integral to model-based control, yielding new variants of model rollouts and model-based value estimation.
  • The authors' experimental evaluation shows that, on tasks with low to moderate dimensionality, the method learns accurate long-horizon predictive distributions without sequential rollouts and can be incorporated into standard model-based reinforcement learning methods to produce results that are competitive with state-of-the-art algorithms.
  • The authors are optimistic for the long-term viability of temporal difference learning as an algorithm for training long-horizon dynamics models given its empirical success in long-horizon model-free control.
Related work
Funding
  • This work was partially supported by computational resource donations from Amazon
  • M.J. is supported by fellowships from the National Science Foundation and the Open Philanthropy Project
Author
Michael Janner
Igor Mordatch