Model Based Reinforcement Learning for Atari

Łukasz Kaiser
Mohammad Babaeizadeh
Piotr Miłoś
Błażej Osiński
Konrad Czechowski
Piotr Kozakowski
Afroz Mohiuddin
Ryan Sepassi

ICLR, 2020.

We presented Simulated Policy Learning, a model-based reinforcement learning approach that operates directly on raw pixel observations and learns effective policies to play games in the Arcade Learning Environment.

Abstract:

Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people learn how the game works and can predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with far fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models, and evaluate it on a suite of Atari games in the low-data regime of 100K interactions between the agent and the environment (roughly two hours of real-time play), where it outperforms state-of-the-art model-free algorithms on most of the games.
Introduction
  • Human players can learn to play Atari games in minutes (Tsividis et al., 2017).
  • Some of the best model-free reinforcement learning algorithms require tens or hundreds of millions of time steps – the equivalent of several weeks of training in real time.
  • How is it that humans can learn these games so much faster?
  • In a recent survey (Section 7.2 in Machado et al. (2018)) this was formulated as the following challenge: “So far, there has been no clear demonstration of successful planning with a learned model in the ALE”
Highlights
  • Human players can learn to play Atari games in minutes (Tsividis et al., 2017)
  • We present an approach, called Simulated Policy Learning (SimPLe), that utilizes these video prediction techniques and trains a policy to play the game within the learned model (a schematic sketch of this procedure follows the list)
  • We evaluate SimPLe on a suite of Atari games from the Arcade Learning Environment (ALE) benchmark
  • We evaluate our method on 26 games selected on the basis of being solvable with existing state-of-the-art model-free deep reinforcement learning algorithms, which in our comparisons are Rainbow (Hessel et al., 2018) and proximal policy optimization (PPO; Schulman et al., 2017)
  • We presented SimPLe, a model-based reinforcement learning approach that operates directly on raw pixel observations and learns effective policies to play games in the Arcade Learning Environment
  • Our experiments demonstrate that Simulated Policy Learning learns to play many of the games with just 100K interactions with the environment, corresponding to 2 hours of play time
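The training procedure behind SimPLe alternates between collecting a small amount of real experience, fitting the video prediction (world) model, and optimizing the policy entirely inside the learned model. The sketch below is a schematic Python reconstruction of that loop, not the authors' released code: the helper callables (collect_real_experience, train_world_model, train_policy_in_simulation) are hypothetical placeholders passed in as arguments, and the default counts follow the budget quoted in the Methods section (16 batches of 6,400 real agent steps).

    # Schematic sketch of the SimPLe main loop (hypothetical helpers, not the authors' code).
    def simple_training_loop(real_env,
                             world_model,
                             policy,
                             collect_real_experience,     # (env, policy, n_steps) -> list of transitions
                             train_world_model,           # (world_model, dataset) -> None
                             train_policy_in_simulation,  # (policy, world_model) -> None
                             num_iterations=16,
                             real_steps_per_iteration=6_400):
        """Alternate real data collection, world-model fitting, and policy training
        inside the learned model. With the defaults this consumes
        16 * 6,400 = 102,400 real interactions in total."""
        dataset = []
        for _ in range(num_iterations):
            # Gather a batch of real experience with the current policy;
            # the very first batch is collected before any training has happened.
            dataset += collect_real_experience(real_env, policy, real_steps_per_iteration)
            # Fit the action-conditional video prediction model on all real data so far.
            train_world_model(world_model, dataset)
            # Optimize the policy (PPO in the paper) purely inside the learned model;
            # this uses millions of simulated steps but no additional real interactions.
            train_policy_in_simulation(policy, world_model)
        return policy

Keeping the real-interaction count fixed per iteration while allowing unlimited simulated rollouts is what lets the method stay within the 100K-step budget; the exact scheduling details in the paper may differ from this sketch.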
Methods
  • The authors evaluate SimPLe on a suite of Atari games from the Arcade Learning Environment (ALE) benchmark.
  • Because some data is collected before the first iteration of the loop, altogether 6400 · 16 = 102,400 interactions with the Atari environment are used during training.
  • This is equivalent to 409,600 frames from the Atari game (114 minutes at 60 FPS); the arithmetic is spelled out in the snippet after this list.
  • Due to the vast difference between the amount of training data from the simulated environment and the real environment (15M vs. 100K steps), the impact of the latter on the policy is negligible.
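As a quick sanity check of those numbers, the short Python snippet below reproduces the arithmetic; the frame skip of 4 is taken from the note under Table 3, and the 15M figure is the amount of simulated experience quoted above.

    # Reproduce the interaction-budget arithmetic quoted in the Methods bullets.
    steps_per_iteration = 6_400
    iterations = 16
    frames_per_step = 4            # standard Atari frame skip: 1 agent step = 4 game frames
    fps = 60                       # Atari renders at 60 frames per second

    real_steps = steps_per_iteration * iterations   # 102,400 agent steps
    real_frames = real_steps * frames_per_step      # 409,600 game frames
    minutes = real_frames / fps / 60                # ~113.8, i.e. about 114 minutes of play

    simulated_steps = 15_000_000                    # policy training inside the learned model
    ratio = simulated_steps / real_steps            # ~146x more simulated than real experience

    print(f"{real_steps:,} real steps = {real_frames:,} frames ≈ {minutes:.0f} minutes")
    print(f"simulated / real ≈ {ratio:.0f}x")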
Results
  • The authors present numerical results of the experiments.
  • In Table 2 the authors present the mean and standard deviation of the 5 experiments.
  • The authors observed that the median behaves similarly; it is reported in Table 4.
  • In that table the authors also show the maximum scores over the 5 runs.
  • Figure panels: (a) SimPLe compared to Rainbow at 100K; (b) SimPLe compared to Rainbow at 200K; (c) SimPLe compared to PPO at 100K; (d) SimPLe compared to PPO at 200K.
Conclusion
  • The authors presented SimPLe, a model-based reinforcement learning approach that operates directly on raw pixel observations and learns effective policies to play games in the Arcade Learning Environment.
  • The representation learned by the predictive model is likely to be more meaningful by itself than the raw pixel observations from the environment.
  • Incorporating this representation into the policy could further accelerate and improve the reinforcement learning process.
Tables
  • Table 1: Summary of SimPLe ablations. For each game, a configuration was assigned a score equal to the mean over 5 experiments; the best and median scores were calculated per game. The table reports the number of games in which a given configuration achieved the best score or at least the median score, respectively.
  • Table 2: Model comparison. Mean scores and standard deviations over five training runs. The two rightmost columns present the scores of a random agent and a human.
  • Table 3: Comparison of our method (SimPLe) with the model-free benchmarks PPO and Rainbow, trained with 100 thousand / 500 thousand / 1 million steps (1 step equals 4 frames).
  • Table 4: Model comparison. Scores of the median (left) and best (right) models out of five training runs. The two rightmost columns present the scores of a random agent and a human.
Related work
  • Atari games gained prominence as a benchmark for reinforcement learning with the introduction of the Arcade Learning Environment (ALE) (Bellemare et al., 2015). The combination of reinforcement learning and deep models then enabled RL algorithms to learn to play Atari games directly from images of the game screen, using variants of the DQN algorithm (Mnih et al., 2013; 2015; Hessel et al., 2018) and actor-critic algorithms (Mnih et al., 2016; Schulman et al., 2017; Babaeizadeh et al., 2017b; Wu et al., 2017; Espeholt et al., 2018). The most successful methods in this domain remain model-free algorithms (Hessel et al., 2018; Espeholt et al., 2018). Although the sample complexity of these methods has substantially improved recently, it remains far higher than the amount of experience human players require to learn each game (Tsividis et al., 2017). In this work, we aim to learn Atari games with a budget of just 100K agent steps (400K frames), corresponding to about two hours of play time. Prior methods are generally not evaluated in this regime, so we tuned Rainbow (Hessel et al., 2018) for the best performance within 1M steps; see Appendix E for details.
Funding
  • The work of Konrad Czechowski, Piotr Kozakowski and Piotr Miłoś was supported by the Polish National Science Center grant UMO-2017/26/E/ST6/00622
  • The work of Henryk Michalewski was supported by the Polish National Science Center grant UMO-2018/29/B/ST6/02959
  • This research was supported by the PL-Grid Infrastructure
Reference
  • Stephan Alaniz. Deep reinforcement learning with model learning and monte carlo tree search in minecraft. arXiv preprint arXiv:1803.08456, 2018.
  • Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Emma Brunskill, Zachary C. Lipton, and Animashree Anandkumar. Sample-efficient deep RL with generative adversarial tree search. CoRR, abs/1806.05780, 2018.
  • Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, and Sergey Levine. Stochastic variational video prediction. ICLR, 2017a.
  • Mohammad Babaeizadeh, Iuri Frosio, Stephen Tyree, Jason Clemons, and Jan Kautz. Reinforcement learning through asynchronous advantage actor-critic on a GPU. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017b. URL https://openreview.net/forum?id=r1VGvBcxl.
  • Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents (extended abstract). In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI, pp. 4148–4152, 2015.
  • Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1171–1179, 2015.
  • Lars Buesing, Theophane Weber, Yori Zwols, Nicolas Heess, Sébastien Racanière, Arthur Guez, and Jean-Baptiste Lespiau. Woulda, coulda, shoulda: Counterfactually-guided policy search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=BJG0voC9YQ.
  • Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. CoRR, abs/1812.06110, 2018.
  • Silvia Chiappa, Sébastien Racanière, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=B1s6xvqlx.
  • Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4759–4770, 2018.
  • Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2), 2013.
  • Frederik Ebert, Chelsea Finn, Alex X. Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In 1st Annual Conference on Robot Learning, CoRL 2017, Mountain View, California, USA, November 13-15, 2017, Proceedings, volume 78 of Proceedings of Machine Learning Research, pp. 344–356. PMLR, 2017.
  • Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.
  • Mustafa Ersen and Sanem Sariel. Learning behaviors of and interactions among objects through spatio–temporal reasoning. IEEE Transactions on Computational Intelligence and AI in Games, 7 (1):75–87, 2014.
  • Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 1406–1415, 2018.
  • Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph E. Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. CoRR, abs/1803.00101, 2018.
  • Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation, ICRA 2017, Singapore, Singapore, May 29 - June 3, 2017, pp. 2786–2793. IEEE, 2017. doi: 10.1109/ICRA.2017.7989324.
  • Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In IEEE International Conference on Robotics and Automation, ICRA, pp. 512–519, 2016.
  • Matthew Guzdial, Boyang Li, and Mark O. Riedl. Game engine learning from video. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 3707–3713, 2017. doi: 10.24963/ijcai.2017/518.
  • David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 2455–2467, 2018.
  • Danijar Hafner, Timothy P. Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2555–2565. PMLR, 2019.
  • Nicolas Heess, Gregory Wayne, David Silver, Timothy P. Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 2944–2952, 2015.
  • Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Gheshlaghi Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Sheila A. McIlraith and Kilian Q. Weinberger (eds.), Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 3215–3222. AAAI Press, 2018.
  • Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997.
  • G. Zacharias Holland, Erik Talvitie, and Michael Bowling. The effect of planning shape on dyna-style planning in high-dimensional state spaces. CoRR, abs/1806.01825, 2018.
  • Lukasz Kaiser and Samy Bengio. Discrete autoencoders for sequence models. CoRR, abs/1801.09797, 2018.
  • Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Sergey Levine, Vincent Vanhoucke, and Ken Goldberg (eds.), Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, pp. 195–206. PMLR, 13–15 Nov 2017.
  • Kacper Piotr Kielak. Do recent advancements in model-based deep reinforcement learning really improve data efficiency?, 2020. URL https://openreview.net/forum?id=Bke9u1HFwB.
  • Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=SJJinbWRZ.
  • Felix Leibfried, Nate Kushman, and Katja Hofmann. A deep learning approach for joint video frame and reward prediction in Atari games. CoRR, abs/1611.07078, 2016.
  • Felix Leibfried, Rasul Tutunov, Peter Vrancx, and Haitham Bou-Ammar. Model-based regularization for deep reinforcement learning with transcoder networks. arXiv preprint arXiv:1809.01906, 2018.
  • Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew J. Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. J. Artif. Intell. Res., 61:523–562, 2018. doi: 10.1613/jair.5699.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33nd International Conference on Machine Learning, ICML, pp. 1928–1937, 2016.
  • Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder P. Singh. Action-conditional video prediction using deep networks in atari games. In NIPS, pp. 2863–2871, 2015.
  • Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 6118–6128. Curran Associates, Inc., 2017.
  • Chris Paxton, Yotam Barnoy, Kapil D. Katyal, Raman Arora, and Gregory D. Hager. Visual robot task planning. In International Conference on Robotics and Automation, ICRA 2019, Montreal, QC, Canada, May 20-24, 2019, pp. 8832–8838. IEEE, 2019. doi: 10.1109/ICRA.2019.8793736.
  • A. J. Piergiovanni, Alan Wu, and Michael S. Ryoo. Learning real-world robot policies by dreaming. CoRR, abs/1805.07813, 2018.
  • Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Vecerík, Matteo Hessel, Rémi Munos, and Olivier Pietquin. Observe and look further: Achieving consistent performance on atari. CoRR, abs/1805.11593, 2018.
  • Oleh Rybkin, Karl Pertsch, Andrew Jaegle, Konstantinos G. Derpanis, and Kostas Daniilidis. Unsupervised learning of sensorimotor affordances by stochastic future prediction. CoRR, abs/1806.09655, 2018.
  • Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Trans. Autonomous Mental Development, 2(3):230–247, 2010.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
  • Shagun Sodhani, Anirudh Goyal, Tristan Deleu, Yoshua Bengio, Sergey Levine, and Jian Tang. Learning powerful policies by using consistent dynamics model. arXiv preprint arXiv:1906.04355, 2019.
  • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bull., 2(4):160–163, July 1991.
  • Richard S. Sutton and Andrew G. Barto. Reinforcement learning - an introduction, 2nd edition (work in progress). Adaptive computation and machine learning. MIT Press, 2017.
  • Pedro Tsividis, Thomas Pouncy, Jaqueline L. Xu, Joshua B. Tenenbaum, and Samuel J. Gershman. Human learning in atari. In 2017 AAAI Spring Symposia, Stanford University, Palo Alto, California, USA, March 27-29, 2017, 2017.
  • Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6306–6315, 2017.
  • Hado van Hasselt, Matteo Hessel, and John Aslanides. When to use parametric models in reinforcement learning? In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 14322–14333, 2019.
  • Arun Venkatraman, Roberto Capobianco, Lerrel Pinto, Martial Hebert, Daniele Nardi, and J. Andrew Bagnell. Improved learning of dynamics models for control. In International Symposium on Experimental Robotics, ISER 2016, Tokyo, Japan, October 3-6, 2016., pp. 703–713, 2016.
  • Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. CoRR, abs/1907.02057, 2019.
  • Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin A. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746–2754, 2015.