Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning

Mohammad Babaeizadeh
Mohammad Taghi Saffar
Danijar Hafner
Harini Kannan

Abstract:

Model-based reinforcement learning (MBRL) methods have shown strong sample efficiency and performance across a variety of tasks, including when faced with high-dimensional visual observations. These methods learn to predict the environment dynamics and expected reward from interaction and use this predictive model to plan and perform the task [...]

Introduction
  • The key component of any model-based reinforcement learning (MBRL) method is the predictive model.
  • In visual MBRL, this model predicts the future observations that will result from taking different actions, enabling the agent to select the actions that will lead to the most desirable outcomes
  • These features enable MBRL agents to perform successfully with high data efficiency (Deisenroth & Rasmussen, 2011) in many tasks, ranging from healthcare (Steyerberg et al., 2019) to robotics (Ebert et al., 2018) and playing board games (Schrittwieser et al., 2019).
  • Some methods model the dynamics of the environment in the latent space (Hafner et al., 2018), while others model autoregressive dynamics directly in the observation space (Kaiser et al., 2020); a minimal sketch of the two interfaces follows this list
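The distinction above can be made concrete with a small sketch. This is not the paper's code; the class and method names below are hypothetical and only illustrate the two interfaces: a latent-space model (as in PlaNet) never decodes pixels while planning, whereas an observation-space model (as in SV2P) feeds its own predicted frames back in autoregressively.

```python
# Minimal sketch (not the paper's code) of the two dynamics-model interfaces.
import numpy as np


class LatentDynamicsModel:
    """PlaNet-style model: transitions happen in a learned latent space."""

    def encode(self, image: np.ndarray) -> np.ndarray:
        # Map a raw observation (e.g. 64x64x3 pixels) to a compact latent state.
        raise NotImplementedError

    def step(self, latent: np.ndarray, action: np.ndarray) -> np.ndarray:
        # Predict the next latent state; pixels are never generated during planning.
        raise NotImplementedError

    def reward(self, latent: np.ndarray) -> float:
        raise NotImplementedError


class PixelDynamicsModel:
    """SV2P-style model: transitions are predicted directly in observation space."""

    def step(self, image: np.ndarray, action: np.ndarray) -> np.ndarray:
        # Autoregressively predict the next frame; the predicted frame is fed
        # back as the input for the following step.
        raise NotImplementedError

    def reward(self, image: np.ndarray) -> float:
        raise NotImplementedError
```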
Highlights
  • The key component of any model-based reinforcement learning (MBRL) method is the predictive model
  • We find a strong correlation between image prediction accuracy and downstream task performance, suggesting video prediction is likely a fruitful area of research for improving visual MBRL
  • To isolate the design decisions pertaining to only the predictive model, we focus on MBRL methods that only learn a model, and plan through that model to select actions (Chua et al., 2018; Zhang et al., 2018; Hafner et al., 2018)
  • We take the first steps towards analyzing the importance of model quality in visual MBRL and how various design decisions affect the overall performance of the agent
  • We provide empirical evidence that predicting images can substantially improve task performance over only predicting the expected reward
  • One of our key findings is that predicting images improves the performance of the agent, suggesting a way to improve the task performance of these methods
  • Our results suggest that building the most powerful and accurate models may not necessarily be the right choice for attaining good exploration when learning from scratch. This result can be turned around to imply that the models that perform best on benchmark tasks that require exploration may not be the most accurate models. This has considerable implications for MBRL methods that learn from previously collected offline data (Fujimoto et al., 2019; Wu et al., 2019; Agarwal et al., 2020) – a setting that is common in real-world applications, e.g. in robotics (Finn & Levine, 2017)
Methods
  • Studying the effects of each design decision independently is difficult due to the complex interactions between the components of MBRL (Sutton & Barto, 2018).
  • To isolate the design decisions pertaining to only the predictive model, the authors focus on MBRL methods that only learn a model, and plan through that model to select actions (Chua et al., 2018; Zhang et al., 2018; Hafner et al., 2018).
  • They note that, given the limited planning horizon of CEM, the Oracle model is not necessarily optimal in tasks with sparse and far-reaching rewards (a minimal planning sketch follows this list)
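The planner referenced above is the cross-entropy method (CEM), which searches for an action sequence that maximizes the return predicted by the learned model. Below is a minimal sketch under stated assumptions: `model.rollout_return` is a hypothetical function returning the predicted cumulative reward of a candidate action sequence, and the hyper-parameter values are illustrative rather than the exact ones from Table 10.

```python
# Hedged sketch of a CEM planner used in a model-predictive-control loop.
import numpy as np


def cem_plan(model, state, action_dim, horizon=12, num_candidates=128,
             num_elites=10, iterations=10):
    """Search for an action sequence that maximizes the model's predicted return."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences from the current search distribution.
        noise = np.random.randn(num_candidates, horizon, action_dim)
        candidates = mean + std * noise
        # Score each candidate by rolling it out through the learned model.
        returns = np.array([model.rollout_return(state, seq) for seq in candidates])
        # Refit the distribution to the best-scoring (elite) candidates.
        elites = candidates[np.argsort(returns)[-num_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action, then replan
```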
Results
  • One of the key findings is that predicting images improves the performance of the agent, suggesting a way to improve the task performance of these methods.
Conclusion
  • The authors take the first steps towards analyzing the importance of model quality in visual MBRL and how various design decisions affect the overall performance of the agent.
  • This result can be turned around to imply that models that perform best on benchmark tasks which require exploration may not be the most accurate models
  • This has considerable implications for MBRL methods that learn from previously collected offline data (Fujimoto et al., 2019; Wu et al., 2019; Agarwal et al., 2020) – a setting that is common in real-world applications, e.g. in robotics (Finn & Levine, 2017)
Summary
  • Introduction:

    The key component of any model-based reinforcement learning (MBRL) method is the predictive model.
  • In visual MBRL, this model predicts the future observations that will result from taking different actions, enabling the agent to select the actions that will lead to the most desirable outcomes
  • These features enable MBRL agents to perform successfully with high data efficiency (Deisenroth & Rasmussen, 2011) in many tasks, ranging from healthcare (Steyerberg et al., 2019) to robotics (Ebert et al., 2018) and playing board games (Schrittwieser et al., 2019).
  • Some methods model the dynamics of the environment in the latent space (Hafner et al., 2018), while others model autoregressive dynamics directly in the observation space (Kaiser et al., 2020)
  • Objectives:

    The goal of this paper is to understand the trade-offs between the design choices of model-based agents.
  • The authors' goal is to analyze the design trade-offs in the models themselves, decoupling this as much as possible from the confounding differences in the algorithm.
  • Since the goal is to analyze the design trade-offs in the models themselves, the authors focus the analysis only on MBRL methods which train a model and use it for planning, rather than MBRL methods that learn policies or value functions
  • Methods:

    Studying the effects of each design decision independently is difficult due to the complex interactions between the components of MBRL (Sutton & Barto, 2018).
  • To isolate the design decisions pertaining to only the predictive model, the authors focus on MBRL methods that only learn a model, and plan through that model to select actions (Chua et al., 2018; Zhang et al., 2018; Hafner et al., 2018).
  • They note that, given the limited planning horizon of CEM, the Oracle model is not necessarily optimal in tasks with sparse and far-reaching rewards
  • Results:

    One of the key findings is that predicting images improves the performance of the agent, suggesting a way to improve the task performance of these methods.
  • Conclusion:

    The authors take the first steps towards analyzing the importance of model quality in visual MBRL and how various design decisions affect the overall performance of the agent.
  • This result can be turned around to imply that models that perform best on benchmark tasks which require exploration may not be the most accurate models
  • This has considerable implications for MBRL methods that learn from previously collected offline data (Fujimoto et al., 2019; Wu et al., 2019; Agarwal et al., 2020) – a setting that is common in real-world applications, e.g. in robotics (Finn & Levine, 2017)
Tables
  • Table1: The asymptotic performance of various model designs in the offline setting. In this setting there is no exploration and the training dataset is pre-collected and fixed for all the models. The reported numbers are the 90th percentile out of 100 evaluation episodes, averaged across three different runs. This table indicates a significant performance improvement from predicting images across almost all tasks. Moreover, there is a meaningful difference between the numbers in this table and Figure 2, signifying the importance of exploration in the online setting. Please note how some of the best-performing models in this table performed poorly in Figure 2
  • Table2: Median reward prediction error (LR) of each model across all of the trajectories in the evaluation partition of the offline dataset. This table demonstrates generally better task performance for more accurate models in the offline setting, when compared with Table 1. The last row reports the Pearson correlation coefficient between the reward prediction error and the asymptotic performance for each task across models. This row demonstrates the strong correlation between reward prediction error (LR) and task performance (S) in the absence of exploration. In cases in which all models are close to the maximum possible score of 1000 (such as ball in cup catch), the correlation can be misleading because better predictions no longer help the model
  • Table3: Pearson correlation coefficient ρ between image prediction error LO, reward prediction error LR and asymptotic score S. To calculate the correlation, we scaled down OT OR and LT LR at multiple levels to limit their modeling capacity and thereby potentially increase their prediction error. μS and σS are the average and the standard deviation of asymptotic performances across different scales of each model. In cases with a low standard deviation of the scores (such as ball in cup catch), meaning all versions of the models performed more or less the same, the correlation can be misleading. This table demonstrates the strong correlation between image prediction error and task performance (a sketch of how such a coefficient is computed follows this list)
  • Table4: Summary of possible model designs based on whether or not they predict the future observations. All of the models predict the expected reward conditioned on previous observations, rewards and future actions. Moreover, the top four methods predict the future observations as well. These methods can model the transition function and reward function either in the latent space or in the observation space. Another design choice for these models is whether or not to share the learned latent space between reward and observation prediction
  • Table5: The asymptotic performance of various model designs in the online and offline settings and their differences. For the online setting, the reported numbers are the average (and standard deviation) across three runs after training. For the offline setting, the reported numbers are the same as in Table 1, rounded up, ± their standard deviation across three runs. This table demonstrates a significant performance improvement from predicting images across almost all tasks. Moreover, there is a meaningful difference between the results for the online and the offline settings, signifying the importance of exploration. Please note how some of the best-performing models in the offline setting perform poorly in the online setting and vice versa. This is clear from the bottom section of the table, which includes the absolute difference of the offline and online scores
  • Table6: Wall-clock time of the training and inference step of each model. The numbers are in seconds. Model R, which only predicts the reward, is the fastest model, while the models with pixel-space dynamics are the slowest. Please note that R is a much smaller model compared to the other ones since it does not have an image decoder
  • Table7: Cost-normalized scores for each model. These numbers are the online score achieved by each model divided by the inference cost (time) of the corresponding model. As expected, faster models – i.e. R and LT LR, which do not predict the images at inference time – get a big advantage. This table clearly shows that LT LR is generally a good design choice if there is no specific reason for modeling the dynamics or reward function in the pixel space
  • Table8: The architecture of Rrecurrent
  • Table9: The architecture of Rconv
  • Table10: The hyper-parameters used for CEM. We used the same planning algorithm across all models and tasks
  • Table11: The hyper-parameter values used for the SV2P (Babaeizadeh et al., 2018; Finn et al., 2016a) model (used in OT OR and OT LR). The number of ConvLSTM filters can be found in Table 14. The rest of the parameters are the same as in Babaeizadeh et al. (2018)
  • Table12: The hyper-parameter values used for the PlaNet (Hafner et al., 2018) model (used in LT OR and LT LR). The rest of the parameters are the same as in Hafner et al. (2018). It is worth mentioning that although we used PlaNet as LT LR with no changes, our results differ from Hafner et al. (2018): we used 128 trajectories instead of 1K in CEM (Table 10), as well as a training horizon of 12 instead of 50 (Table 12). We made these changes to keep planning consistent across our models; the other models are slower than LT LR to train and explore (Table 6), which made it infeasible to use them with the original planning and training horizon of Hafner et al. (2018). Please see Section B.4 for our approximate compute cost
  • Table13: The hyper-parameter values used for Rrecurrent (used as R). The hyper-parameters for the planner are the same as for the other models (Table 10)
  • Table14: The downsized version of OT OR. We down-scaled the model by reducing the number of ConvLSTM filters, limiting its modeling capacity and thereby potentially increasing its prediction error. The detailed architecture of the model and its layers can be found in Finn et al. (2016a) and Babaeizadeh et al. (2018)
  • Table15: The downsized version of LT LR. We down-scaled the model by reducing the number of units in the fully connected paths, limiting its modeling capacity and thereby potentially increasing its prediction error. The detailed architecture of the model and its layers can be found in Hafner et al. (2018)
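As a concrete illustration of the correlation analysis referenced in Tables 2 and 3, the sketch below computes a Pearson coefficient between per-model prediction errors and asymptotic scores. The numbers are made up for illustration and are not the paper's measurements; as the captions note, when all models sit near the score ceiling (low σS), the coefficient is uninformative.

```python
# Hedged sketch: Pearson correlation between prediction error and task score.
import numpy as np


def pearson(x, y):
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))


# Hypothetical per-model numbers, for illustration only.
reward_prediction_error = [0.02, 0.05, 0.11, 0.20]  # lower is better
asymptotic_score = [930.0, 870.0, 640.0, 410.0]     # higher is better

rho = pearson(reward_prediction_error, asymptotic_score)
print(rho)  # close to -1: larger prediction errors go with lower scores
# Equivalent shortcut: np.corrcoef(reward_prediction_error, asymptotic_score)[0, 1]
```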
Related work
Reference
  • Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020.
  • Artemij Amiranashvili, Alexey Dosovitskiy, Vladlen Koltun, and Thomas Brox. Motion perception in reinforcement learning with dynamic objects. arXiv preprint arXiv:1901.03162, 2019.
  • Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. International Conference on Learning Representations (ICLR), 2018.
  • Ershad Banijamali, Rui Shu, Mohammad Ghavamzadeh, Hung Bui, and Ali Ghodsi. Robust locally-linear controllable embedding. arXiv preprint arXiv:1710.05373, 2017.
  • Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765, 2018.
  • Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019.
  • Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Neural Information Processing Systems (NIPS), 2016.
  • Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011.
  • Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2013.
  • Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.
  • Emily L Denton et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pp. 4414–4423, 2017.
  • Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. arXiv preprint arXiv:1611.01779, 2016.
  • Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.
  • Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793. IEEE, 2017.
  • Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pp. 64–72, 2016a.
  • Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 512–519. IEEE, 2016b.
  • Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062, 2019.
  • Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G Bellemare. DeepMDP: Learning continuous latent space models for representation learning. arXiv preprint arXiv:1906.02736, 2019.
  • David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
  • Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
  • Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.
  • Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems, pp. 517–526, 2018.
  • Gregory Kahn, Adam Villaflor, Bosen Ding, Pieter Abbeel, and Sergey Levine. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. IEEE, 2018.
  • Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for Atari. International Conference on Learning Representations (ICLR), 2020.
  • Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. International Conference on Machine Learning (ICML), 2017.
  • Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.
  • Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019.
  • Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
  • Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Ziwei Liu, Raymond Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. International Conference on Computer Vision (ICCV), 2017.
  • Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, pp. 6118–6128, 2017.
  • Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning, pp. 2701–2710, 2017.
  • Sebastien Racaniere, Theophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomenech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5690–5701, 2017.
  • Aniruddh Raghu, Matthieu Komorowski, and Sumeetpal Singh. Model-based reinforcement learning for sepsis treatment. arXiv preprint arXiv:1811.09602, 2018.
  • Scott E. Reed, Aaron van den Oord, Nal Kalchbrenner, Sergio Gomez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, and Nando de Freitas. Parallel multiscale autoregressive density estimation. International Conference on Machine Learning (ICML), 2017.
  • Reuven Y Rubinstein. Optimization of computer simulation models with rare events. European Journal of Operational Research, 99(1):89–112, 1997.
  • Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. arXiv preprint arXiv:2005.05960, 2020.