SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning.

ICML 2019, pp. 7444–7453

Abstract

Model-based reinforcement learning (RL) has proven to be a data-efficient approach for learning control tasks but is difficult to utilize in domains with complex observations such as images. In this paper, we present a method for learning representations that are suitable for iterative model-based policy improvement, in that these representations are optimized so that simple, local models can accurately explain the data collected under the current policy, allowing LQR-based local model methods to be applied directly to image observations.

Introduction
  • Model-based reinforcement learning (RL) methods use known or learned models in a variety of ways, such as planning through the model and generating synthetic experience (Sutton, 1990; Kober et al, 2013).
  • Many model-based methods rely on accurate forward prediction for planning (Nagabandi et al, 2018; Chua et al, 2018), and for image-based domains this precludes the use of simple models, which would introduce significant modeling bias.
  • The authors focus on removing the need for accurate forward prediction, using what they term local model methods.
  • These methods use simple models, typically linear models, to provide gradient directions for local policy improvement, rather than for forward prediction and planning (Todorov & Li, 2005; Levine & Abbeel, 2014).
  • Local model methods circumvent the need for accurate predictive models, but they cannot be directly applied to image-based tasks because image dynamics, even locally, are highly non-linear; a minimal sketch of fitting such a local model follows below.
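To make the local-model idea concrete, below is a minimal sketch (with assumed array shapes and function names, not the paper's implementation) of fitting a time-varying linear dynamics model z_{t+1} ≈ A_t z_t + B_t u_t + c_t to a batch of trajectories by least squares, the kind of simple model these methods rely on.

```python
import numpy as np

def fit_local_linear_dynamics(states, actions):
    """Fit z_{t+1} ~= A_t z_t + B_t u_t + c_t per time step by least squares.

    states:  (N, T+1, dz) array of latent states from N trajectories
    actions: (N, T, du) array of corresponding actions
    Returns a list of (A_t, B_t, c_t) tuples for t = 0, ..., T-1.
    """
    N, T_plus_1, dz = states.shape
    du = actions.shape[-1]
    dynamics = []
    for t in range(T_plus_1 - 1):
        # Regress the next state on [state, action, 1] across the N trajectories.
        X = np.concatenate([states[:, t], actions[:, t], np.ones((N, 1))], axis=1)
        Y = states[:, t + 1]
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # shape (dz + du + 1, dz)
        A_t, B_t, c_t = W[:dz].T, W[dz:dz + du].T, W[-1]
        dynamics.append((A_t, B_t, c_t))
    return dynamics
```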
Highlights
  • Model-based reinforcement learning (RL) methods use known or learned models in a variety of ways, such as planning through the model and generating synthetic experience (Sutton, 1990; Kober et al, 2013).
  • For more complex domains, one of the main difficulties in applying model-based methods is modeling bias: if control or policy learning is performed against an imperfect model, performance in the real world will typically degrade with model inaccuracy (Deisenroth et al, 2014).
  • We presented stochastic optimal control with latent representations (SOLAR), a model-based RL algorithm that is capable of learning policies in a data-efficient manner directly from raw high-dimensional image observations.
  • The key insights in SOLAR involve learning latent representations where simple models are more accurate and utilizing probabilistic graphical model (PGM) structure to infer dynamics from data conditioned on observed trajectories (a simplified sketch of this kind of conditioning follows this list).
  • SOLAR is significantly more data-efficient than model-free RL methods, especially when transferring previously learned representations and models.
  • We show that SOLAR can learn complex real-world robotic manipulation tasks with only image observations in one to two hours of interaction time.
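As a loose illustration of inferring dynamics conditioned on observed trajectories (a simplification of the paper's structured PGM, with assumed hyperparameters and names), the sketch below computes a Gaussian posterior over a linear dynamics map via standard Bayesian linear regression.

```python
import numpy as np

def posterior_linear_dynamics(Z, U, Z_next, prior_precision=1.0, noise_var=0.1):
    """Gaussian posterior over the stacked dynamics matrix W = [A B c].

    Z, U, Z_next: (N, dz), (N, du), (N, dz) arrays of observed transitions.
    Assumes a zero-mean isotropic Gaussian prior on W and Gaussian noise;
    the result is ordinary Bayesian ridge regression, not the paper's model.
    """
    X = np.concatenate([Z, U, np.ones((Z.shape[0], 1))], axis=1)  # (N, dz+du+1)
    precision = prior_precision * np.eye(X.shape[1]) + X.T @ X / noise_var
    cov = np.linalg.inv(precision)            # posterior covariance
    mean = cov @ X.T @ Z_next / noise_var     # (dz+du+1, dz) posterior mean of W
    return mean, cov
```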
Methods
  • Methods based on LQR have enjoyed considerable success in a number of control domains, including learning tasks on real robotic systems (Todorov & Li, 2005; Levine et al, 2016).
  • There is some work on lifting this restriction: for example, Watter et al (2015) and Banijamali et al (2018) combine LQR-based control with a representation learning scheme based on the variational auto-encoder (VAE; Kingma & Welling, 2014; Rezende et al, 2014), where images are encoded into a learned low-dimensional representation that is used for modeling and control.
  • They demonstrate success at learning several continuous control domains directly from pixel observations; a generic LQR backward pass of the kind applied in such latent spaces is sketched below.
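For reference, the following is a textbook finite-horizon LQR backward recursion of the kind such methods run on top of a learned latent representation; it assumes time-varying linear dynamics and fixed quadratic costs, and is not the paper's full iLQG controller.

```python
import numpy as np

def lqr_backward_pass(dynamics, Q, R):
    """Compute feedback gains K_t (so that u_t = K_t z_t) by dynamic programming.

    dynamics: list of (A_t, B_t) pairs, one per time step
    Q, R:     quadratic state and action cost matrices (assumed constant here)
    """
    V = Q.copy()                 # value-function Hessian at the final step
    gains = [None] * len(dynamics)
    for t in reversed(range(len(dynamics))):
        A, B = dynamics[t]
        Quu = R + B.T @ V @ B
        Qux = B.T @ V @ A
        K = -np.linalg.solve(Quu, Qux)         # optimal linear feedback gain
        gains[t] = K
        V = Q + A.T @ V @ A + A.T @ V @ B @ K  # Riccati recursion
    return gains
```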
Conclusion
  • The authors presented SOLAR, a model-based RL algorithm that is capable of learning policies in a data-efficient manner directly from raw high-dimensional image observations.
  • The key insights in SOLAR involve learning latent representations where simple models are more accurate and utilizing PGM structure to infer dynamics from data conditioned on observed trajectories.
  • The authors' experimental results demonstrate that SOLAR is competitive in sample efficiency with other model-based methods while exhibiting superior final policy performance.
  • SOLAR is significantly more data-efficient than model-free RL methods, especially when transferring previously learned representations and models.
  • The authors show that SOLAR can learn complex real-world robotic manipulation tasks with only image observations in one to two hours of interaction time.
Related Work
  • Utilizing representation learning within model-based RL has been studied in a number of previous works (Lesort et al, 2018), including using embeddings for state aggregation (Singh et al, 1994), dimensionality reduction (Nouri & Littman, 2010), self-organizing maps (Smith, 2002), value prediction (Oh et al, 2017), and deep auto-encoders (Lange & Riedmiller, 2010; Higgins et al, 2017). Among these works, deep spatial auto-encoders (DSAE; Finn et al, 2016) and embed to control (E2C; Watter et al, 2015; Banijamali et al, 2018) are the most closely related to our work, in that they consider local model methods combined with representation learning. The key difference in our work is that, rather than using a learning objective for reconstruction and forward prediction, our objective is more suited for local model methods by directly encouraging learning representations where fitting local models accurately explains the observed data. We also do not assume a known cost function, goal state, or access to the underlying system state as in DSAE and E2C, making SOLAR applicable even when the underlying states and cost function are unknown.

    Subsequent to our work, Hafner et al (2018) formulate a representation and model learning method for image-based continuous control tasks that is used in conjunction with model-predictive control (MPC), which plans H time steps ahead using the model, executes an action based on this plan, and then re-plans after receiving the next observation. We compare to a baseline that uses MPC in Section 7, and we empirically demonstrate the relative strengths of SOLAR and MPC, showing that SOLAR can overcome the short-horizon bias that afflicts MPC. We also compare to robust locally-linear controllable embedding (RCE; Banijamali et al, 2018), an improved version of E2C, and we find that our approach tends to produce better empirical results.
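To make the MPC baseline described above concrete, here is a minimal random-shooting MPC step: sample candidate action sequences, roll them out for the planning horizon through a learned model, execute only the first action of the best sequence, and re-plan at the next observation. The dynamics_fn and cost_fn interfaces are hypothetical placeholders for illustration, not an API from the paper.

```python
import numpy as np

def mpc_action(state, dynamics_fn, cost_fn, horizon=10, n_candidates=1000, action_dim=2):
    """One step of random-shooting MPC with a (batched) learned dynamics model."""
    # Sample candidate open-loop action sequences uniformly in [-1, 1].
    candidates = np.random.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    total_cost = np.zeros(n_candidates)
    states = np.repeat(state[None], n_candidates, axis=0)
    for t in range(horizon):
        states = dynamics_fn(states, candidates[:, t])   # predict next states
        total_cost += cost_fn(states, candidates[:, t])  # accumulate predicted cost
    best = np.argmin(total_cost)
    return candidates[best, 0]  # execute only the first action, then re-plan
```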
Funding
  • MZ is supported by an NDSEG fellowship
  • SV is supported by NSF grant CNS1446912
  • This work was supported by the NSF through IIS-1614653 and by computational resource donations from Amazon
Study Subjects and Analysis
Environments: illustrations of the environments we test on (top row) with example image observations (bottom row). Left to right: a visualized trajectory in the nonholonomic car environment, with the target denoted by the black dot; the 2-DoF reacher environment, with the target denoted by the red dot; the block stacking tasks, where the rightmost task is the most difficult because the policy must first learn to lift the yellow block before stacking it; our pushing setup, where a human provides the sparse reward that indicates whether the robot successfully pushed the mug onto the coaster. Full-size versions of these plots are available on the project website.

(a) Our method, the MPC baseline, and the VAE ablation consistently solve 2D navigation with a randomized goal, whereas RCE is unable to make progress. The final performance of PPO is plotted as the dashed line, though PPO requires 1000 times more samples than our method to reach this performance. (b) On the nonholonomic car, both our method and the MPC baseline reach the goal, though the VAE ablation is less consistent across seeds and RCE is again unsuccessful. PPO requires over 25 times more episodes than our method to learn a successful policy. (c) On reacher, we perform worse than PPO but use about 40 times fewer episodes. RCE fails to learn at all, and the VAE ablation and MPC baseline are noticeably worse than our method. Here we plot reward, so higher is better.

Block stacking: our method consistently solves all block stacking tasks. The MPC baseline learns very quickly on the two easier tasks since it can plan through the pretrained model; however, due to its short-horizon planning, it performs significantly worse on the hard task on the right, where the block starts on the table. The VAE ablation performs well on the easiest task in the middle but is unsuccessful on the two harder tasks. DVF makes progress on each task but ultimately is not as data-efficient as SOLAR. The black solid line at 0.02 m denotes successful stacking.

References
  • Agrawal, P., Nair, A., Abbeel, P., Malik, J., and Levine, S. Learning to poke by poking: Experiential learning of intuitive physics. In NIPS, 2016.
  • Banijamali, E., Shu, R., Ghavamzadeh, M., Bui, H., and Ghodsi, A. Robust locally-linear controllable embedding. In AISTATS, 2018.
  • Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
  • Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Camacho, E. and Alba, C. Model Predictive Control. Springer Science and Business Media, 2013.
  • Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., and Levine, S. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In ICML, 2017.
  • Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NIPS, 2018.
  • Deisenroth, M., Fox, D., and Rasmussen, C. Gaussian processes for data-efficient learning in robotics and control. PAMI, 2014.
  • Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A., and Levine, S. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.
  • Feinberg, V., Wan, A., Stoica, I., Jordan, M., Gonzalez, J., and Levine, S. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
  • Finn, C. and Levine, S. Deep visual foresight for planning robot motion. In ICRA, 2017.
  • Finn, C., Tan, X., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In ICRA, 2016.
  • Fu, J., Singh, A., Ghosh, D., Yang, L., and Levine, S. Variational inverse control with events: A general framework for data-driven reward definition. In NIPS, 2018.
  • Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In ICML, 2018.
  • Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
  • Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
  • Higgins, I., Pal, A., Rusu, A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. DARLA: Improving zero-shot transfer in reinforcement learning. In ICML, 2017.
  • Hoffman, M., Blei, D., Wang, C., and Paisley, J. Stochastic variational inference. JMLR, 2013.
  • Jacobson, D. and Mayne, D. Differential Dynamic Programming. American Elsevier, 1970.
  • Johnson, M., Duvenaud, D., Wiltschko, A., Datta, S., and Adams, R. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, 2016.
  • Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kingma, D. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.
  • Kober, J., Bagnell, J., and Peters, J. Reinforcement learning in robotics: A survey. IJRR, 2013.
  • Lange, S. and Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. In IJCNN, 2010.
  • Lesort, T., Díaz-Rodríguez, N., Goudou, J., and Filliat, D. State representation learning for control: An overview. Neural Networks, 2018.
  • Levine, S. and Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. In NIPS, 2014.
  • Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. JMLR, 2016.
  • Moldovan, T., Levine, S., Jordan, M., and Abbeel, P. Optimism-driven exploration for nonlinear systems. In ICRA, 2015.
  • Nagabandi, A., Kahn, G., Fearing, R., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, 2018.
  • Nouri, A. and Littman, M. Dimension reduction and its application to model-based exploration in continuous spaces. Machine Learning, 2010.
  • Oh, J., Singh, S., and Lee, H. Value prediction network. In NIPS, 2017.
  • Pinto, L. and Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In ICRA, 2016.
  • Rezende, D., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Singh, S., Jaakkola, T., and Jordan, M. Reinforcement learning with soft state aggregation. In NIPS, 1994.
  • Smith, A. Applications of the self-organizing map to reinforcement learning. Neural Networks, 2002.
  • Sutton, R. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990.
  • Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors. In IROS, 2012.
  • Todorov, E. and Li, W. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In ACC, 2005.
  • Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, 2015.
  • Winn, J. and Bishop, C. Variational message passing. JMLR, 2005.