SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning.
Model-based reinforcement learning (RL) has proven to be a data efficient approach for learning control tasks but is difficult to utilize in domains with complex observations such as images. In this paper, we present a method for learning representations that are suitable for iterative model-based policy improvement, in that these repre…
- Model-based reinforcement learning (RL) methods use known or learned models in a variety of ways, such as planning through the model and generating synthetic experience (Sutton, 1990; Kober et al, 2013).
- Many model-based methods rely on accurate forward prediction for planning (Nagabandi et al, 2018; Chua et al, 2018); for image-based domains, this precludes the use of simple models, which would introduce significant modeling bias.
- The authors focus on removing the need for accurate forward prediction, using what they term local model methods.
- These methods use simple models, typically linear models, to provide gradient directions for local policy improvement, rather than for forward prediction and planning (Todorov & Li, 2005; Levine & Abbeel, 2014).
- Local model methods circumvent the need for accurate predictive models, but they cannot be directly applied to image-based tasks because image dynamics are highly non-linear, even locally.
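To make the local-model idea concrete, a model of the form x_{t+1} ≈ A x_t + B u_t + c can be fit to a batch of transitions by least squares. The sketch below is a generic illustration in NumPy, not the paper's implementation; all names are illustrative.

```python
import numpy as np

def fit_local_linear_model(states, actions, next_states):
    """Fit x_{t+1} ~= A x_t + B u_t + c by least squares.

    states:      (N, dx) array of visited states x_t
    actions:     (N, du) array of applied actions u_t
    next_states: (N, dx) array of resulting states x_{t+1}

    Returns (A, B, c). Illustrative sketch, not the paper's exact model fit
    (SOLAR fits time-varying linear-Gaussian dynamics with a Bayesian prior).
    """
    N = states.shape[0]
    dx, du = states.shape[1], actions.shape[1]
    # Stack [x_t, u_t, 1] so a single solve recovers A, B, and the offset c.
    X = np.hstack([states, actions, np.ones((N, 1))])
    # Least-squares solution of X @ W ~= next_states, solved column-wise.
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    A = W[:dx].T
    B = W[dx:dx + du].T
    c = W[-1]
    return A, B, c
```

Such a fit is only trusted near the data that produced it, which is why these methods use it for local policy improvement rather than long-horizon prediction.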
- For more complex domains, one of the main difficulties in applying model-based methods is modeling bias: if control or policy learning is performed against an imperfect model, performance in the real world will typically degrade with model inaccuracy (Deisenroth et al, 2014)
- We presented stochastic optimal control with latent representations (SOLAR), a model-based RL algorithm that is capable of learning policies in a data-efficient manner directly from raw high-dimensional image observations
- The key insights in SOLAR involve learning latent representations where simple models are more accurate and utilizing probabilistic graphical model (PGM) structure to infer dynamics from data conditioned on observed trajectories
- SOLAR is significantly more data-efficient compared to model-free RL methods, especially when transferring previously learned representations and models
- We show that SOLAR can learn complex real-world robotic manipulation tasks with only image observations in one to two hours of interaction time
- Methods based on LQR have enjoyed considerable success in a number of control domains, including learning tasks on real robotic systems (Todorov & Li, 2005; Levine et al, 2016).
- There is some work on lifting this restriction: for example, Watter et al (2015) and Banijamali et al (2018) combine LQR-based control with a representation learning scheme based on the variational auto-encoder (VAE; Kingma & Welling, 2014; Rezende et al, 2014) where images are encoded into a learned low-dimensional representation that is used for modeling and control
- They demonstrate success on learning several continuous control domains directly from pixel observations.
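As a generic illustration of the LQR machinery these methods build on, the following sketch computes finite-horizon feedback gains from known linear dynamics and quadratic costs via the Riccati recursion. It is a textbook sketch under simplifying assumptions (time-invariant dynamics, known costs), not the paper's exact controller, which operates on fitted time-varying models.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, horizon):
    """Finite-horizon discrete-time LQR via the Riccati recursion.

    Dynamics x_{t+1} = A x_t + B u_t, stage cost x'Qx + u'Ru.
    Returns time-varying feedback gains K_t with u_t = -K_t x_t.
    """
    P = Q.copy()           # value-function matrix at the terminal step
    gains = []
    for _ in range(horizon):
        # Riccati recursion, working backwards in time.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    gains.reverse()        # gains[0] corresponds to the first time step
    return gains
```

Running the resulting time-varying feedback law forward from an initial state drives the system toward the origin, which is the local policy improvement step that the latent-space variants perform in the learned representation.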
- Our experimental results demonstrate that SOLAR is competitive in sample efficiency while exhibiting superior final policy performance compared to other model-based methods.
- Utilizing representation learning within model-based RL has been studied in a number of previous works (Lesort et al, 2018), including using embeddings for state aggregation (Singh et al, 1994), dimensionality reduction (Nouri & Littman, 2010), self-organizing maps (Smith, 2002), value prediction (Oh et al, 2017), and deep auto-encoders (Lange & Riedmiller, 2010; Higgins et al, 2017). Among these works, deep spatial auto-encoders (DSAE; Finn et al, 2016) and embed to control (E2C; Watter et al, 2015; Banijamali et al, 2018) are the most closely related to our work, in that they consider local model methods combined with representation learning. The key difference in our work is that, rather than using a learning objective for reconstruction and forward prediction, our objective is more suited to local model methods: it directly encourages learning representations in which fitted local models accurately explain the observed data. We also do not assume a known cost function, goal state, or access to the underlying system state as in DSAE and E2C, making SOLAR applicable even when the underlying states and cost function are unknown.
Subsequent to our work, Hafner et al (2018) formulate a representation and model learning method for image-based continuous control tasks that is used in conjunction with model-predictive control (MPC), which plans H time steps ahead using the model, executes an action based on this plan, and then re-plans after receiving the next observation. We compare to a baseline that uses MPC in Section 7, and we empirically demonstrate the relative strengths of SOLAR and MPC, showing that SOLAR can overcome the short-horizon bias that afflicts MPC. We also compare to robust locally-linear controllable embedding (RCE; Banijamali et al, 2018), an improved version of E2C, and we find that our approach tends to produce better empirical results.
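For contrast with SOLAR's local-model updates, the MPC re-planning loop can be sketched with the simplest possible planner, random shooting. This is a deliberately simplified stand-in, not the planner of Hafner et al (who use a more sophisticated search); all names below are hypothetical.

```python
import numpy as np

def mpc_random_shooting(dynamics, cost, x0, horizon=10, n_candidates=500,
                        action_dim=1, rng=None):
    """Plan one action with random-shooting MPC (illustrative sketch).

    Samples candidate action sequences, rolls each through the (learned)
    `dynamics` model for `horizon` steps, sums `cost`, and returns the
    first action of the best sequence. In MPC only this first action is
    executed before re-planning; the fixed lookahead `horizon` is the
    source of the short-horizon bias discussed above.
    """
    rng = rng or np.random.default_rng()
    best_cost, best_first_action = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        x, total = x0, 0.0
        for u in actions:
            x = dynamics(x, u)
            total += cost(x, u)
        if total < best_cost:
            best_cost, best_first_action = total, actions[0]
    return best_first_action
```

Calling this planner once per environment step, executing only the returned first action, and re-planning from the next observation reproduces the receding-horizon behavior described above.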
- MZ is supported by an NDSEG fellowship
- SV is supported by NSF grant CNS1446912
- This work was supported by the NSF, through IIS-1614653, and computational resource donations from Amazon
Illustrations of the environments we test on in the top row, with example image observations in the bottom row. Left to right: a trajectory in the nonholonomic car environment, with the target denoted by the black dot; an illustration of the 2-DoF reacher environment, with the target denoted by the red dot; the different block-stacking tasks we test, where the rightmost task is the most difficult because the policy must first learn to lift the yellow block before stacking it; a depiction of our pushing setup, where a human provides the sparse reward indicating whether the robot successfully pushed the mug onto the coaster. Full-size versions of these plots are available on the project website. (a): Our method, the MPC baseline, and the VAE ablation consistently solve 2D navigation with a randomized goal, whereas RCE is unable to make progress. The final performance of PPO is plotted as the dashed line, though PPO requires 1000 times more samples than our method to reach this performance. (b): On the nonholonomic car, both our method and the MPC baseline reach the goal, though the VAE ablation is less consistent across seeds and RCE is again unsuccessful. PPO requires over 25 times more episodes than our method to learn a successful policy. (c): On reacher, we perform worse than PPO but use about 40 times fewer episodes. RCE fails to learn at all, and the VAE ablation and MPC baseline are noticeably worse than our method. Here we plot reward, so higher is better. Our method consistently solves all block-stacking tasks. The MPC baseline learns very quickly on the two easier tasks since it can plan through the pretrained model; however, due to short-horizon planning, it performs significantly worse on the hard task on the right, where the block starts on the table. The VAE ablation performs well on the easiest task in the middle but is unsuccessful on the two harder tasks. DVF makes progress on each task but is ultimately not as data-efficient as SOLAR. The black solid line at 0.02m denotes successful stacking.
- Agrawal, P., Nair, A., Abbeel, P., Malik, J., and Levine, S. Learning to poke by poking: Experiential learning of intuitive physics. In NIPS, 2016.
- Banijamali, E., Shu, R., Ghavamzadeh, M., Bui, H., and Ghodsi, A. Robust locally-linear controllable embedding. In AISTATS, 2018.
- Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
- Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI gym. arXiv preprint arXiv:1606.01540, 2016.
- Camacho, E. and Alba, C. Model Predictive Control. Springer Science and Business Media, 2013.
- Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., and Levine, S. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In ICML, 2017.
- Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NIPS, 2018.
- Deisenroth, M., Fox, D., and Rasmussen, C. Gaussian processes for data-efficient learning in robotics and control. PAMI, 2014.
- Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A., and Levine, S. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.
- Feinberg, V., Wan, A., Stoica, I., Jordan, M., Gonzalez, J., and Levine, S. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
- Finn, C. and Levine, S. Deep visual foresight for planning robot motion. In ICRA, 2017.
- Finn, C., Tan, X., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In ICRA, 2016.
- Fu, J., Singh, A., Ghosh, D., Yang, L., and Levine, S. Variational inverse control with events: A general framework for data-driven reward definition. In NIPS, 2018.
- Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In ICML, 2018.
- Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actorcritic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
- Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
- Higgins, I., Pal, A., Rusu, A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. DARLA: Improving zero-shot transfer in reinforcement learning. In ICML, 2017.
- Hoffman, M., Blei, D., Wang, C., and Paisley, J. Stochastic variational inference. JMLR, 2013.
- Jacobson, D. and Mayne, D. Differential Dynamic Programming. American Elsevier, 1970.
- Johnson, M., Duvenaud, D., Wiltschko, A., Datta, S., and Adams, R. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, 2016.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
- Kingma, D. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.
- Kober, J., Bagnell, J., and Peters, J. Reinforcement learning in robotics: A survey. IJRR, 2013.
- Lange, S. and Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. In IJCNN, 2010.
- Lesort, T., Díaz-Rodríguez, N., Goudou, J., and Filliat, D. State representation learning for control: An overview. Neural Networks, 2018.
- Levine, S. and Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. In NIPS, 2014.
- Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. JMLR, 2016.
- Moldovan, T., Levine, S., Jordan, M., and Abbeel, P. Optimism-driven exploration for nonlinear systems. In ICRA, 2015.
- Nagabandi, A., Kahn, G., Fearing, R., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, 2018.
- Nouri, A. and Littman, M. Dimension reduction and its application to model-based exploration in continuous spaces. Machine Learning, 2010.
- Oh, J., Singh, S., and Lee, H. Value prediction network. In NIPS, 2017.
- Pinto, L. and Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In ICRA, 2016.
- Rezende, D., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Singh, S., Jaakkola, T., and Jordan, M. Reinforcement learning with soft state aggregation. In NIPS, 1994.
- Smith, A. Applications of the self-organizing map to reinforcement learning. Neural Networks, 2002.
- Sutton, R. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990.
- Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors. In IROS, 2012.
- Todorov, E. and Li, W. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In ACC, 2005.
- Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, 2015.
- Winn, J. and Bishop, C. Variational message passing. JMLR, 2005.