# Model-based Policy Optimization with Unsupervised Model Adaptation

NeurIPS 2020

Abstract

Model-based reinforcement learning methods learn a dynamics model from real data sampled from the environment and leverage it to generate simulated data for deriving an agent. However, due to the potential distribution mismatch between simulated data and real data, this could lead to degraded performance. Despite much effort being devoted ...

Introduction

- Model-free reinforcement learning (MFRL) has achieved tremendous success on a wide range of simulated domains, e.g., video games [Mnih et al., 2015] and complex robotic tasks [Haarnoja et al., 2018]
- Even with a high-capacity model, model error persists due to the potential distribution mismatch between the training and generating phases, i.e., the state-action input distribution used to train the model differs from the one generated by the model [Talvitie, 2014]
- Because of this gap, the learned model may give inaccurate predictions
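
The compounding effect of this mismatch can be illustrated with a toy example (hypothetical 1-D dynamics, not from the paper): even a small one-step model bias makes simulated rollouts drift further from real trajectories as the horizon grows.

```python
def true_step(s, a):
    # Ground-truth dynamics of a hypothetical 1-D toy system (not from the paper).
    return 0.9 * s + a

def model_step(s, a):
    # Learned model with a small bias (0.92 instead of 0.9) in the state coefficient.
    return 0.92 * s + a

def rollout_gap(horizon, s0=1.0, a=1.0):
    """Absolute gap between a model rollout and the true trajectory
    after `horizon` steps under a constant action."""
    s_true = s_model = s0
    for _ in range(horizon):
        s_true = true_step(s_true, a)
        s_model = model_step(s_model, a)
    return abs(s_model - s_true)

gaps = [rollout_gap(h) for h in (1, 5, 20)]
# The gap grows with rollout length: long simulated rollouts visit
# state-action inputs increasingly unlike those the model was trained on.
```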

Highlights

- In recent years, model-free reinforcement learning (MFRL) has achieved tremendous success on a wide range of simulated domains, e.g., video games [Mnih et al., 2015] and complex robotic tasks [Haarnoja et al., 2018]
- We evaluate our method on challenging continuous control benchmark tasks, and the experimental results demonstrate that the proposed AMPO achieves better sample efficiency than state-of-the-art model-based reinforcement learning (MBRL) and MFRL methods
- According to the results shown in Figure 4(a), we find that: i) the vanilla model training in MBPO itself slowly minimizes the Wasserstein-1 distance between feature distributions; ii) the multi-step training loss in SLBO does help learn invariant features, but the improvement is limited; iii) the model adaptation loss in AMPO is effective in promoting feature distribution alignment, which is consistent with our initial motivation
- We investigate how to explicitly tackle the distribution mismatch problem in MBRL
- We first provide a lower bound to justify the necessity of model adaptation for correcting the potential distribution bias in MBRL
- We observe that generating longer rollouts earlier improves AMPO's performance while slightly degrading MBPO's
- We propose to incorporate unsupervised model adaptation to align the latent feature distributions of real data and simulated data
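
The paper measures alignment via the Wasserstein-1 distance between feature distributions, which AMPO estimates with a critic network. As a minimal stand-in (not the authors' estimator), the 1-D empirical W1 distance is just the mean absolute difference between sorted samples; here is a sketch with made-up Gaussian "features":

```python
import random

def wasserstein1_1d(x, y):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples:
    the mean absolute difference of their order statistics."""
    xs, ys = sorted(x), sorted(y)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

random.seed(0)
real_feats = [random.gauss(0.0, 1.0) for _ in range(1000)]  # features of real data
sim_feats = [random.gauss(0.5, 1.0) for _ in range(1000)]   # features of simulated data

before = wasserstein1_1d(real_feats, sim_feats)
# A trivial "adaptation" step: shift simulated features to match the real mean.
shift = sum(sim_feats) / len(sim_feats) - sum(real_feats) / len(real_feats)
adapted = [s - shift for s in sim_feats]
after = wasserstein1_1d(real_feats, adapted)
# after < before: aligning the feature distributions shrinks the W1 distance.
```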

Methods

- The authors compare AMPO to other model-free and model-based algorithms.
- Soft Actor-Critic (SAC) [Haarnoja et al., 2018] is the state-of-the-art model-free off-policy algorithm in terms of sample efficiency and asymptotic performance, so the authors choose SAC as the model-free baseline.
- Environments: The authors evaluate AMPO and the other baselines on six MuJoCo continuous control tasks with a maximum horizon of 1000 from OpenAI Gym [Brockman et al., 2016]: InvertedPendulum, Swimmer, Hopper, Walker2d, Ant, and HalfCheetah.
- For the other five environments, the authors adopt the same settings as in [Janner et al., 2019].
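
The overall procedure follows the Dyna-style loop that MBPO uses, with model adaptation added at the model-training step. Below is a structural sketch with trivial stand-ins (the real components are an ensemble dynamics model and a SAC agent; nothing here is the authors' implementation):

```python
import random

def ampo_loop(num_iterations=3, rollouts_per_iter=5):
    """Structural sketch of a Dyna-style loop (MBPO-like) with stand-in parts."""
    real_buffer, sim_buffer = [], []
    state = 0.0                                      # toy 1-D environment state
    for _ in range(num_iterations):
        # 1) Collect real transitions with the current policy (random here).
        action = random.uniform(-1, 1)
        next_state = 0.9 * state + action            # unknown true dynamics
        real_buffer.append((state, action, next_state))
        state = next_state
        # 2) Train the dynamics model on real data; AMPO additionally minimizes
        #    an unsupervised adaptation loss that aligns the latent feature
        #    distributions of real and simulated inputs (omitted here).
        # 3) Generate short simulated rollouts, branching from real states.
        for _ in range(rollouts_per_iter):
            s = random.choice(real_buffer)[0]
            a = random.uniform(-1, 1)
            sim_buffer.append((s, a, 0.9 * s + a))   # model prediction stand-in
        # 4) Update the policy (SAC in the paper) on real + simulated data.
    return real_buffer, sim_buffer

real, sim = ampo_loop()
```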

Results

- The learning curves of all compared methods are presented in Figure 2.
- Compared with MBPO, AMPO achieves better performance in all the environments, which verifies the value of model adaptation.
- The authors observe that both the training and validation losses of the dynamics models in AMPO are smaller than those in MBPO throughout the learning process.
- This shows that incorporating model adaptation makes the learned model more accurate.
- The policy optimized on the improved dynamics model can then perform better.

Conclusion

- The authors investigate how to explicitly tackle the distribution mismatch problem in MBRL.
- The authors propose to incorporate unsupervised model adaptation with the intention of aligning the latent feature distributions of real data and simulated data.
- In this way, the model gives more accurate predictions when generating simulated data, and the follow-up policy optimization performance can be improved.
- The authors believe this work takes an important step towards more sample-efficient MBRL.


- Table 1: Hyperparameter settings for AMPO results. [a, b, x, y] denotes a thresholded linear function: the value increases linearly from x to y as the epoch goes from a to b, clamped to x before epoch a and to y after epoch b.
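
Assuming the same notation as MBPO's hyperparameter tables, the [a, b, x, y] schedule can be sketched as:

```python
def thresholded_linear(epoch, a, b, x, y):
    """Thresholded linear schedule: returns x for epoch <= a, y for epoch >= b,
    and interpolates linearly in between (the [a, b, x, y] notation above)."""
    if epoch <= a:
        return x
    if epoch >= b:
        return y
    return x + (epoch - a) * (y - x) / (b - a)
```

For example, a hypothetical rollout-length schedule [20, 100, 1, 15] keeps rollouts at length 1 until epoch 20 and grows them linearly to 15 by epoch 100.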

Related work

- The two important issues in MBRL methods are model learning and model usage. Model learning mainly involves two aspects: (1) the choice of function approximator, such as Gaussian processes [Deisenroth and Rasmussen, 2011], time-varying linear models [Levine et al., 2016], and neural networks [Nagabandi et al., 2018]; and (2) objective design, such as multi-step L2-norm [Luo et al., 2018], log loss [Chua et al., 2018], and adversarial loss [Wu et al., 2019]. Model usage can be roughly categorized into four groups: (1) improving policies using model-free algorithms like Dyna [Sutton, 1990, Luo et al., 2018, Clavera et al., 2018, Janner et al., 2019], (2) using model rollouts to improve target value estimates for temporal difference (TD) learning [Feinberg et al., 2018, Buckman et al., 2018], (3) searching policies with back-propagation through time by exploiting the model derivatives [Deisenroth and Rasmussen, 2011, Levine et al., 2016], and (4) planning by model predictive control (MPC) [Nagabandi et al., 2018, Chua et al., 2018] without an explicit policy. The proposed AMPO framework with model adaptation can be viewed as an innovation in model learning, achieved by additionally adopting an adaptation loss function.

In this paper, we mainly focus on the distribution mismatch problem in deep MBRL [Talvitie, 2014], i.e., the state-action occupancy measure used for model learning mismatches the one generated during model usage. Several previous methods have been proposed to reduce this distribution mismatch.

Funding

- The corresponding author Weinan Zhang is supported by "New Generation of AI 2030" Major Project (2018AAA0100900) and National Natural Science Foundation of China (61702327, 61772333, 61632017, 81771937)

Reference

- [Arjovsky et al., 2017] Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223.
- [Asadi et al., 2019] Asadi, K., Misra, D., Kim, S., and Littman, M. L. (2019). Combating the compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320.
- [Asadi et al., 2018] Asadi, K., Misra, D., and Littman, M. (2018). Lipschitz continuity in model-based reinforcement learning. In International Conference on Machine Learning, pages 264–273.
- [Ben-David et al., 2010] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79(1-2):151–175.
- [Ben-David et al., 2007] Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. (2007). Analysis of representations for domain adaptation. In Advances in neural information processing systems, pages 137–144.
- [Brockman et al., 2016] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
- [Buckman et al., 2018] Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. (2018). Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pages 8224–8234.
- [Chua et al., 2018] Chua, K., Calandra, R., McAllister, R., and Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765.
- [Clavera et al., 2018] Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., and Abbeel, P. (2018). Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, pages 617–629.
- [Dean et al., 2019] Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. (2019). On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pages 1–47.
- [Deisenroth and Rasmussen, 2011] Deisenroth, M. and Rasmussen, C. E. (2011). Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472.
- [Farahmand, 2018] Farahmand, A.-m. (2018). Iterative value-aware model learning. In Advances in Neural Information Processing Systems, pages 9072–9083.
- [Feinberg et al., 2018] Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. (2018). Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101.
- [Fujimoto et al., 2019] Fujimoto, S., Meger, D., and Precup, D. (2019). Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062.
- [Ganin et al., 2016] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016). Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030.
- [Gretton et al., 2012] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773.
- [Gulrajani et al., 2017] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777.
- [Haarnoja et al., 2018] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
- [Hafner et al., 2019] Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. (2019). Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565.
- [Ho and Ermon, 2016] Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Advances in neural information processing systems, pages 4565–4573.
- [Jaksch et al., 2010] Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.
- [Janner et al., 2019] Janner, M., Fu, J., Zhang, M., and Levine, S. (2019). When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498–12509.
- [Kaiser et al., 2019] Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., et al. (2019). Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374.
- [Langlois et al., 2019] Langlois, E., Zhang, S., Zhang, G., Abbeel, P., and Ba, J. (2019). Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057.
- [Levine et al., 2016] Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373.
- [Luo et al., 2018] Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. (2018). Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858.
- [Mnih et al., 2015] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
- [Müller, 1997] Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443.
- [Nagabandi et al., 2018] Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE.
- [Nguyen et al., 2018] Nguyen, N. M., Singh, A., and Tran, K. (2018). Improving model-based rl with adaptive rollout using uncertainty estimation.
- [Shen et al., 2018] Shen, J., Qu, Y., Zhang, W., and Yu, Y. (2018). Wasserstein distance guided representation learning for domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence.
- [Simchowitz et al., 2018] Simchowitz, M., Mania, H., Tu, S., Jordan, M. I., and Recht, B. (2018). Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334.
- [Sriperumbudur et al., 2009] Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. (2009). On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698.
- [Sun et al., 2018] Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. (2018). Model-based reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540.
- [Sutton, 1990] Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier.
- [Szita and Szepesvári, 2010] Szita, I. and Szepesvári, C. (2010). Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038.
- [Talvitie, 2014] Talvitie, E. (2014). Model regularization for stable sample rollouts. In UAI, pages 780–789.
- [Talvitie, 2017] Talvitie, E. (2017). Self-correcting models for model-based reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence.
- [Tzeng et al., 2017] Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. (2017). Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176.
- [Villani, 2008] Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media.
- [Wang and Ba, 2019] Wang, T. and Ba, J. (2019). Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649.
- [Wu et al., 2019] Wu, Y.-H., Fan, T.-H., Ramadge, P. J., and Su, H. (2019). Model imitation for model-based reinforcement learning. arXiv preprint arXiv:1909.11821.
- [Xiao et al., 2019] Xiao, C., Wu, Y., Ma, C., Schuurmans, D., and Müller, M. (2019). Learning to combat compounding-error in model-based reinforcement learning. arXiv preprint arXiv:1912.11206.
- [Yu et al., 2020] Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J., Levine, S., Finn, C., and Ma, T. (2020). Mopo: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239.
- [Zhao et al., 2019] Zhao, H., Combes, R. T. d., Zhang, K., and Gordon, G. J. (2019). On learning invariant representation for domain adaptation. arXiv preprint arXiv:1901.09453.
