Model-based Policy Optimization with Unsupervised Model Adaptation

NeurIPS 2020


Abstract

Model-based reinforcement learning methods learn a dynamics model with real data sampled from the environment and leverage it to generate simulated data to derive an agent. However, due to the potential distribution mismatch between simulated data and real data, this could lead to degraded performance. Despite much effort being devoted …
Introduction
  • Model-free reinforcement learning (MFRL) has achieved tremendous success on a wide range of simulated domains, e.g., video games [Mnih et al., 2015] and complex robotic tasks [Haarnoja et al., 2018], to name just a few.
  • Even when equipped with a high-capacity model, model error still exists due to the potential distribution mismatch between the training and generating phases, i.e., the state-action input distribution used to train the model differs from the one generated by the model itself [Talvitie, 2014].
  • Because of this gap, the learned model may give inaccurate predictions on the very rollouts it is used to generate (a minimal sketch of where this mismatch arises is given after this list).
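The following is a minimal, illustrative sketch of a Dyna-style MBRL loop, marking where the two phases draw their state-action inputs from different distributions. The objects (`model`, `policy`, the buffers) and their methods are hypothetical placeholders, not the authors' code.

```python
# Illustrative sketch of a Dyna-style MBRL loop with hypothetical objects,
# showing where the training/generating distribution mismatch arises.
import numpy as np

def train_model(model, real_buffer, batch_size=256):
    # Training phase: (s, a) inputs are sampled from REAL environment data.
    s, a, s_next = real_buffer.sample(batch_size)
    model.fit(inputs=np.concatenate([s, a], axis=-1), targets=s_next)

def generate_rollouts(model, policy, real_buffer, sim_buffer, horizon=5):
    # Generating phase: after the first step, (s, a) inputs are produced by
    # the policy and the model itself, so they can drift away from the
    # distribution the model was trained on [Talvitie, 2014].
    s, _, _ = real_buffer.sample(1)               # branch from a real state
    for _ in range(horizon):
        a = policy.act(s)
        s_next = model.predict(np.concatenate([s, a], axis=-1))
        sim_buffer.add(s, a, s_next)              # simulated data for the agent
        s = s_next                                # feed the model its own output
```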
Highlights
  • In recent years, model-free reinforcement learning (MFRL) has achieved tremendous success on a wide range of simulated domains, e.g., video games [Mnih et al., 2015] and complex robotic tasks [Haarnoja et al., 2018].
  • We evaluate our method on challenging continuous control benchmark tasks, and the experimental results demonstrate that the proposed AMPO achieves better performance than state-of-the-art model-based reinforcement learning (MBRL) and MFRL methods in terms of sample efficiency.
  • According to the results shown in Figure 4(a), we find that: i) the vanilla model training in MBPO itself can slowly minimize the Wasserstein-1 distance between feature distributions; ii) the multi-step training loss in SLBO does help learn invariant features but the improvement is limited; iii) the model adaptation loss in AMPO is effective in promoting feature distribution alignment, which is consistent with our initial motivation
  • We investigate how to explicitly tackle the distribution mismatch problem in MBRL
  • We first provide a lower bound to justify the necessity of model adaptation to correct the potential distribution bias in MBRL
  • We observe that generating longer rollouts earlier improves the performance of AMPO, while it slightly degrades the performance of MBPO.
  • We propose to incorporate unsupervised model adaptation with the intention of aligning the latent feature distributions of real data and simulated data (a sketch of one way such an alignment loss can be implemented follows this list).
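Below is a minimal sketch of one way such a feature-alignment term can be implemented, assuming a WGAN-GP-style critic [Arjovsky et al., 2017; Gulrajani et al., 2017] applied to latent features of real and simulated (state, action) batches. The network sizes and function names are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch: Wasserstein-1 alignment of real vs. simulated latent
# features with a WGAN-GP-style critic (all sizes/names are illustrative).
import torch
import torch.nn as nn

feature_dim, hidden = 64, 128
critic = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                       nn.Linear(hidden, 1))

def gradient_penalty(critic, real_feat, sim_feat):
    # WGAN-GP penalty on features interpolated between the two batches.
    eps = torch.rand(real_feat.size(0), 1)
    interp = (eps * real_feat + (1 - eps) * sim_feat).requires_grad_(True)
    grad, = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
    return ((grad.norm(2, dim=1) - 1) ** 2).mean()

def critic_loss(critic, real_feat, sim_feat, lam=10.0):
    # Critic step: estimate the Wasserstein-1 gap between the two feature
    # distributions; features are detached so the encoder is not updated here.
    real_feat, sim_feat = real_feat.detach(), sim_feat.detach()
    w1_gap = critic(real_feat).mean() - critic(sim_feat).mean()
    return -w1_gap + lam * gradient_penalty(critic, real_feat, sim_feat)

def adaptation_loss(critic, real_feat, sim_feat):
    # Model/encoder step: shrink the critic's estimate of the gap so that
    # simulated-data features become indistinguishable from real-data ones.
    return critic(real_feat).mean() - critic(sim_feat).mean()
```

In practice the two losses would be optimized alternately, with the adaptation term added to the ordinary model-fitting objective.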
Methods
  • The authors compare AMPO to other model-free and model-based algorithms.
  • Soft Actor-Critic (SAC) [Haarnoja et al., 2018] is the state-of-the-art model-free off-policy algorithm in terms of sample efficiency and asymptotic performance, so the authors choose SAC as the model-free baseline.
  • Environments: The authors evaluate AMPO and the other baselines on six MuJoCo continuous control tasks with a maximum horizon of 1000 from OpenAI Gym [Brockman et al., 2016]: InvertedPendulum, Swimmer, Hopper, Walker2d, Ant, and HalfCheetah (see the setup sketch after this list).
  • For the other five environments, the authors adopt the same settings as in [Janner et al., 2019].
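As a rough illustration of this setup (not the authors' scripts), the six tasks can be instantiated from OpenAI Gym as below; the `-v2` version suffixes are an assumption, and the truncated-observation variants used in MBPO-style code may differ slightly.

```python
# Hedged sketch of the benchmark environments (version suffixes assumed).
import gym

TASKS = ["InvertedPendulum-v2", "Swimmer-v2", "Hopper-v2",
         "Walker2d-v2", "Ant-v2", "HalfCheetah-v2"]

for name in TASKS:
    env = gym.make(name)
    # Each of these tasks is registered with a 1000-step time limit,
    # matching the maximum horizon stated above.
    print(name, env.observation_space.shape, env.action_space.shape,
          env.spec.max_episode_steps)
```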
Results
  • The learning curves of all compared methods are presented in Figure 2.
  • Compared with MBPO, the approach achieves better performance in all the environments, which verifies the value of model adaptation.
  • The authors observe that both the training and validation losses of the dynamics models in AMPO are smaller than those in MBPO throughout the learning process.
  • This shows that incorporating model adaptation makes the learned model more accurate.
  • In turn, the policy optimized with the improved dynamics model can perform better.
Conclusion
  • The authors investigate how to explicitly tackle the distribution mismatch problem in MBRL.
  • The authors propose to incorporate unsupervised model adaptation with the intention of aligning the latent feature distributions of real data and simulated data.
  • In this way, the model gives more accurate predictions when generating simulated data, and the follow-up policy optimization performance can be improved.
  • The authors believe the work takes an important step towards more sample-efficient MBRL
Tables
  • Table 1: Hyperparameter settings for AMPO results. [a, b, x, y] denotes a thresholded linear function (the sketch below spells out the assumed convention).
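For reference, a thresholded linear schedule can be computed as below. This assumes the same convention as MBPO [Janner et al., 2019], where [a, b, x, y] means the value ramps linearly from x to y as the training epoch goes from a to b and is clipped outside that range; treat this as an assumption about the table's notation rather than the authors' exact definition.

```python
# Hedged sketch: thresholded linear schedule, assuming the MBPO-style
# convention where [a, b, x, y] ramps the value from x to y over epochs a..b.
def thresholded_linear(epoch, a, b, x, y):
    frac = (epoch - a) / float(b - a)
    value = x + frac * (y - x)
    return min(max(value, min(x, y)), max(x, y))  # clip to the [x, y] range

# Illustrative numbers (not necessarily from Table 1): hold at 1 until
# epoch 20, grow linearly to 25 by epoch 100, then stay at 25.
print(thresholded_linear(10, 20, 100, 1, 25))   # -> 1
print(thresholded_linear(60, 20, 100, 1, 25))   # -> 13.0
print(thresholded_linear(150, 20, 100, 1, 25))  # -> 25
```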
Funding
  • The corresponding author Weinan Zhang is supported by "New Generation of AI 2030" Major Project (2018AAA0100900) and National Natural Science Foundation of China (61702327, 61772333, 61632017, 81771937)
References
  • [Arjovsky et al., 2017] Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223.
  • [Asadi et al., 2019] Asadi, K., Misra, D., Kim, S., and Littman, M. L. (2019). Combating the compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320.
  • [Asadi et al., 2018] Asadi, K., Misra, D., and Littman, M. (2018). Lipschitz continuity in model-based reinforcement learning. In International Conference on Machine Learning, pages 264–273.
  • [Ben-David et al., 2010] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79(1-2):151–175.
  • [Ben-David et al., 2007] Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. (2007). Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137–144.
  • [Brockman et al., 2016] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
  • [Buckman et al., 2018] Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. (2018). Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pages 8224–8234.
  • [Chua et al., 2018] Chua, K., Calandra, R., McAllister, R., and Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765.
  • [Clavera et al., 2018] Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., and Abbeel, P. (2018). Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, pages 617–629.
  • [Dean et al., 2019] Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. (2019). On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pages 1–47.
  • [Deisenroth and Rasmussen, 2011] Deisenroth, M. and Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472.
  • [Farahmand, 2018] Farahmand, A.-m. (2018). Iterative value-aware model learning. In Advances in Neural Information Processing Systems, pages 9072–9083.
  • [Feinberg et al., 2018] Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. (2018). Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101.
  • [Fujimoto et al., 2019] Fujimoto, S., Meger, D., and Precup, D. (2019). Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062.
  • [Ganin et al., 2016] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016). Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030.
  • [Gretton et al., 2012] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773.
  • [Gulrajani et al., 2017] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777.
  • [Haarnoja et al., 2018] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
  • [Hafner et al., 2019] Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. (2019). Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565.
  • [Ho and Ermon, 2016] Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573.
  • [Jaksch et al., 2010] Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.
  • [Janner et al., 2019] Janner, M., Fu, J., Zhang, M., and Levine, S. (2019). When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498–12509.
  • [Kaiser et al., 2019] Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., et al. (2019). Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374.
  • [Langlois et al., 2019] Langlois, E., Zhang, S., Zhang, G., Abbeel, P., and Ba, J. (2019). Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057.
  • [Levine et al., 2016] Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373.
  • [Luo et al., 2018] Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. (2018). Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858.
  • [Mnih et al., 2015] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
  • [Müller, 1997] Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443.
  • [Nagabandi et al., 2018] Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE.
  • [Nguyen et al., 2018] Nguyen, N. M., Singh, A., and Tran, K. (2018). Improving model-based RL with adaptive rollout using uncertainty estimation.
  • [Shen et al., 2018] Shen, J., Qu, Y., Zhang, W., and Yu, Y. (2018). Wasserstein distance guided representation learning for domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [Simchowitz et al., 2018] Simchowitz, M., Mania, H., Tu, S., Jordan, M. I., and Recht, B. (2018). Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334.
  • [Sriperumbudur et al., 2009] Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. (2009). On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698.
  • [Sun et al., 2018] Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. (2018). Model-based reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540.
  • [Sutton, 1990] Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier.
  • [Szita and Szepesvári, 2010] Szita, I. and Szepesvári, C. (2010). Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038.
  • [Talvitie, 2014] Talvitie, E. (2014). Model regularization for stable sample rollouts. In UAI, pages 780–789.
  • [Talvitie, 2017] Talvitie, E. (2017). Self-correcting models for model-based reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence.
  • [Tzeng et al., 2017] Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. (2017). Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176.
  • [Villani, 2008] Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media.
  • [Wang and Ba, 2019] Wang, T. and Ba, J. (2019). Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649.
  • [Wu et al., 2019] Wu, Y.-H., Fan, T.-H., Ramadge, P. J., and Su, H. (2019). Model imitation for model-based reinforcement learning. arXiv preprint arXiv:1909.11821.
  • [Xiao et al., 2019] Xiao, C., Wu, Y., Ma, C., Schuurmans, D., and Müller, M. (2019). Learning to combat compounding-error in model-based reinforcement learning. arXiv preprint arXiv:1912.11206.
  • [Yu et al., 2020] Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J., Levine, S., Finn, C., and Ma, T. (2020). MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239.
  • [Zhao et al., 2019] Zhao, H., Combes, R. T. d., Zhang, K., and Gordon, G. J. (2019). On learning invariant representation for domain adaptation. arXiv preprint arXiv:1901.09453.