Trust the Model When It Is Confident: Masked Model-based Actor-Critic

NeurIPS 2020

Abstract

It is a popular belief that model-based Reinforcement Learning (RL) is more sample efficient than model-free RL, but in practice, it is not always true due to overweighed model errors. In complex and noisy settings, model-based RL tends to have trouble using the model if it does not know when to trust the model. In this work, we find ...

Introduction
  • Deep RL has achieved great successes in complex decision-making problems [17, 21, 9].
  • There is a fundamental concern in MBRL [7] that learning a good policy requires an accurate model, which in turn requires a large number of interactions with the true environment
  • Regarding this issue, the theoretical results of MBPO [11] suggest using the model only when the model error is "sufficiently small", which contradicts the intuition that MBRL should be most useful in low-data scenarios (a schematic form of such a return-discrepancy bound is sketched below).
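As a purely illustrative aside (not the paper's exact theorem), a simulation-lemma-style return-discrepancy bound typically takes the schematic form

    \left| \eta_{\mathrm{env}}(\pi) - \eta_{\mathrm{model}}(\pi) \right|
      \;\le\; C_1(\gamma, r_{\max})\,\epsilon_\pi \;+\; C_2(\gamma, r_{\max})\,H\,\epsilon_m ,

where \epsilon_\pi measures how far the current policy has moved from the data-collecting policy, \epsilon_m is the one-step model error, H is the model rollout length, and C_1, C_2 depend on the discount \gamma and the reward bound r_{\max}. The constants and the exact functional form here are placeholders; the takeaway is only that shorter or restricted model rollouts shrink the model-error term, which is the intuition the paper formalizes.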
Highlights
  • Deep Reinforcement Learning (RL) has achieved great successes in complex decision-making problems [17, 21, 9]
  • We show two motivating examples in Figure 1, which demonstrate that MBPO [11], a state-of-the-art model-based reinforcement learning (MBRL) method, suffers from the mentioned limitations
  • We attribute this to the fact that Walker2d is known to be more difficult than HalfCheetah, so the performance of an MBRL algorithm on Walker2d is more sensitive to model errors
  • We derive a general performance bound for model-based RL and theoretically show that the divergence between the return in the model rollouts and that in the real environment can be reduced with restricted model usage (a minimal sketch of such uncertainty-based masking follows this list)
  • By extensive experiments on continuous control benchmarks, we show that M2AC significantly outperforms state-of-the-art methods with enhanced stability and robustness even in challenging noisy tasks with long model rollouts
  • This paper aims to make reinforcement learning more reliable in practical use
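The "use the model only where it is confident" idea can be illustrated with a minimal Python sketch. Everything below is an assumption for exposition rather than the authors' implementation: the uncertainty score is ensemble disagreement, and the masking rule simply keeps the most confident fraction of model-generated transitions.

    import numpy as np

    def ensemble_disagreement(ensemble_predictions):
        # Uncertainty score per transition: variance of the next-state predictions
        # across ensemble members, averaged over state dimensions.
        # ensemble_predictions has shape (n_models, batch, state_dim).
        return ensemble_predictions.var(axis=0).mean(axis=-1)

    def mask_model_transitions(transitions, ensemble_predictions, keep_fraction=0.5):
        # Keep only the keep_fraction most confident model-generated transitions
        # for policy training; the rest are masked out.
        uncertainty = ensemble_disagreement(ensemble_predictions)
        threshold = np.quantile(uncertainty, keep_fraction)
        mask = uncertainty <= threshold
        return [t for t, keep in zip(transitions, mask) if keep]

    # Tiny demo with fake data: 5 ensemble members, 8 transitions, 3-dim states.
    rng = np.random.default_rng(0)
    preds = rng.normal(size=(5, 8, 3))
    fake_transitions = [("s", "a", 0.0, "s_next")] * 8
    print(len(mask_model_transitions(fake_transitions, preds)))  # roughly half are kept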
Methods
  • The authors' experiments are designed to investigate two questions: (1) How does M2AC compare to prior model-free and model-based methods in sample efficiency? (2) Is M2AC stable and robust across various settings? How does its robustness compare to existing model-based methods?

    5.1 Baselines and implementation

    The authors' baselines include the model-free methods PPO [20] and SAC [9], and two state-of-the-art model-based methods, STEVE [3] and MBPO [11].
  • In the implementation of MBRL methods, the most important hyper-parameter is the maximum model rollout length Hmax (an illustrative rollout loop built around Hmax is sketched after this list).
  • To understand the behavior and robustness of MBRL methods in a more realistic setting, the authors conduct experiments in noisy environments with very few interactions, which is challenging for modern deep RL.
  • In HalfCheetah-based tasks, M2AC is robust to the rollout length even in the most difficult Noisy2 environment.
  • Overall, compared to MBPO, which can hardly learn anything in the noisy Walker2d environments, M2AC is significantly more robust
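To illustrate how the maximum rollout length Hmax can interact with uncertainty-based truncation, here is a rough sketch; DummyModel, its step method, and the fixed uncertainty threshold are hypothetical stand-ins, not the authors' code. Rollouts start from real states, run for at most Hmax steps, and stop early once the model's uncertainty grows too large.

    import numpy as np

    class DummyModel:
        # Stand-in learned dynamics model: random-walk transitions plus a random
        # uncertainty score, only to make the sketch executable.
        def __init__(self, seed=0):
            self.rng = np.random.default_rng(seed)

        def step(self, s, a):
            s_next = s + self.rng.normal(scale=0.1, size=s.shape)
            reward = float(-np.linalg.norm(s_next))
            uncertainty = float(self.rng.uniform())
            return s_next, reward, uncertainty

    def generate_model_data(model, policy, start_states, h_max, u_threshold):
        # Roll the model forward for at most h_max steps from each real start state,
        # stopping a rollout as soon as the uncertainty exceeds u_threshold.
        data = []
        for s in start_states:
            for _ in range(h_max):
                a = policy(s)
                s_next, r, u = model.step(s, a)
                if u > u_threshold:  # stop trusting the model here
                    break
                data.append((s, a, r, s_next))
                s = s_next
        return data

    starts = [np.zeros(3) for _ in range(4)]
    policy = lambda s: -0.1 * s
    batch = generate_model_data(DummyModel(), policy, starts, h_max=5, u_threshold=0.7)
    print(f"collected {len(batch)} model-generated transitions")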
Results
  • Although noisy dynamics are naturally harder to learn, the authors observe that M2AC performs robustly in all environments and significantly outperforms MBPO.
Conclusion
  • The authors derive a general performance bound for model-based RL and theoretically show that the divergence between the return in the model rollouts and that in the real environment can be reduced with restricted model usage.
  • The authors point out that previous methods which work well under ideal conditions can perform poorly in more realistic settings and are not robust to some hyper-parameters
  • The authors suggest that these factors should be taken into account for future work.
  • The authors improve robustness and generalization by restricting model use with a notion of uncertainty
  • Such an insight may apply broadly across areas of Artificial Intelligence for building reliable systems
Funding
  • The research is supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002104 and the National Natural Science Foundation of China under Grant No. U1811461.
Study subjects and analysis
Ablation studies: 3
We observe that M2AC performs robustly across almost all settings. We then conducted three ablation studies on the masking mechanism, demonstrated together in the lower figure of Figure 5. On the penalty coefficient α, we observe that M2AC with α = 0.001 performs best among α ∈ {0.01, 0.001, 0} (a toy sketch of such an uncertainty penalty follows).
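One plausible reading of the penalty coefficient is an uncertainty penalty on model-generated rewards of the form r_tilde = r - alpha * u; the sketch below is that reading only, and the exact penalty used in M2AC may differ. Setting alpha = 0 recovers the unpenalized reward, matching the third setting in the ablation.

    import numpy as np

    def penalized_rewards(rewards, uncertainties, alpha=0.001):
        # r_tilde = r - alpha * u, elementwise: larger model uncertainty lowers
        # the reward assigned to a model-generated transition.
        return np.asarray(rewards, dtype=float) - alpha * np.asarray(uncertainties, dtype=float)

    print(penalized_rewards([1.0, 1.0, 1.0], [0.0, 5.0, 50.0], alpha=0.001))
    # -> [1.    0.995 0.95 ]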

Reference
  • Joshua Achiam. Spinning Up in Deep Reinforcement Learning. 2018.
  • Sander Adam, Lucian Busoniu, and Robert Babuska. Experience replay for real-time reinforcement learning control. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2):201–212, 2012.
  • Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, 2018.
  • Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems. 2018.
  • Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011.
  • Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017.
  • Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph E. Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. In International Conference on Machine Learning, 2018.
  • David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. arXiv preprint arXiv:1906.08253, 2019.
  • Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, 2017.
  • Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.
  • Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pages 6402–6413, 2017.
  • Lihong Li, Michael L Littman, Thomas J Walsh, and Alexander L Strehl. Knows what it knows: a framework for self-aware learning. Machine learning, 82(3):399–443, 2011.
  • Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations, 2019.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Feiyang Pan, Qingpeng Cai, An-Xiang Zeng, Chun-Xiang Pan, Qing Da, Hualin He, Qing He, and Pingzhong Tang. Policy optimization with model-based explorations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4675–4682, 2019.
  • John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017.
  • Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning, 1990.
  • Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.
  • Tingwu Wang and Jimmy Ba. Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649, 2019.