Trust the Model When It Is Confident: Masked Model-based Actor-Critic
NeurIPS 2020
It is a popular belief that model-based Reinforcement Learning (RL) is more sample-efficient than model-free RL, but in practice this is not always true due to compounding model errors. In complex and noisy settings, model-based RL tends to have trouble using the model if it does not know when to trust it. In this work, we find …
- Deep Reinforcement Learning (RL) has achieved great success in complex decision-making problems [17, 21, 9]
- A fundamental concern with model-based RL (MBRL) is that learning a good policy requires an accurate model, which in turn requires a large number of interactions with the true environment
- Regarding this issue, the theoretical results of MBPO suggest using the model only when the model error is “sufficiently small”, which contradicts the intuition that MBRL should be most useful in low-data scenarios
- We show two motivating examples in Figure 1, which demonstrate that MBPO, a state-of-the-art model-based reinforcement learning (MBRL) method, suffers from the limitations mentioned above
- We attribute this to the fact that Walker2d is known to be more difficult than HalfCheetah, so the performance of an MBRL algorithm on Walker2d is more sensitive to model errors
- We derive a general performance bound for model-based RL and theoretically show that the divergence between the return in the model rollouts and that in the real environment can be reduced with restricted model usage
- Through extensive experiments on continuous control benchmarks, we show that M2AC significantly outperforms state-of-the-art methods, with enhanced stability and robustness even in challenging noisy tasks with long model rollouts
- This paper aims to make reinforcement learning more reliable in practical use
- The authors' experiments are designed to investigate two questions: (1) How does M2AC compare to prior model-free and model-based methods in sample efficiency? (2) Is M2AC stable and robust across various settings? How does its robustness compare to existing model-based methods?
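The mechanism these questions probe, restricting model usage to transitions the model is confident about, can be sketched in a few lines. The sketch below is an illustration only: the `ensemble` and `policy` interfaces, the disagreement-based uncertainty score, and the hard threshold are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def masked_rollout(ensemble, policy, start_states, horizon, max_uncertainty):
    """Roll out a learned ensemble of dynamics models, keeping only the
    transitions the model is confident about (disagreement below a threshold)."""
    buffer = []
    states = np.asarray(start_states)
    for _ in range(horizon):
        actions = policy(states)
        # each ensemble member predicts the next state; their spread serves
        # as a simple epistemic-uncertainty estimate (an assumption here)
        preds = np.stack([m(states, actions) for m in ensemble])  # (E, N, d)
        next_states = preds.mean(axis=0)
        uncertainty = preds.std(axis=0).mean(axis=-1)             # (N,)
        keep = uncertainty < max_uncertainty                      # the mask
        buffer.extend(zip(states[keep], actions[keep], next_states[keep]))
        states = next_states
    return buffer
```

Rewards are omitted for brevity; in practice each kept transition would also carry a model-predicted reward, possibly penalized by the same uncertainty score.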
5.1 Baselines and implementation
The authors' baselines include the model-free methods PPO and SAC, and two state-of-the-art model-based methods, STEVE and MBPO.
- In the implementation of MBRL methods, the most important hyper-parameter is the maximum model rollout length Hmax.
- To understand the behavior and robustness of MBRL methods in a more realistic setting, the authors conduct experiments in noisy environments with very few interactions, which is challenging for modern deep RL.
- In HalfCheetah-based tasks, M2AC is robust to the rollout length even in the most difficult Noisy2 environment.
- Overall, compared to MBPO, which can barely learn anything from the noisy Walker2d environments, M2AC is significantly more robust
- Although noisy dynamics are naturally harder to learn, the authors observe that M2AC performs robustly in all the environments and significantly outperforms MBPO.
- The authors derive a general performance bound for model-based RL and theoretically show that the divergence between the return in the model rollouts and that in the real environment can be reduced with restricted model usage.
- The authors point out that previous methods that work well under ideal conditions can perform poorly in more realistic settings and are not robust to some hyper-parameters
- The authors suggest that these factors should be taken into account for future work.
- The authors improve the robustness and generalization ability by restricting model use with a notion of uncertainty
- Such an insight may be broadly applicable to building reliable Artificial Intelligence systems
- The research is supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002104 and the National Natural Science Foundation of China under Grant No. U1811461
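The "notion of uncertainty" used to restrict model use is typically obtained from an ensemble of learned dynamics models. One common recipe is a one-vs-rest disagreement score: compare each member's Gaussian prediction against the aggregate of the remaining members. The sketch below is a hedged illustration of that idea; the aggregation rule and the KL form are assumptions, not necessarily the paper's exact estimator.

```python
import numpy as np

def one_vs_rest_uncertainty(means, stds):
    """Score each sample by the average KL divergence from one ensemble
    member's Gaussian prediction to an aggregate Gaussian of the rest.
    means, stds: arrays of shape (E, N, d) from E ensemble members."""
    E = means.shape[0]
    scores = np.zeros(means.shape[1])
    for i in range(E):
        rest = [j for j in range(E) if j != i]
        mu_r = means[rest].mean(axis=0)
        # moment-matched std of the rest (a simplifying assumption)
        sd_r = np.sqrt((stds[rest] ** 2 + means[rest] ** 2).mean(axis=0)
                       - mu_r ** 2)
        # KL( N(mu_i, sd_i) || N(mu_r, sd_r) ), summed over state dims
        kl = (np.log(sd_r / stds[i])
              + (stds[i] ** 2 + (means[i] - mu_r) ** 2) / (2 * sd_r ** 2)
              - 0.5)
        scores += kl.sum(axis=-1)
    return scores / E
```

When all members agree, the score is zero; the more one member diverges from the others, the larger the score, and the more likely the corresponding transition is masked out.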
Study subjects and analysis
ablation studies: 3
We observe that it performs robustly across almost all settings. We then conducted three ablation studies on the masking mechanism, shown together in the lower panel of Figure 5. For the penalty coefficient α, we observe that M2AC with α = 0.001 performs the best among {0.01, 0.001, 0}
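The penalty coefficient α in the ablation above controls how strongly a model-generated transition's reward is discounted by the model's uncertainty about it; a common form is an uncertainty-penalized reward r̃ = r − α·u, with α = 0 recovering the raw model reward. A minimal sketch, assuming this linear form:

```python
import numpy as np

def penalize(rewards, uncertainties, alpha=0.001):
    """Uncertainty-penalized rewards for model-generated transitions:
    r_tilde = r - alpha * u. With alpha = 0 the raw rewards are used."""
    rewards = np.asarray(rewards, dtype=float)
    uncertainties = np.asarray(uncertainties, dtype=float)
    return rewards - alpha * uncertainties
```

A larger α makes the agent more conservative about uncertain model predictions, which is one plausible reading of why a small nonzero value (α = 0.001) beat both a larger penalty and no penalty in the ablation.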
- Joshua Achiam. Spinning Up in Deep Reinforcement Learning. 2018.
- Sander Adam, Lucian Busoniu, and Robert Babuska. Experience replay for real-time reinforcement learning control. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2):201–212, 2012.
- Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, 2018.
- Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems. 2018.
- Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011.
- Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017.
- Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph E. Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. In International Conference on Machine Learning, 2018.
- David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
- Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.
- Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. arXiv preprint arXiv:1906.08253, 2019.
- Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, 2017.
- Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.
- Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
- Lihong Li, Michael L Littman, Thomas J Walsh, and Alexander L Strehl. Knows what it knows: a framework for self-aware learning. Machine learning, 82(3):399–443, 2011.
- Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations, 2019.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Feiyang Pan, Qingpeng Cai, An-Xiang Zeng, Chun-Xiang Pan, Qing Da, Hualin He, Qing He, and Pingzhong Tang. Policy optimization with model-based explorations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4675–4682, 2019.
- John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017.
- Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning, 1990.
- Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.
- Tingwu Wang and Jimmy Ba. Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649, 2019.