Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning

NeurIPS 2020


Abstract

Model-based reinforcement learning (RL) has shown great potential in various control tasks in terms of both sample-efficiency and final performance. However, learning a generalizable dynamics model robust to changes in dynamics remains a challenge since the target transition dynamics follow a multi-modal distribution. In this paper, we ...

Introduction
  • Deep reinforcement learning (RL) has exhibited wide success in solving sequential decision-making problems [23, 39, 45].
  • Despite the strong asymptotic performance, the applications of model-free RL have largely been limited to simulated domains due to its high sample complexity.
  • For this reason, model-based RL has been gaining considerable attention as a sample-efficient alternative, with an eye towards robotics and other physics domains.
  • However, learned dynamics models often fail to provide accurate predictions when the transition dynamics change, which makes model-based RL algorithms unreliable to deploy in real-world environments where partially unspecified dynamics are common; for instance, a deployed robot might not know a priori various features of the terrain it has to navigate.
Highlights
  • Deep reinforcement learning (RL) has exhibited wide success in solving sequential decision-making problems [23, 39, 45]
  • We present a new model-based RL algorithm, coined trajectory-wise multiple choice learning (T-MCL), that can approximate the multi-modal distribution of transition dynamics in an unsupervised manner (a minimal sketch of the trajectory-wise objective follows this list)
  • We present trajectory-wise multiple choice learning, a new model-based RL algorithm that learns a multi-headed dynamics model for dynamics generalization
  • We show that our method can capture the multi-modal nature of environments in an unsupervised manner, and outperform existing model-based RL methods
  • In this paper, we focus on developing a more robust and generalizable RL algorithm, which could improve the applicability of deep RL to various real-world applications, such as robotic manipulation [17] and package delivery [2]
  • Our method significantly outperforms all model-based RL baselines in all environments
  • Such advances in the robustness of RL algorithms could contribute to improved societal productivity through the safe and efficient use of autonomous agents across a diverse range of industries
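
To make the highlighted idea concrete, below is a minimal sketch of a trajectory-wise winner-take-all (multiple choice learning) objective for a multi-headed dynamics model, written in PyTorch. This is an illustration under simplifying assumptions (plain MLP, mean-squared next-state error); the paper's actual architecture, context encoding, and loss details may differ, and all names here are illustrative.

```python
# Minimal sketch of a trajectory-wise winner-take-all (multiple choice) loss
# for a multi-headed dynamics model. Assumptions: PyTorch, one MLP backbone
# with per-head output layers, mean-squared next-state error.
import torch
import torch.nn as nn


class MultiHeadDynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, num_heads=4, hidden_dim=200):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Each head predicts the next state independently.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, state_dim) for _ in range(num_heads)]
        )

    def forward(self, states, actions):
        # states: (B, T, S), actions: (B, T, A) -> predictions: (H, B, T, S)
        features = self.backbone(torch.cat([states, actions], dim=-1))
        return torch.stack([head(features) for head in self.heads], dim=0)


def trajectory_wise_mcl_loss(model, states, actions, next_states):
    """Winner-take-all loss where the winner is chosen per trajectory segment
    (not per transition), so each head can specialize to one dynamics mode."""
    preds = model(states, actions)                                        # (H, B, T, S)
    per_step_err = ((preds - next_states.unsqueeze(0)) ** 2).mean(dim=-1)  # (H, B, T)
    per_traj_err = per_step_err.mean(dim=-1)                               # (H, B)
    winner = per_traj_err.argmin(dim=0)                                    # (B,)
    # Only the winning head receives gradient for each trajectory segment.
    return per_traj_err.gather(0, winner.unsqueeze(0)).mean()
```

Aggregating the error over an entire segment before choosing the winner is what distinguishes the trajectory-wise variant from per-transition multiple choice learning: the assignment reflects the dynamics mode that generated the whole segment rather than per-step noise.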
Methods
  • The authors demonstrate the effectiveness of the proposed method on classic control problems (i.e., CartPoleSwingUp and Pendulum) from OpenAI Gym [3] and simulated robotic continuous control tasks (i.e., Hopper, SlimHumanoid, HalfCheetah, and CrippledAnt) from the MuJoCo physics engine [44].
  • To evaluate generalization performance, the authors designed environments whose parameters follow a multi-modal distribution, similarly to Packer et al. [33] and Zhou et al. [48] (a sketch of such a parameter-sampling scheme follows this list).
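
As an illustration of this evaluation protocol, the sketch below samples an environment parameter from one of several disjoint ranges (modes) at episode start. The chosen parameter (pendulum mass) and the ranges are illustrative assumptions, not the settings used in the paper, and attribute names may vary across OpenAI Gym versions.

```python
# Hypothetical multi-modal distribution over environment parameters.
# The parameter name (pendulum mass `m`), the mode ranges, and the env id
# are illustrative; check your Gym version for the exact attribute names.
import random

import gym

TRAIN_MODES = [(0.5, 0.7), (1.0, 1.2), (1.5, 1.7)]   # multi-modal training set
TEST_MODES = [(0.3, 0.4), (2.0, 2.2)]                # unseen ranges for testing


def make_pendulum(modes):
    """Pick one mode at random, sample a mass from it, and build the env."""
    low, high = random.choice(modes)
    env = gym.make("Pendulum-v1")
    env.unwrapped.m = random.uniform(low, high)  # modify the pendulum mass
    return env


env = make_pendulum(TRAIN_MODES)
obs = env.reset()  # newer Gym versions return (obs, info) instead
```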
Results
  • The authors empirically show that this adaptive planning significantly improves performance by selecting the prediction head specialized to the environment at hand (see the head-selection sketch after this list).
  • The authors' method significantly outperforms all model-based RL baselines in all environments
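
The following sketch illustrates the adaptive selection step described above: choose the head with the lowest prediction error on recently observed transitions and use only that head's predictions inside the planner. It reuses the hypothetical multi-headed model from the Highlights sketch; the paper's actual selection rule and planner configuration may differ.

```python
# Adaptive head selection (sketch): pick the head that best explains the most
# recent transitions, then query only that head during planning.
import torch


@torch.no_grad()
def select_head(model, recent_states, recent_actions, recent_next_states):
    """Return the index of the head with the lowest error on recent transitions."""
    preds = model(recent_states, recent_actions)                       # (H, B, T, S)
    errors = ((preds - recent_next_states.unsqueeze(0)) ** 2).mean(dim=(1, 2, 3))
    return errors.argmin().item()


@torch.no_grad()
def predict_next_state(model, head_idx, state, action):
    """One-step prediction with the selected head, usable inside any planner."""
    preds = model(state.unsqueeze(0).unsqueeze(0), action.unsqueeze(0).unsqueeze(0))
    return preds[head_idx, 0, 0]                                       # (S,)
```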
Conclusion
  • The authors present trajectory-wise multiple choice learning, a new model-based RL algorithm that learns a multi-headed dynamics model for dynamics generalization.
  • While deep reinforcement learning (RL) has been successful in a range of challenging domains, it still suffers from a lack of generalization ability to unexpected changes in surrounding environmental factors [20, 30]
  • This failure of autonomous agents to generalize across diverse environments is one of the major reasons behind objections to the real-world deployment of RL agents.
  • Such advances in the robustness of RL algorithms could contribute to improved societal productivity through the safe and efficient use of autonomous agents across a diverse range of industries
Related Work
  • Model-based reinforcement learning. By learning a forward dynamics model that approximates the transition dynamics of environments, model-based RL attains superior sample-efficiency. Such a learned dynamics model can be used as a simulator for model-free RL methods [16, 18, 40], as a source of priors or additional features for a policy [9, 47], or for planning ahead to select actions by predicting the future consequences of actions [1, 22, 42] (a minimal planning sketch follows this paragraph). A major challenge in model-based RL is to learn accurate dynamics models that can provide correct future predictions. To this end, numerous methods have been proposed, including ensembles [4] and latent dynamics models [14, 15, 37]. While these methods have made significant progress even in complex domains [15, 37], dynamics models still struggle to provide accurate predictions on unseen environments [20, 30].
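
For concreteness, here is a minimal sketch of planning ahead with a learned dynamics model via the cross-entropy method [6]. `dynamics_fn` and `reward_fn` are placeholders for a learned one-step model and a known reward function; all hyperparameters are illustrative rather than those used in any cited work.

```python
# Cross-entropy-method (CEM) planning with a generic learned one-step model.
# `dynamics_fn(states, actions) -> next_states` and
# `reward_fn(states, actions, next_states) -> rewards` are assumed interfaces.
import torch


@torch.no_grad()
def cem_plan(dynamics_fn, reward_fn, state, action_dim, horizon=12,
             num_candidates=400, num_elites=40, num_iters=5):
    """Iteratively refit a Gaussian over action sequences toward the elites."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(num_iters):
        # Sample candidate action sequences and clip them to the action range.
        seqs = (mean + std * torch.randn(num_candidates, horizon, action_dim)).clamp(-1, 1)
        states = state.unsqueeze(0).repeat(num_candidates, 1)
        returns = torch.zeros(num_candidates)
        for t in range(horizon):
            next_states = dynamics_fn(states, seqs[:, t])
            returns += reward_fn(states, seqs[:, t], next_states)
            states = next_states
        elites = seqs[returns.topk(num_elites).indices]
        mean, std = elites.mean(dim=0), elites.std(dim=0)
    return mean[0]  # execute only the first action (MPC-style replanning)
```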
Funding
  • This research is supported in part by ONR PECASE N000141612723, Tencent, Berkeley Deep Drive, an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), and the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean government (MSIT) (NRF2018R1A5A1059921).
References
  • Atkeson, Christopher G and Santamaria, Juan Carlos. A comparison of direct and model-based reinforcement learning. In International Conference on Robotics and Automation, 1997.
  • Belkhale, Suneel, Li, Rachel, Kahn, Gregory, McAllister, Rowan, Calandra, Roberto, and Levine, Sergey. Model-based meta-reinforcement learning for flight with suspended payloads. arXiv preprint arXiv:2004.11345, 2020.
  • Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Chua, Kurtland, Calandra, Roberto, McAllister, Rowan, and Levine, Sergey. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.
  • Clavera, Ignasi, Rothfuss, Jonas, Schulman, John, Fujita, Yasuhiro, Asfour, Tamim, and Abbeel, Pieter. Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, 2018.
  • De Boer, Pieter-Tjerk, Kroese, Dirk P, Mannor, Shie, and Rubinstein, Reuven Y. A tutorial on the cross-entropy method. Annals of Operations Research, 2005.
  • Deisenroth, Marc Peter, Neumann, Gerhard, and Peters, Jan. A survey on policy search for robotics. Now Publishers, 2013.
  • Dey, Debadeepta, Ramakrishna, Varun, Hebert, Martial, and Andrew Bagnell, J. Predicting multiple structured visual interpretations. In International Conference on Computer Vision, 2015.
  • Du, Yilun and Narasimhan, Karthik. Task-agnostic dynamics priors for deep reinforcement learning. In International Conference on Machine Learning, 2019.
  • Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter L, Sutskever, Ilya, and Abbeel, Pieter. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
  • Guzman-Rivera, Abner, Batra, Dhruv, and Kohli, Pushmeet. Multiple choice learning: Learning to produce multiple structured outputs. In Advances in Neural Information Processing Systems, 2012.
  • Guzman-Rivera, Abner, Kohli, Pushmeet, Batra, Dhruv, and Rutenbar, Rob. Efficiently enforcing diversity in multi-output structured prediction. In International Conference on Artificial Intelligence and Statistics, 2014.
  • Hafner, Danijar, Lillicrap, Timothy, Fischer, Ian, Villegas, Ruben, Ha, David, Lee, Honglak, and Davidson, James. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, 2019.
  • Hafner, Danijar, Lillicrap, Timothy, Ba, Jimmy, and Norouzi, Mohammad. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020.
  • Janner, Michael, Fu, Justin, Zhang, Marvin, and Levine, Sergey. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, 2019.
  • Kalashnikov, Dmitry, Irpan, Alex, Pastor, Peter, Ibarz, Julian, Herzog, Alexander, Jang, Eric, Quillen, Deirdre, Holly, Ethan, Kalakrishnan, Mrinal, Vanhoucke, Vincent, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, 2018.
  • Kurutach, Thanard, Clavera, Ignasi, Duan, Yan, Tamar, Aviv, and Abbeel, Pieter. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.
  • Lee, Kimin, Hwang, Changho, Park, Kyoung Soo, and Shin, Jinwoo. Confident multiple choice learning. In International Conference on Machine Learning, 2017.
  • Lee, Kimin, Seo, Younggyo, Lee, Seunghyun, Lee, Honglak, and Shin, Jinwoo. Context-aware dynamics model for generalization in model-based reinforcement learning. In International Conference on Machine Learning, 2020.
  • Lee, Stefan, Prakash, Senthil Purushwalkam Shiva, Cogswell, Michael, Ranjan, Viresh, Crandall, David, and Batra, Dhruv. Stochastic multiple choice learning for training diverse deep ensembles. In Advances in Neural Information Processing Systems, 2016.
  • Lenz, Ian, Knepper, Ross A, and Saxena, Ashutosh. DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems, 2015.
  • Levine, Sergey and Abbeel, Pieter. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, 2014.
  • Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • Lillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • van der Maaten, Laurens and Hinton, Geoffrey. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • Maciejowski, Jan Marian. Predictive control: with constraints. Pearson Education, 2002.
  • Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Mun, Jonghwan, Lee, Kimin, Shin, Jinwoo, and Han, Bohyung. Learning to specialize with knowledge distillation for visual question answering. In Advances in Neural Information Processing Systems, 2018.
  • Nagabandi, Anusha, Clavera, Ignasi, Liu, Simin, Fearing, Ronald S, Abbeel, Pieter, Levine, Sergey, and Finn, Chelsea. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In International Conference on Learning Representations, 2019.
  • Nagabandi, Anusha, Finn, Chelsea, and Levine, Sergey. Deep online learning via meta-learning: Continual adaptation for model-based RL. In International Conference on Learning Representations, 2019.
  • Narendra, Kumpati S and Balakrishnan, Jeyendran. Improving transient response of adaptive control systems using multiple models and switching. IEEE Transactions on Automatic Control, 1994.
  • Packer, Charles, Gao, Katelyn, Kos, Jernej, Krähenbühl, Philipp, Koltun, Vladlen, and Song, Dawn. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282, 2018.
  • Rakelly, Kate, Zhou, Aurick, Quillen, Deirdre, Finn, Chelsea, and Levine, Sergey. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, 2019.
  • Sæmundsson, Steindór, Hofmann, Katja, and Deisenroth, Marc Peter. Meta reinforcement learning with latent variable Gaussian processes. In Conference on Uncertainty in Artificial Intelligence, 2018.
  • Sanchez-Gonzalez, Alvaro, Heess, Nicolas, Springenberg, Jost Tobias, Merel, Josh, Riedmiller, Martin, Hadsell, Raia, and Battaglia, Peter. Graph networks as learnable physics engines for inference and control. In International Conference on Machine Learning, 2018.
  • Schrittwieser, Julian, Antonoglou, Ioannis, Hubert, Thomas, Simonyan, Karen, Sifre, Laurent, Schmitt, Simon, Guez, Arthur, Lockhart, Edward, Hassabis, Demis, Graepel, Thore, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
  • Schulman, John, Levine, Sergey, Abbeel, Pieter, Jordan, Michael, and Moritz, Philipp. Trust region policy optimization. In International Conference on Machine Learning, 2015.
  • Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
  • Sutton, Richard S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning, 1990.
  • Sutton, Richard S and Barto, Andrew G. Reinforcement learning: An introduction. MIT Press, 2018.
  • Tassa, Yuval, Erez, Tom, and Todorov, Emanuel. Synthesis and stabilization of complex behaviors through online trajectory optimization. In International Conference on Intelligent Robots and Systems, 2012.
  • Tian, Kai, Xu, Yi, Zhou, Shuigeng, and Guan, Jihong. Versatile multiple choice learning and its application to vision computing. In International Conference on Computer Vision, 2019.
  • Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.
  • Vinyals, Oriol, Babuschkin, Igor, Czarnecki, Wojciech M, Mathieu, Michael, Dudzik, Andrew, Chung, Junyoung, Choi, David H, Powell, Richard, Ewalds, Timo, Georgiev, Petko, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Wang, Tingwu and Ba, Jimmy. Exploring model-based planning with policy networks. In International Conference on Learning Representations, 2020.
  • Whitney, William, Agarwal, Rajat, Cho, Kyunghyun, and Gupta, Abhinav. Dynamics-aware embeddings. In International Conference on Learning Representations, 2020.
  • Zhou, Wenxuan, Pinto, Lerrel, and Gupta, Abhinav. Environment probing interaction policies. In International Conference on Learning Representations, 2019.