TL;DR:
We present a robust reinforcement learning algorithm, Structured Maximum Entropy Reinforcement Learning, for learning RL policies that can extrapolate to out-of-distribution test conditions with only a small number of trials

One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL

NeurIPS 2020


Abstract

While reinforcement learning algorithms can learn effective policies for complex tasks, these policies are often brittle to even minor task variations, especially when variations are not explicitly provided during training. One natural approach to this problem is to train agents with manually specified variation in the training task or ...

Introduction
  • Deep reinforcement learning (RL) algorithms have demonstrated promising results on a variety of complex tasks, such as robotic manipulation [21, 12] and strategy games [26, 37].
  • The authors' algorithm, Structured Maximum Entropy Reinforcement Learning (SMERL), optimizes the approximate objective on a single training MDP; a sketch of the practical objective is given after this list.
  • To analyze the connection between the policies learned by SMERL and robustness to test MDPs, the authors consider a related robustness set, defined in terms of sub-optimal policies on the training MDP (their Definition 2).
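For concreteness, the practical training objective described above can be sketched as follows. This is a reconstruction from the summary bullets using standard DIAYN notation, not a verbatim copy of the paper's equation, so the symbols (the weight α, the tolerance ε, the optimal training return R*, and the discriminator q_φ) should be checked against the original. SMERL trains a latent-conditioned policy π_θ(a | s, z) and adds a DIAYN-style diversity bonus to the environment reward only on trajectories τ whose return is already within ε of the optimal return on the training MDP:

    \tilde{r}(s_t, a_t \mid z) \;=\; r(s_t, a_t) \;+\; \alpha \, \mathbf{1}\big[ R(\tau) \ge R^{*} - \varepsilon \big] \, \log q_\phi(z \mid s_t)

Each latent code z is thereby pushed toward a distinct, state-distinguishable way of solving the task, rather than toward distinct but unrewarding behavior.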
Highlights
  • Deep reinforcement learning (RL) algorithms have demonstrated promising results on a variety of complex tasks, such as robotic manipulation [21, 12] and strategy games [26, 37]
  • When the perturbation magnitude is small, the optimal policies of the test MDPs achieve near-optimal return when executed on the training Markov decision process (MDP) and yield the same trajectories when executed on the train and test MDPs, so those test MDPs satisfy the definition of the MDP robustness set to which Structured Maximum Entropy Reinforcement Learning (SMERL) can be expected to be robust (a sketch of this set follows this list)
  • We present a robust RL algorithm, SMERL, for learning RL policies that can extrapolate to out-of-distribution test conditions with only a small number of trials
  • In our theoretical analysis of SMERL, we formally describe the types of test MDPs under which we can expect SMERL to generalize
  • Our empirical results suggest that SMERL is more robust to various test conditions and outperforms prior diversity-driven RL approaches
  • Structured max-ent RL may be helpful for situations other than robustness, such as hierarchical RL or transfer learning settings when learned behaviors need to be reused for new purposes
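The MDP robustness set referred to in the bullets above can be sketched as follows. This is a reconstruction from the bullet text, not the paper's exact definition, so the precise conditions should be checked against the original. A test MDP M' lies in the robustness set of the training MDP M at tolerance ε if the optimal policy of M' is ε-optimal when executed on M and induces the same trajectory distribution in both MDPs:

    S_{M,\varepsilon} \;=\; \big\{\, M' \;:\; R_{M}(\pi^{*}_{M'}) \ge R^{*}_{M} - \varepsilon \ \text{ and } \ p_{M}(\tau \mid \pi^{*}_{M'}) = p_{M'}(\tau \mid \pi^{*}_{M'}) \,\big\}

Under this reading, a collection of near-optimal but mutually distinguishable policies on M has a better chance of containing a policy that remains near-optimal on such an M', which is what the few-shot selection in the Results section exploits.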
Results
  • In their experiments (Section 7), the authors empirically measure SMERL's robustness when optimizing the SMERL objective on practical problems whose test perturbations satisfy the conditions identified in the theoretical analysis.
  • The goal of the experimental evaluation is to test the central hypothesis of the work: does structured diversity-driven learning lead to policies that generalize to new MDPs?
  • When the perturbation magnitude is small, the optimal policies of the test MDPs achieve near-optimal return when executed on the training MDP and yield the same trajectories when executed on the train and test MDPs, satisfying the definition of the MDP robustness set to which the authors can expect SMERL to be robust.
  • Given that SMERL learns distinguishable and diverse policies in simple environments, the authors study whether these policies are robust to various test conditions in more challenging continuous-control problems.
  • The authors compare SMERL to standard maximum-entropy RL (SAC); to DIAYN, an approach that learns multiple diverse policies but does not maximize a reward signal from the environment; to a naive combination of the two (SAC+DIAYN); and to Robust Adversarial Reinforcement Learning (RARL), a robust RL method.
  • By comparing to SAC and DIAYN, the authors aim to test how important learning diverse policies and ensuring near-optimal return each are for achieving robustness.
  • DIAYN learns multiple diverse policies, but since it is trained independently of task reward, it only occasionally solves the task and otherwise produces policies that perform structured diverse behavior but do not achieve near-optimal return.
  • SMERL balances the task reward and DIAYN reward to achieve few-shot robustness, since it only adds the DIAYN reward when the latent policies are near-optimal.
  • For a more complete set of results on how SMERL selects policies on the test environments for HalfCheetah-Goal, Walker-Velocity, and Hopper-Velocity, see Appendix B.4; a sketch of the few-shot selection procedure follows this list.
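The few-shot selection step can be illustrated with a minimal sketch. It assumes a Gym-style environment interface (reset/step) and a hypothetical policy.act(obs, z) method for the latent-conditioned policy; the rule itself, running each latent-conditioned policy for a small number of trials on the test MDP and keeping the one with the highest return, follows the description in these bullets rather than the paper's exact pseudocode.

    import numpy as np

    def run_episode(test_env, policy, z):
        # Roll out one episode of the latent-conditioned policy pi(a | s, z)
        # on the (possibly perturbed) test MDP and accumulate its return.
        obs = test_env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = policy.act(obs, z)  # hypothetical policy interface
            obs, reward, done, _ = test_env.step(action)
            episode_return += reward
        return episode_return

    def few_shot_select(test_env, policy, latent_codes, trials_per_z=1):
        # Few-shot adaptation: execute each latent code for a few trials on the
        # test MDP and keep the code whose policy achieves the best mean return.
        mean_returns = [
            np.mean([run_episode(test_env, policy, z) for _ in range(trials_per_z)])
            for z in latent_codes
        ]
        best = int(np.argmax(mean_returns))
        return latent_codes[best], mean_returns[best]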
Conclusion
  • The authors present a robust RL algorithm, SMERL, for learning RL policies that can extrapolate to out-of-distribution test conditions with only a small number of trials.
  • Like the deep RL algorithms it builds on, the authors' diversity-driven learning paradigm can suffer from instability: different latent-conditioned policies may not produce reliable behavior when executed in real-world settings if the underlying RL algorithm is unstable.
  • The authors leave this as an exciting direction for future work.
Tables
  • Table 1: SMERL policy performance and selection on HalfCheetah-Goal+Force test environments
  • Table 2: Hyperparameters used for SAC and SMERL for the 2D navigation experiment
  • Table 3: Hyperparameters used for SAC, DIAYN, SAC+DIAYN, and SMERL for continuous control experiments
  • Table 4: SMERL policy performance and selection on HalfCheetah-Goal+Obstacle test environments
  • Table 5
  • Table 6: SMERL policy performance and selection on WalkerVelocity+Obstacle test environments
  • Table 7: SMERL policy performance and selection on WalkerVelocity+Force test environments
  • Table 8: SMERL policy performance and selection on HopperVelocity-Goal+Obstacle test environments
  • Table 9: SMERL policy performance and selection on HopperVelocity-Goal+Force test environments
Related work
  • Our work is at the intersection of robust reinforcement learning methods and reinforcement learning methods that promote generalization, both of which we review here. Robustness is a long-studied topic in control and reinforcement learning [46, 28, 43] in fields such as robust control, Bayesian reinforcement learning, and risk-sensitive RL [4, 2]. Works in these areas typically focus on linear systems or finite MDPs, while we aim to study high-dimensional continuous control tasks with complex non-linear dynamics. Recent works have aimed to bring this rich body of work to modern deep reinforcement learning algorithms by using ensembles of models [32, 19], distributions over critics [38, 1], or surrogate reward estimation [42] to represent and reason about uncertainty. These methods assume that the conditions encountered during training are representative of those during testing, an assumption also common in works that study generalization in reinforcement learning [3, 17] and domain randomization [33, 40]. We instead focus specifically on extrapolation, and develop an algorithm that generalizes to new, out-of-distribution dynamics after training in a single MDP.
Funding
  • Saurabh Kumar is supported by an NSF Graduate Research Fellowship and the Stanford Knight Hennessy Fellowship
  • Aviral Kumar is supported by the DARPA Assured Autonomy Program
References
  • Cristian Bodnar, Adrian Li, Karol Hausman, Peter Pastor, and Mrinal Kalakrishnan. Quantile qt-opt for risk-aware vision-based robotic grasping. arXiv preprint arXiv:1910.02787, 2019.
  • Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.
  • Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.
  • Erick Delage and Shie Mannor. Percentile optimization for markov decision processes with parameter uncertainty. Operations research, 58(1):203–213, 2010.
  • Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
  • Rasool Fakoor, Pratik Chaudhari, Stefano Soatto, and Alexander J Smola. Meta-q-learning. arXiv preprint arXiv:1910.00125, 2019.
  • Jesse Farebrother, Marlos C Machado, and Michael Bowling. Generalization and regularization in dqn. arXiv preprint arXiv:1810.00123, 2018.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
  • Adam Gleave, Michael Dennis, Neel Kant, Cody Wild, Sergey Levine, and Stuart Russell. Adversarial policies: Attacking deep reinforcement learning. arXiv preprint arXiv:1905.10615, 2019.
  • Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
  • Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3389–3396. IEEE, 2017.
  • Abhishek Gupta, Benjamin Eysenbach, Chelsea Finn, and Sergey Levine. Unsupervised meta-learning for reinforcement learning. arXiv preprint arXiv:1806.04640, 2018.
  • Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems, pages 5302–5311, 2018.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
  • Maximilian Igl, Kamil Ciosek, Yingzhen Li, Sebastian Tschiatschek, Cheng Zhang, Sam Devlin, and Katja Hofmann. Generalization in reinforcement learning with selective noise injection and information bottleneck. In Advances in Neural Information Processing Systems, pages 13956–13968, 2019.
  • Allan Jabri, Kyle Hsu, Abhishek Gupta, Ben Eysenbach, Sergey Levine, and Chelsea Finn. Unsupervised curricula for visual meta-reinforcement learning. In Advances in Neural Information Processing Systems, pages 10519–10530, 2019.
  • Gregory Kahn, Adam Villaflor, Vitchyr Pong, Pieter Abbeel, and Sergey Levine. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182, 2017.
  • Louis Kirsch, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. arXiv preprint arXiv:1910.04098, 2019.
  • Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • Shihui Li, Yi Wu, Xinyue Cui, Honghua Dong, Fei Fang, and Stuart Russell. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4213–4220, 2019.
  • Yunzhu Li, Jiaming Song, and Stefano Ermon. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pages 3812–3822, 2017.
  • Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748, 2017.
  • Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. arXiv preprint arXiv:1811.11711, 2018.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. International Conference on Learning Representations (ICLR), 2019.
  • Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
  • Xinlei Pan, Daniel Seita, Yang Gao, and John Canny. Risk averse robust adversarial reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8522–8528. IEEE, 2019.
  • Anay Pattanaik, Zhenyi Tang, Shuijing Liu, Gautham Bommannan, and Girish Chowdhary. Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2040–2042. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2817–2826. JMLR. org, 2017.
  • Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. Epopt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
  • Fereshteh Sadeghi and Sergey Levine. Cad2rl: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201, 2016.
  • S Schulze, S Whiteson, L Zintgraf, M Igl, Yarin Gal, K Shiarlis, and K Hofmann. Varibad: a very good method for bayes-adaptive deep rl via meta-learning. International Conference on Learning Representations.
  • Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Policy gradients with parameter-based exploration for control. In International Conference on Artificial Neural Networks, pages 387–396.
  • Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019.
  • David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
  • Yichuan Charlie Tang, Jian Zhang, and Ruslan Salakhutdinov. Worst cases policy gradients. arXiv preprint arXiv:1911.03618, 2019.
  • Chen Tessler, Yonathan Efroni, and Shie Mannor. Action robust reinforcement learning and applications in continuous control. arXiv preprint arXiv:1901.09184, 2019.
  • Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
  • Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  • Jingkang Wang, Yang Liu, and Bo Li. Reinforcement learning with perturbed rewards. arXiv preprint arXiv:1810.01032, 2018.
  • Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
  • Wenhao Yu, Jie Tan, C Karen Liu, and Greg Turk. Preparing for the unknown: Learning a universal policy with online system identification. arXiv preprint arXiv:1702.02453, 2017.
  • Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.
  • Kemin Zhou and John Comstock Doyle. Essentials of robust control, volume 104. Prentice hall Upper Saddle River, NJ, 1998.