Unified Policy Optimization for Robust Reinforcement Learning

ACML, pp. 395-410, 2019.


Abstract:

Recent years have witnessed significant progress in solving challenging problems across various domains using deep reinforcement learning (RL). Despite this success, weak robustness has emerged as a major obstacle to applying existing RL algorithms to real problems. In this paper, we propose unified policy optimization (UPO), a sample-efficient shared-policy framework that generates multiple gradients from a batch of shared experiences. …

Introduction
  • Deep reinforcement learning (RL) methods have shown tremendous success in learning complex skills for agents and solving challenging control tasks in high-dimensional raw sensory state spaces (Mnih et al., 2015; Schulman et al., 2015).
  • The great potential of deep RL cannot conceal a big concern: its weak robustness, which has been a major hurdle for applying deep RL algorithms to real-world problems.
  • Such weak robustness usually manifests as inconsistent behavior of existing RL algorithms under different hyper-parameters, environments, and random initializations (Henderson et al., 2017).
  • RL aims to learn a policy for an agent facing a sequential decision-making problem by interacting with the environment.
Highlights
  • Deep reinforcement learning (RL) methods have shown tremendous success in learning complex skills for agents and solving challenging control tasks in high-dimensional raw sensory state spaces (Mnih et al., 2015; Schulman et al., 2015).
  • We propose UPO-MAB (unified policy optimization with a multi-armed bandit) and UPO-ES (unified policy optimization with evolution strategies) to increase robustness across different tasks.
  • We aim to address the robustness of policy optimization methods across different tasks.
  • We propose a sample-efficient shared-policy framework, called unified policy optimization (UPO), which generates many gradients from a single batch of shared experiences (a minimal sketch follows this list).
  • Building on this framework, we further propose two algorithms, UPO-MAB and UPO-ES.
  • Experiments show that our methods achieve strong robustness across different tasks.
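A minimal sketch of the shared-experience idea, assuming hypothetical helpers (collect_batch, make_grad_fn, average_update) that stand in for the rollout and for the TRPO/PPO/ACKTR gradient estimators; this is not the authors' implementation:

    # Sketch of the UPO shared-policy loop: one batch of shared experiences,
    # many candidate gradients computed from that same batch.
    import numpy as np

    rng = np.random.default_rng(0)

    def collect_batch(theta, batch_size=2048):
        # Placeholder for rolling out the shared policy; returns fake transitions.
        return rng.normal(size=(batch_size, theta.size))

    def make_grad_fn(scale):
        # Stand-in for one policy-gradient estimator (e.g. TRPO, PPO, or ACKTR).
        def grad_fn(theta, batch):
            return scale * batch.mean(axis=0)
        return grad_fn

    def upo_step(theta, grad_fns, select_update):
        batch = collect_batch(theta)                 # one shared batch
        grads = [g(theta, batch) for g in grad_fns]  # many gradients, same data
        return select_update(theta, grads)           # UPO-MAB / UPO-ES decides how to use them

    grad_fns = [make_grad_fn(s) for s in (1.0, 0.5, 0.1)]  # "TRPO", "PPO", "ACKTR" stand-ins
    average_update = lambda theta, grads: theta + 0.01 * np.mean(grads, axis=0)  # gradient ascent
    theta = rng.normal(size=8)
    theta = upo_step(theta, grad_fns, average_update)

The key point is that every candidate gradient is computed from the same batch of experiences, which is what makes the framework sample-efficient.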
Methods
  • The authors conduct robotic locomotion experiments using the MuJoCo simulator (Todorov et al., 2012).
  • The authors use three (n = 3) PG algorithms (TRPO, PPO, and ACKTR) within the UPO framework.
  • [Figure: episode-reward learning curves on Walker2d, Reacher, Hopper, and InvertedPendulum, comparing TRPO, PPO, ACKTR, UPO-ES, and UPO-MAB.]
  • Batch size is set to 2048 for all experiments.
  • The authors conduct the hyper-parameter search in the HalfCheetah and Walker2d environments.
  • For UPO-MAB, the authors tune the hyper-parameter α over {0.01, 0.1, 0.2, 0.3, 0.4} (a sketch of this search follows the list).
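A hypothetical sketch of this hyper-parameter search, where train_upo_mab is an assumed training routine that returns an average episode reward; only the grid of α values and the two tuning environments come from the paper:

    # Hypothetical grid search over the UPO-MAB hyper-parameter alpha.
    def search_alpha(train_upo_mab,
                     envs=("HalfCheetah", "Walker2d"),
                     alphas=(0.01, 0.1, 0.2, 0.3, 0.4)):
        scores = {}
        for alpha in alphas:
            # Average the final episode reward over the tuning environments.
            scores[alpha] = sum(train_upo_mab(env, alpha=alpha) for env in envs) / len(envs)
        best = max(scores, key=scores.get)  # alpha with the highest average score
        return best, scores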
Conclusion
  • The state of the meta-policy πm is defined as the parameters of the current policy, its action is defined as the selection among multiple PG methods, and the reward of each action is defined as the performance gain of the newly obtained policy over the old one (a rough sketch follows this list).
  • Such a meta-RL learning process requires a large number of training episodes to learn the meta-policy, which is quite inefficient.
  • More research can be done on how to 1) incorporate more PG methods into UPO, and 2) use a model to roll out trajectories for performance estimation.
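As a rough illustration of this bandit formulation (not the authors' UPO-MAB algorithm), the sketch below selects among candidate PG updates with an ε-greedy rule and uses the performance gain of the new policy over the old one as the bandit reward; evaluate, candidate_updates, and the toy usage are assumptions:

    # Bandit over PG methods: action = which PG update to apply,
    # reward = performance gain of the new policy over the old one.
    import numpy as np

    def mab_select_update(theta, candidate_updates, evaluate, q,
                          alpha=0.1, eps=0.1, rng=np.random.default_rng(0)):
        old_score = evaluate(theta)
        if rng.random() < eps:                      # explore a random PG method
            a = int(rng.integers(len(candidate_updates)))
        else:                                       # exploit the best-looking one
            a = int(np.argmax(q))
        new_theta = candidate_updates[a](theta)     # apply the chosen PG update
        reward = evaluate(new_theta) - old_score    # performance gain as bandit reward
        q[a] += alpha * (reward - q[a])             # running value estimate per method
        return new_theta, q

    # Toy usage with dummy update functions and a dummy evaluator.
    q = np.zeros(3)
    theta = np.zeros(4)
    updates = [lambda t, d=d: t + d for d in (0.1, -0.1, 0.05)]
    evaluate = lambda t: -np.sum((t - 1.0) ** 2)    # toy "episode return"
    theta, q = mab_select_update(theta, updates, evaluate, q)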
Tables
  • Table 1: Average score results over 10 random seeds. UPO-rand denotes UPO-random, which randomly picks one algorithm to train the shared policy at each training step; UPO-avg denotes UPO-average, which updates the shared policy with the average of all gradients (both baselines are sketched below). A higher average normalized score indicates better robustness.
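A sketch of the two baseline update rules named in the table, under the assumption that grads holds the per-algorithm gradients computed from the shared batch and lr is a generic step size (the real step sizes are algorithm-specific):

    # Baseline update rules from Table 1 (sketch; assumes a common step size lr).
    import numpy as np

    def upo_random_update(theta, grads, lr, rng=np.random.default_rng(0)):
        # UPO-rand: pick one algorithm's gradient at random each training step.
        g = grads[rng.integers(len(grads))]
        return theta + lr * g

    def upo_average_update(theta, grads, lr):
        # UPO-avg: update the shared policy with the average of all gradients.
        return theta + lr * np.mean(grads, axis=0)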
Related work
  • Previous works have considered the robustness of RL. Pattanaik et al. (2018) improve robustness by injecting adversarial attacks. Tretschk et al. (2018) add sequential attacks on agents for long-term adversarial goals. Behzadan and Munir (2018) use parameter noise to mitigate policy manipulation attacks. Havens et al. (2018) consider online robust policy learning in the presence of unknown adversaries. Schulman et al. (2015, 2017) address the robustness of policy learning by controlling the step size. Liu et al. (2017) address the robustness of policy initialization. Laroche and Feraud (2017) consider training RL algorithms with different hyper-parameters so as to obtain a robust policy without hyper-parameter tuning. None of these works considers the weak robustness across different tasks. In our paper, we propose UPO-MAB and UPO-ES to increase robustness across different tasks.
References
  • Vahid Behzadan and Arslan Munir. Mitigation of policy manipulation attacks on deep Q-networks with parameter-space noise. arXiv preprint arXiv:1806.02190, 2018.
  • Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pages 199–207, 2014.
  • Marie-Liesse Cauwet, Jialin Liu, Baptiste Roziere, and Olivier Teytaud. Algorithm portfolios for noisy optimization. Annals of Mathematics and Artificial Intelligence, 76(1-2):143–172, 2016.
  • Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.
  • Thomas G. Dietterich. Ensemble learning. 2002.
  • Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
  • Shixiang Shane Gu, Timothy Lillicrap, Richard E. Turner, Zoubin Ghahramani, Bernhard Scholkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3846–3855, 2017.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • Aaron J. Havens, Zhanhong Jiang, and Soumik Sarkar. Online robust policy learning in the presence of unknown adversaries. arXiv preprint arXiv:1807.06064, 2018.
  • Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017.
  • Rein Houthooft, Yuhua Chen, Phillip Isola, Bradly Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. In Advances in Neural Information Processing Systems, pages 5405–5414, 2018.
  • Shauharda Khadka and Kagan Tumer. Evolutionary reinforcement learning. arXiv preprint arXiv:1805.07917, 2018.
  • Romain Laroche and Raphael Feraud. Reinforcement learning algorithm selection. arXiv preprint arXiv:1701.08810, 2017.
  • Siyuan Li and Chongjie Zhang. An optimal online method of selecting source policies for reinforcement learning. arXiv preprint arXiv:1709.08201, 2017.
  • Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Zichuan Lin, Tianqi Zhao, Guangwen Yang, and Lintao Zhang. Episodic memory deep Q-networks. arXiv preprint arXiv:1805.07603, 2018.
  • Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.
  • Anay Pattanaik, Zhenyi Tang, Shuijing Liu, Gautham Bommannan, and Girish Chowdhary. Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2040–2042. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
  • John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033. IEEE, 2012.
  • Edgar Tretschk, Seong Joon Oh, and Mario Fritz. Sequential attacks on agents for long-term adversarial goals. arXiv preprint arXiv:1805.12487, 2018.
  • Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016a.
  • Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016b.
  • Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pages 5285–5294, 2017.
  • Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. Chapman and Hall/CRC, 2012.