This paper uses meta-gradients to perform soft-constrained Reinforcement Learning (RL) optimization
Balancing Constraints and Rewards with Meta-Gradient D4PG
Deploying Reinforcement Learning (RL) agents to solve real-world applications often requires satisfying complex system constraints. Often the constraint thresholds are incorrectly set due to the complex nature of a system or the inability to verify the thresholds offline (e.g., no simulator or reasonable offline evaluation procedure exists)...
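The Lagrangian relaxation that this family of constrained-RL methods builds on can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: `rewards`, `costs`, `threshold`, and `lr_lambda` are placeholder names, and D4PG's distributional critic and actor updates are omitted entirely.

```python
import numpy as np

def lagrangian_updates(rewards, costs, threshold, lam=0.0, lr_lambda=0.01):
    """One projected gradient-ascent step on the Lagrange multiplier.

    The policy maximizes the penalized reward r - lam * c; lam itself
    increases when the average constraint cost exceeds the threshold
    and decreases (toward zero) otherwise.
    """
    penalized = rewards - lam * costs           # reward the agent actually optimizes
    violation = np.mean(costs) - threshold      # > 0 means the constraint is violated
    lam = max(0.0, lam + lr_lambda * violation) # projection keeps lam >= 0
    return penalized, lam
```

This multiplier-ascent mechanism is the RCPO-style scheme that the paper's RC-D4PG baseline adapts; with a mis-set (unsatisfiable) threshold the multiplier keeps growing, which is the failure mode the soft-constrained approach targets.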
- Reinforcement Learning (RL) algorithms typically try to maximize an expected return objective (Sutton & Barto, 2018).
- Formulating real-world problems with only an expected return objective is often sub-optimal when tackling many applied problems, ranging from recommendation systems to physical control systems such as robots, self-driving cars, and even aerospace technologies.
- In many of these domains there are a variety of challenges preventing RL from being utilized as the algorithmic solution framework.
- The function f : R^k → R^d is the gradient of the policy and/or value function with respect to the parameters θ, and is a function of an n-step trajectory.
- Our main contributions are as follows: (1) We extend D4PG to handle constraints by adapting it to Reward Constrained Policy Optimization (RCPO) (Tessler et al., 2018), yielding Reward Constrained D4PG (RC-D4PG); (2) We present a soft-constrained meta-gradient technique: Meta-Gradients for the Lagrange multiplier learning rate (MetaL); (3) We derive the meta-gradient update for MetaL (Theorem 1); (4) We perform extensive experiments and investigative studies to showcase the properties of this algorithm.
- We presented a soft-constrained RL technique called MetaL that combines meta-gradients and constrained RL to find a good trade-off between minimizing constraint violations and maximizing returns.
- The algorithm, the derived meta-gradient update, and a comparison to MetaL can be found in the Appendix, Section B.
- We show that across safety coefficients, domains and constraint thresholds, MetaL outperforms all of the baseline algorithms
- We derive the meta-gradient updates for MetaL and perform an investigative study where we provide empirical intuition for the derived gradient update that helps explain this meta-gradient variant’s performance
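Contribution (2), adapting the Lagrange multiplier's learning rate by a meta-gradient, can be illustrated with a toy single-step sketch. This is not the derived update of Theorem 1: `d_outer_d_lambda` stands in for the gradient of the outer (critic) loss with respect to the updated multiplier, which in the paper comes from differentiating through the critic.

```python
def metal_step(lam, eta, violation, d_outer_d_lambda, lr_eta=1e-3):
    """One MetaL-style step (toy sketch, not the paper's exact update).

    Inner step: the Lagrange multiplier moves by eta * violation.
    Outer step: eta is adapted via the chain rule
        d(outer loss)/d(eta) = d(outer loss)/d(lam') * d(lam')/d(eta),
    where d(lam')/d(eta) = violation for this inner update.
    """
    lam_new = max(0.0, lam + eta * violation)   # inner: multiplier ascent with meta-learned rate
    meta_grad = d_outer_d_lambda * violation    # chain rule through the inner step
    eta_new = eta - lr_eta * meta_grad          # outer: descend the meta-gradient
    return lam_new, eta_new
```

The point of the sketch is the bilevel structure: the multiplier update is treated as a differentiable function of its learning rate, so the outer loss can shape how aggressively constraint violations are penalized.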
- The experiments were performed using domains from the Real-World Reinforcement Learning (RWRL) suite, namely cartpole:swingup, walker:walk, quadruped:walk and humanoid:walk.
- The authors will refer to these domains as cartpole, walker, quadruped and humanoid from here on.
- Unsolvable constraint tasks correspond to tasks where the constraint thresholds are incorrectly set and cannot be satisfied, situations which occur in many real-world problems as motivated in the introduction.
- The goal is to showcase the soft-constrained performance of MetaL, with respect to reducing constraint violations and maximizing the return in both of these scenarios with respect to the baselines
- The authors begin by analyzing the performance of the best variant, MetaL, with different outer losses.
- MetaL outer loss: The authors wanted to determine whether different outer losses would result in improved overall performance.
- The authors used the actor loss (L_actor) and the combination of the actor and critic losses (L_actor + L_critic) as the outer loss, and compared them with the original MetaL outer loss (L_critic) as well as the other baselines.
- The best performance is achieved by the original critic-only MetaL outer loss
- The authors presented a soft-constrained RL technique called MetaL that combines meta-gradients and constrained RL to find a good trade-off between minimizing constraint violations and maximizing returns.
- The authors implemented a meta-gradient approach called MeSh that scales and offsets the shaped rewards
- This approach did not outperform MetaL but is a direction of future work.
- The authors believe the proposed techniques will generalize to other policy gradient algorithms but leave this for future work
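The MeSh variant mentioned above can be caricatured in one line. This is a sketch only, assuming "scales and offsets the shaped rewards" means applying a meta-learned scale and offset to the Lagrangian-penalized reward; the meta-gradient updates to `scale` and `offset` themselves are not shown.

```python
def mesh_shaped_reward(r, c, lam, scale, offset):
    """MeSh-style shaping (sketch): a meta-learned affine transform
    applied to the Lagrangian-penalized reward r - lam * c."""
    return scale * (r - lam * c) + offset
```

With `scale = 1.0` and `offset = 0.0` this reduces to the plain penalized reward, so the meta-learned parameters only change behavior where the outer loss pushes them away from the identity.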
- Table 1: Overall performance across domains, safety coefficients and thresholds
- MetaL’s penalized reward (R_penalized) performance is significantly better than the baselines, with all p-values smaller than 10⁻⁹ using Welch’s t-test
- Performance per domain: When analyzing the performance per domain, averaging across safety coefficients and constraint thresholds, we found that MetaL has significantly better penalized return compared to D4PG and RC-D4PG across the domains
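The Welch's t-test behind the significance claims above does not assume equal variances across methods. A small self-contained version is given below for illustration; in practice `scipy.stats.ttest_ind(..., equal_var=False)` computes this and also returns the p-value. The sample arrays in the test are made up, not the paper's data.

```python
import math

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples with possibly unequal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)  # sample variance of b
    se2 = va / na + vb / nb                        # squared standard error of the difference
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

The statistic and degrees of freedom would then be fed into a Student-t survival function to obtain the p-values reported in the comparison.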
- Abbas Abdolmaleki, Jost Tobias Springenberg, Jonas Degrave, Steven Bohez, Yuval Tassa, Dan Belov, Nicolas Heess, and Martin A. Riedmiller. Relative entropy regularized policy iteration. CoRR, abs/1812.02256, 2018.
- Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 22–31. JMLR.org, 2017.
- Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
- Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
- Marc G Bellemare, Will Dabney, and Remi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 449–458. JMLR.org, 2017.
- Steven Bohez, Abbas Abdolmaleki, Michael Neunert, Jonas Buchli, Nicolas Heess, and Raia Hadsell. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623, 2019.
- Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
- Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning, 2018.
- Gabriel Dulac-Arnold, Daniel J. Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. CoRR, abs/1904.12901, 2019.
- Gabriel Dulac-Arnold, Nir Levine, Daniel J Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. An empirical investigation of the challenges of real-world reinforcement learning. arXiv preprint arXiv:2003.11881, 2020a.
- Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps, 2020.
- Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization, 2017.
- Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, pp. 7555–7565, 2019.
- Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 2019.
- Harsh Satija, Philip Amortila, and Joelle Pineau. Constrained Markov decision processes via backward value functions. arXiv preprint arXiv:2008.11811, 2020.
- David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550, 2017.
- Ankur Sinha, Pekka Malo, and Kalyanmoy Deb. A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2):276–295, 2017.