Balancing Constraints and Rewards with Meta-Gradient D4PG

ICLR, 2021

Summary: This paper uses meta-gradients to perform soft-constrained Reinforcement Learning (RL) optimization.


Abstract

Deploying Reinforcement Learning (RL) agents to solve real-world applications often requires satisfying complex system constraints. Often the constraint thresholds are incorrectly set due to the complex nature of a system or the inability to verify the thresholds offline (e.g., no simulator or reasonable offline evaluation procedure exists ...

Introduction
  • Reinforcement Learning (RL) algorithms typically try to maximize an expected return objective (Sutton & Barto, 2018).
  • Formulating real-world problems with only an expected return objective is often sub-optimal for many applied problems, ranging from recommendation systems to physical control systems such as robots, self-driving cars, and even aerospace technologies.
  • In many of these domains there are a variety of challenges preventing RL from being utilized as the algorithmic solution framework.
  • The function f : R^k → R^d is the gradient of the policy and/or value function with respect to the parameters θ and is a function of an n-step trajectory (see the sketch below).
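The inner/outer structure behind this update can be made concrete with a small, self-contained example. The sketch below is a toy illustration, not the paper's implementation: an inner gradient step f moves the parameters, and a meta-gradient adapts the step size η by differentiating an outer loss through that inner update. The quadratic objective and all variable names are illustrative assumptions.

```python
# Toy meta-gradient sketch: theta is updated by a gradient step whose step size
# eta is itself learned by differentiating an outer loss through the inner update.

def inner_grad(theta, target=3.0):
    # Gradient of the inner loss L(theta) = 0.5 * (theta - target)^2.
    return theta - target

def meta_train(theta=0.0, eta=0.01, meta_lr=0.01, steps=200, target=3.0):
    for _ in range(steps):
        g = inner_grad(theta, target)
        theta_new = theta - eta * g              # inner update: theta' = theta + f(...)
        # Outer loss L_outer(theta') = 0.5 * (theta' - target)^2, so by the chain rule
        # dL_outer/d_eta = (theta' - target) * d theta'/d eta = (theta' - target) * (-g).
        meta_grad = (theta_new - target) * (-g)
        eta = max(1e-4, eta - meta_lr * meta_grad)  # meta step on the learning rate
        theta = theta_new
    return theta, eta

if __name__ == "__main__":
    theta, eta = meta_train()
    print(f"theta ~ {theta:.3f}, learned eta ~ {eta:.3f}")
```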
Highlights
  • Reinforcement Learning (RL) algorithms typically try to maximize an expected return objective (Sutton & Barto, 2018).
  • Our main contributions are as follows: (1) We extend D4PG to handle constraints by adapting it to Reward Constrained Policy Optimization (RCPO) (Tessler et al., 2018), yielding Reward Constrained D4PG (RC-D4PG); (2) We present a soft-constrained meta-gradient technique: Meta-Gradients for the Lagrange multiplier learning rate (MetaL), sketched after this list; (3) We derive the meta-gradient update for MetaL (Theorem 1); (4) We perform extensive experiments and investigative studies to showcase the properties of this algorithm.
  • We presented a soft-constrained RL technique called MetaL that combines meta-gradients and constrained RL to find a good trade-off between minimizing constraint violations and maximizing returns.
  • The MeSh algorithm, its derived meta-gradient update, and a comparison to MetaL can be found in the Appendix, Section B.
  • We show that across safety coefficients, domains and constraint thresholds, MetaL outperforms all of the baseline algorithms.
  • We derive the meta-gradient updates for MetaL and perform an investigative study where we provide empirical intuition for the derived gradient update that helps explain this meta-gradient variant's performance.
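The core MetaL recipe can be illustrated on a toy constrained problem. The sketch below is a minimal stand-in, not the paper's D4PG implementation: a Lagrangian (RCPO-style) update penalizes constraint violations, and the learning rate of the Lagrange multiplier is adapted by differentiating an outer objective through the multiplier update. The quadratic reward, linear constraint, choice of outer loss, and step sizes are all illustrative assumptions.

```python
# Toy sketch of the MetaL idea: Lagrangian (RCPO-style) constrained optimization
# in which the learning rate of the Lagrange multiplier is itself adapted by a
# meta-gradient.
#
# Problem: maximize R(theta) = -(theta - 2)^2  subject to  C(theta) = theta <= 1.
# The constrained optimum is theta = 1 with multiplier lambda = 2.

def run(theta=0.0, lam=0.0, eta_theta=0.05, alpha=0.05, eta_meta=0.01,
        threshold=1.0, steps=2000):
    for _ in range(steps):
        # Inner (policy) step: ascend the Lagrangian R(theta) - lam * (C(theta) - threshold).
        theta_new = theta + eta_theta * (-2.0 * (theta - 2.0) - lam)

        # Multiplier step with learning rate alpha (RCPO-style).
        violation = theta_new - threshold
        pre_proj = lam + alpha * violation
        lam_new = max(0.0, pre_proj)

        # One more policy step under the updated multiplier.
        theta_next = theta_new + eta_theta * (-2.0 * (theta_new - 2.0) - lam_new)

        # Meta-gradient on alpha: differentiate an outer objective -- here the
        # negative penalized return after the update, with lam_new treated as a
        # constant (an illustrative stand-in for the critic outer loss used by
        # MetaL) -- through the chain alpha -> lam_new -> theta_next.
        d_outer_d_theta = 2.0 * (theta_next - 2.0) + lam_new
        d_theta_d_lam = -eta_theta
        d_lam_d_alpha = violation if pre_proj > 0.0 else 0.0
        meta_grad = d_outer_d_theta * d_theta_d_lam * d_lam_d_alpha
        alpha = min(max(alpha - eta_meta * meta_grad, 1e-3), 1.0)

        theta, lam = theta_next, lam_new
    return theta, lam, alpha

if __name__ == "__main__":
    theta, lam, alpha = run()
    print(f"theta ~ {theta:.3f} (constraint boundary 1.0), "
          f"lambda ~ {lam:.3f}, adapted multiplier learning rate ~ {alpha:.4f}")
```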
Methods
  • The experiments were performed using domains from the Real-World Reinforcement Learning (RWRL) suite, namely cartpole:swingup, walker:walk, quadruped:walk and humanoid:walk.
  • The authors refer to these domains as cartpole, walker, quadruped and humanoid from here on.
  • Unsolvable constraint tasks correspond to tasks where the constraint thresholds are incorrectly set and cannot be satisfied, situations which occur in many real-world problems as motivated in the introduction.
  • The goal is to showcase the soft-constrained performance of MetaL, in terms of reducing constraint violations and maximizing return, in both of these scenarios relative to the baselines (see the sketch below).
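The experimental setup above amounts to a sweep over domains, safety coefficients and constraint thresholds, with some thresholds deliberately chosen to be unsatisfiable. A rough sketch of that grid follows; the coefficient values, the threshold labels and the evaluate() stub are placeholders, not the paper's settings.

```python
# Sketch of the evaluation grid: RWRL domains swept over safety coefficients and
# constraint thresholds, including "unsolvable" thresholds that cannot be met.
from itertools import product

DOMAINS = ["cartpole", "walker", "quadruped", "humanoid"]
SAFETY_COEFFS = [0.1, 0.3, 0.5]                      # placeholder values
THRESHOLDS = {"solvable": 0.3, "unsolvable": 0.0}    # placeholder threshold choices

def evaluate(domain, safety_coeff, threshold):
    """Placeholder: train/evaluate an agent and report return and violation rate."""
    return {"return": 0.0, "violation_rate": 0.0}

results = {
    (domain, coeff, kind): evaluate(domain, coeff, thresh)
    for domain, coeff, (kind, thresh) in product(DOMAINS, SAFETY_COEFFS, THRESHOLDS.items())
}
```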
Results
  • The authors begin by analyzing the performance of the best variant, MetaL, with different outer losses.
  • MetaL outer loss: The authors wanted to determine whether different outer losses would result in improved overall performance.
  • The authors used the actor loss (L_actor) and the combination of the actor and critic losses (L_actor + L_critic) as the outer loss and compared them with the original MetaL outer loss (L_critic) as well as the other baselines.
  • The best performance is achieved by the original critic-only MetaL outer loss (see the sketch below).
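The ablation above varies only the outer objective that the meta-gradient differentiates; the rest of the MetaL machinery is unchanged. A minimal sketch of the three choices, with dummy callables standing in for D4PG's critic and actor losses (assumed names, not the paper's code):

```python
# Outer-loss variants for the MetaL meta-gradient step. critic_loss and actor_loss
# are dummy stand-ins for D4PG's distributional critic loss and policy (actor) loss.

def critic_loss(batch, params):
    return 0.0  # stand-in

def actor_loss(batch, params):
    return 0.0  # stand-in

OUTER_LOSSES = {
    "critic (original MetaL)": lambda b, p: critic_loss(b, p),
    "actor": lambda b, p: actor_loss(b, p),
    "actor + critic": lambda b, p: actor_loss(b, p) + critic_loss(b, p),
}
```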
Conclusion
  • The authors presented a soft-constrained RL technique called MetaL that combines meta-gradients and constrained RL to find a good trade-off between minimizing constraint violations and maximizing returns.
  • The authors also implemented a meta-gradient approach called MeSh that scales and offsets the shaped rewards (see the sketch after this list).
  • This approach did not outperform MetaL but is a direction of future work.
  • The authors believe the proposed techniques will generalize to other policy gradient algorithms but leave this for future work.
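As described above, MeSh applies a learned scale and offset to the shaped (penalized) reward rather than adapting the multiplier's learning rate. Below is a minimal sketch of that reward transformation only, with the meta-update of the scale and offset omitted; the function names and exact form are illustrative assumptions.

```python
# Sketch of the MeSh reward transformation: an RCPO-style shaped reward is scaled
# and offset by meta-parameters (which would themselves be meta-learned via an
# outer loss, not shown here).

def shaped_reward(r, c, lam):
    # RCPO-style penalized reward: task reward minus multiplier-weighted constraint cost.
    return r - lam * c

def mesh_reward(r, c, lam, scale, offset):
    # MeSh-style reward: scale and shift the shaped reward.
    return scale * shaped_reward(r, c, lam) + offset
```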
Tables
  • Table 1: Overall performance across domains, safety coefficients and thresholds
Funding
  • MetaL's penalized reward (R_penalized) performance is significantly better than the baselines, with all p-values smaller than 10^-9 using Welch's t-test (see the example after this list).
  • Performance per domain: When analyzing the performance per domain, averaging across safety coefficients and constraint thresholds, the authors found that MetaL has significantly better penalized return compared to D4PG and RC-D4PG across the domains.
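For reference, Welch's t-test (the unequal-variance two-sample t-test cited above) can be computed with scipy; the per-run returns below are made-up placeholder numbers, not the paper's data.

```python
# Welch's t-test comparing per-run penalized returns of two algorithms.
from scipy.stats import ttest_ind

metal_runs = [812.3, 797.1, 820.4, 805.9, 799.6]     # placeholder penalized returns
baseline_runs = [640.2, 655.8, 661.0, 648.7, 652.3]  # placeholder penalized returns

# equal_var=False selects Welch's variant (does not assume equal variances).
stat, p_value = ttest_ind(metal_runs, baseline_runs, equal_var=False)
print(f"t = {stat:.2f}, p = {p_value:.2e}")
```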
Reference
  • Abbas Abdolmaleki, Jost Tobias Springenberg, Jonas Degrave, Steven Bohez, Yuval Tassa, Dan Belov, Nicolas Heess, and Martin A. Riedmiller. Relative entropy regularized policy iteration. CoRR, abs/1812.02256, 2018.
  • Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 22–31. JMLR.org, 2017.
  • Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
  • Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
  • Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 449–458. JMLR.org, 2017.
  • Steven Bohez, Abbas Abdolmaleki, Michael Neunert, Jonas Buchli, Nicolas Heess, and Raia Hadsell. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623, 2019.
  • Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
  • Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning, 2018.
  • Gabriel Dulac-Arnold, Daniel J. Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. CoRR, abs/1904.12901, 2019.
  • Gabriel Dulac-Arnold, Nir Levine, Daniel J. Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. An empirical investigation of the challenges of real-world reinforcement learning. arXiv preprint arXiv:2003.11881, 2020a.
  • Gabriel Dulac-Arnold, Nir Levine, Daniel J. Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. An empirical investigation of the challenges of real-world reinforcement learning, 2020b.
  • Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained MDPs, 2020.
  • Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization, 2017.
  • Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, pp. 7555–7565, 2019.
  • Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 2019.
  • Harsh Satija, Philip Amortila, and Joelle Pineau. Constrained Markov decision processes via backward value functions. arXiv preprint arXiv:2008.11811, 2020.
  • David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550, 2017.
  • Ankur Sinha, Pekka Malo, and Kalyanmoy Deb. A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2):276–295, 2017.