# Regularized Inverse Reinforcement Learning

International Conference on Learning Representations, 2020.

Abstract:

Inverse Reinforcement Learning (IRL) aims to facilitate a learner's ability to imitate expert behavior by acquiring reward functions that explain the expert's decisions. Regularized IRL applies convex regularizers to the learner's policy in order to avoid the expert's behavior being rationalized by arbitrary constant rewards, also known...

Introduction

- Reinforcement learning (RL) has been successfully applied to many challenging domains including games (Mnih et al, 2015; 2016) and robot control (Schulman et al, 2015; Fujimoto et al, 2018; Haarnoja et al, 2018).
- Since RL requires a given or known reward function, Inverse Reinforcement Learning (IRL) (Russell, 1998; Ng et al, 2000), the problem of acquiring a reward function that promotes expert-like behavior, is more generally adopted in practical scenarios such as robotic manipulation (Finn et al, 2016b), autonomous driving (Sharifzadeh et al, 2016; Wu et al, 2020), and clinical motion analysis (Li et al, 2018).
- In these scenarios, defining a reward function beforehand is challenging, so IRL is the more pragmatic choice.
- IRL in unregularized MDPs suffers from degeneracy: any constant reward function can rationalize the expert's behavior (Ng et al, 2000).
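
The degeneracy issue can be illustrated with a toy example. The sketch below assumes a hypothetical two-state, two-action MDP with deterministic transitions (the MDP, the discount factor, and all names are illustrative and do not come from the paper). Under a constant reward every policy earns the same discounted return, so such a reward "explains" any expert behavior equally well.

```python
# Hypothetical toy MDP; every number here is an illustrative assumption.
GAMMA = 0.9
# P[s][a] is the next state (deterministic transitions keep the example short).
P = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}


def policy_value(policy, reward, start=0, horizon=500):
    """Discounted return of a deterministic policy, truncated at `horizon` steps."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        action = policy[state]
        total += discount * reward(state, action)
        discount *= GAMMA
        state = P[state][action]
    return total


def constant_reward(state, action):
    """A reward that ignores both state and action."""
    return 1.0


# Enumerate all four deterministic policies of the toy MDP.
policies = [{0: a0, 1: a1} for a0 in (0, 1) for a1 in (0, 1)]
values = [policy_value(pi, constant_reward) for pi in policies]
print(values)  # all entries coincide, so every policy is optimal
```

Because no policy is preferred over another, a constant reward carries no information about the expert, which is the motivation for adding a policy regularizer.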

Highlights

- We summarize our contributions as follows: unlike the solutions in Geist et al (2019), we propose tractable solutions for regularized Inverse Reinforcement Learning (IRL) problems that can be derived from policy regularization and its gradient in discrete control problems (Section 3.1).
- We theoretically derive the solution of regularized IRL and show that learning with the resulting rewards is equivalent to a specific instance of imitation learning, namely one that minimizes the Bregman divergence associated with the policy regularizer.
- We propose Regularized Adversarial Inverse Reinforcement Learning (RAIRL), a practical sample-based IRL algorithm for regularized Markov decision processes (MDPs), and evaluate its applicability to policy imitation and reward acquisition.
- Recent advances in imitation learning and IRL regard imitation learning as a statistical divergence minimization problem (Ke et al, 2019; Ghasemipour et al, 2019).
- We believe that considering RL with forms of policy regularization different from those of Geist et al (2019), together with its inverse problem, is a possible way of finding links between imitation learning and various statistical distances.

Methods

- The authors summarize the experimental setup as follows. In the experiments, they consider $\Omega(p) = -\lambda\,\mathbb{E}_{a \sim p}[\phi(p(a))]$ with the following regularizers from Yang et al (2019): (1) Shannon entropy ($\phi(x) = -\log x$), (2) Tsallis entropy regularizer ($\phi(x) = \frac{k}{q-1}(1 - x^{q-1})$), (3) exp regularizer ($\phi(x) = e - e^{x}$), (4) cos regularizer ($\phi(x) = \cos(\frac{\pi}{2}x)$), and (5) sin regularizer ($\phi(x) = 1 - \sin(\frac{\pi}{2}x)$).
- These regularizers were chosen because, to the best of the authors' knowledge, other regularizers have not been empirically validated.
- Restricting the study to these regularizers also makes the empirical analysis more tractable.
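
As a rough illustration of the regularizer family, the sketch below evaluates $\Omega(p) = -\lambda\,\mathbb{E}_{a \sim p}[\phi(p(a))]$ for a discrete action distribution. It assumes the forms $\phi(x) = -\log x$ (Shannon), $\phi(x) = \frac{k}{q-1}(1 - x^{q-1})$ (Tsallis), $\phi(x) = e - e^{x}$ (exp), $\phi(x) = \cos(\frac{\pi}{2}x)$ (cos), and $\phi(x) = 1 - \sin(\frac{\pi}{2}x)$ (sin); the function names and the default values of `k`, `q`, and `lam` are hypothetical choices for this example, not settings from the paper.

```python
import math


def phi_shannon(x):
    return -math.log(x)


def phi_tsallis(x, k=1.0, q=2.0):
    # Assumed form k / (q - 1) * (1 - x^(q - 1)); k and q are illustrative defaults.
    return k / (q - 1.0) * (1.0 - x ** (q - 1.0))


def phi_exp(x):
    return math.e - math.exp(x)


def phi_cos(x):
    return math.cos(0.5 * math.pi * x)


def phi_sin(x):
    return 1.0 - math.sin(0.5 * math.pi * x)


def omega(p, phi, lam=1.0):
    """Omega(p) = -lam * E_{a~p}[phi(p(a))] for a discrete distribution p."""
    # Zero-probability actions contribute nothing to the expectation.
    return -lam * sum(pa * phi(pa) for pa in p if pa > 0.0)


uniform = [0.25] * 4
for name, phi in [("shannon", phi_shannon), ("tsallis", phi_tsallis),
                  ("exp", phi_exp), ("cos", phi_cos), ("sin", phi_sin)]:
    print(name, omega(uniform, phi))
```

With the Shannon choice, `omega` reduces to the negative Shannon entropy $\sum_a p(a)\log p(a)$. Each assumed $\phi$ vanishes at $x = 1$, so a deterministic policy incurs zero regularization.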

Conclusion

- The authors consider the problem of IRL in regularized MDPs (Geist et al, 2019), assuming a class of strongly convex policy regularizers.
- The authors theoretically derive the solution of this problem and show that learning with the resulting rewards is equivalent to a specific instance of imitation learning, namely one that minimizes the Bregman divergence associated with the policy regularizer.
- The authors believe that considering RL with forms of policy regularization different from those of Geist et al (2019), together with its inverse problem, is a possible way of finding links between imitation learning and various statistical distances.
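
To make the link between a policy regularizer and its Bregman divergence concrete, the following sketch uses the standard definition $D_\Omega(p \,\|\, q) = \Omega(p) - \Omega(q) - \langle \nabla\Omega(q),\, p - q \rangle$ on a single discrete action distribution. It checks numerically that, for the negative Shannon entropy, this divergence coincides with the KL divergence. The distributions `p` and `q` and all function names are illustrative assumptions; the paper's objective is defined over policies rather than over one fixed distribution.

```python
import math


def neg_entropy(p):
    """Negative Shannon entropy, Omega(p) = sum_a p(a) log p(a)."""
    return sum(x * math.log(x) for x in p if x > 0.0)


def grad_neg_entropy(p):
    """Gradient of the negative entropy; requires strictly positive entries."""
    return [math.log(x) + 1.0 for x in p]


def bregman(p, q, omega, grad_omega):
    """D_Omega(p || q) = Omega(p) - Omega(q) - <grad Omega(q), p - q>."""
    g = grad_omega(q)
    inner = sum(gi * (pi - qi) for gi, pi, qi in zip(g, p, q))
    return omega(p) - omega(q) - inner


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


p = [0.7, 0.2, 0.1]  # stand-in for the expert's action distribution
q = [0.3, 0.3, 0.4]  # stand-in for the learner's action distribution
d_bregman = bregman(p, q, neg_entropy, grad_neg_entropy)
d_kl = kl(p, q)
print(d_bregman, d_kl)  # the two values agree up to floating-point error
```

Swapping in a different strongly convex `omega` and its gradient yields a different divergence, which is the sense in which the choice of regularizer determines the imitation objective.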


- Table 1: Policy regularizers φ and their corresponding f_φ (Yang et al, 2019)
- Table 2: Hyperparameters for Bandit environments
- Table 3: Hyperparameters for Bermuda World environment
- Table 4: Hyperparameters for MuJoCo environments

Reference

- Shun-Ichi Amari. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory, 55(11):4925–4931, 2009.
- Abdeslam Boularias and Brahim Chaib-Draa. Bootstrapping apprenticeship learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 289–297, 2010.
- Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3):200–217, 1967.
- Imre Csiszár. Eine informationstheoretische ungleichung und ihre anwendung auf den beweis der ergodizitat von markoffschen ketten. Magyar. Tud. Akad. Mat. Kutató Int. Közl, 8:85–108, 1963.
- Robert Dadashi, Léonard Hussenot, Matthieu Geist, and Olivier Pietquin. Primal wasserstein imitation learning. arXiv preprint arXiv:2006.04678, 2020.
- Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016a.
- Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 49–58, 2016b.
- Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adverserial inverse reinforcement learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
- Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1582–1591, 2018.
- Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 2160–2169, 2019.
- Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Proceedings of the 3rd Conference on Robot Learning (CoRL), 2019.
- Ian Goodfellow, Jean Pouget Abadie, Mehdi Mirza, Bing Xu, David Warde Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2672–2680, 2014.
- Xin Guo, Johnny Hong, and Nan Yang. Ambiguity set and learning via Bregman and Wasserstein. arXiv preprint arXiv:1705.08056, 2017.
- Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861–1870, 2018.
- Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4565–4573, 2016.
- Lee K Jones and Charles L Byrne. General entropy criteria for inverse problems, with applications to data compression, pattern classification, and cluster analysis. IEEE Transactions on Information Theory, 36(1):23–30, 1990.
- Liyiming Ke, Matt Barnes, Wen Sun, Gilwoo Lee, Sanjiban Choudhury, and Siddhartha Srinivasa. Imitation learning as f-divergence minimization. arXiv preprint arXiv:1905.12888, 2019.
- Kyungjae Lee, Sungjoon Choi, and Songhwai Oh. Maximum causal tsallis entropy imitation learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4403–4413, 2018.
- Kyungjae Lee, Sungyub Kim, Sungbin Lim, Sungjoon Choi, and Songhwai Oh. Tsallis reinforcement learning: A unified framework for maximum entropy reinforcement learning. arXiv preprint arXiv:1902.00137, 2019.
- Kyungjae Lee, Sungyub Kim, Sungbin Lim, Sungjoon Choi, Mineui Hong, Jaein Kim, Yong-Lae Park, and Songhwai Oh. Generalized Tsallis entropy reinforcement learning and its application to soft mobile robots. Robotics: Science and Systems Foundation, 2020.
- Kun Li, Mrinal Rath, and Joel W Burdick. Inverse reinforcement learning via function approximation for clinical motion analysis. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 610–617. IEEE, 2018.
- Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 2391–2400, 2017.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1928–1937, 2016.
- Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning (ICML), pp. 278–287, 1999.
- Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pp. 663–670, 2000.
- Frank Nielsen and Richard Nock. On Rényi and Tsallis entropies and divergences for exponential families. arXiv preprint arXiv:1105.3259, 2011.
- Stuart Russell. Learning agents for uncertain environments. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 101–103, 1998.
- John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 1889–1897, 2015.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Sahand Sharifzadeh, Ioannis Chiotellis, Rudolph Triebel, and Daniel Cremers. Learning to drive using inverse reinforcement learning and deep Q-networks. arXiv preprint arXiv:1612.03653, 2016.
- Adam Stooke and Pieter Abbeel. rlpyt: A research code base for deep reinforcement learning in pytorch. arXiv preprint arXiv:1909.01500, 2019.
- Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 1032–1039, 2008.
- Zheng Wu, Liting Sun, Wei Zhan, Chenyu Yang, and Masayoshi Tomizuka. Efficient sampling-based maximum entropy inverse reinforcement learning with application to autonomous driving. IEEE Robotics and Automation Letters, 5(4):5355–5362, 2020.
- Wenhao Yang, Xiang Li, and Zhihua Zhang. A regularized approach to sparse optimal policy in reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5938–5948, 2019.
- Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDice: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2019.
- Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.
- Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, volume 8, pp. 1433–1438, 2008.

Under review as a conference paper at ICLR 2021.

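- It should be noted that Proposition 1 in Geist et al. (2019) tells us that $\nabla\Omega^*(Q(s,\cdot))$ is the policy that uniquely maximizes Eq. (15). For example, when $\Omega(\pi(\cdot|s)) = \mathbb{E}_{a \sim \pi(\cdot|s)}[\log \pi(a|s)]$ (negative Shannon entropy), $\nabla\Omega^*(Q(s,\cdot))$ is a softmax policy, i.e., $\nabla\Omega^*(Q(s,\cdot)) = \exp(Q(s,\cdot)) / \sum_{a'} \exp(Q(s,a'))$.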
- Note that for two continuous distributions P1 and P2 having probability density functions p1(x) and p2(x), respectively, the Bregman divergence can be defined as (Guo et al., 2017; Jones & Byrne, 1990)
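
The softmax remark above can be checked numerically. The sketch below relies on the standard fact that the convex conjugate of the negative Shannon entropy over the probability simplex is the log-sum-exp function, and compares a finite-difference gradient of log-sum-exp with the softmax of the same values. The Q-values and helper names are hypothetical and chosen only for illustration.

```python
import math


def log_sum_exp(q_values):
    """Conjugate of the negative Shannon entropy: Omega*(Q) = log sum_a exp(Q_a)."""
    m = max(q_values)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in q_values))


def softmax(q_values):
    m = max(q_values)
    exps = [math.exp(v - m) for v in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        upper = list(x)
        lower = list(x)
        upper[i] += eps
        lower[i] -= eps
        grad.append((f(upper) - f(lower)) / (2.0 * eps))
    return grad


q_values = [1.0, 2.0, 0.5]  # illustrative Q(s, .) for a single state
analytic = softmax(q_values)
numeric = numerical_gradient(log_sum_exp, q_values)
print(analytic)
print(numeric)  # agrees with the softmax up to finite-difference error
```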
