# Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization

NeurIPS 2020.

Abstract:

Adversarial imitation learning alternates between learning a discriminator -- which tells apart the expert's demonstrations from generated ones -- and a generator's policy to produce trajectories that can fool this discriminator. This alternated optimization is known to be delicate in practice since it compounds unstable adversarial training […]

Introduction

- Imitation Learning (IL) treats the task of learning a policy from a set of expert demonstrations.
- Behavioral cloning casts IL as a supervised learning objective and seeks to imitate the expert’s actions using the provided demonstrations as a fixed dataset [19].
- This usually requires a lot of expert data and results in agents that struggle to generalize.

Highlights

- Imitation Learning (IL) treats the task of learning a policy from a set of expert demonstrations
- IL is effective on control problems that are challenging for traditional reinforcement learning methods, either due to reward function design challenges or the inherent difficulty of the task itself [1, 23]
- In addition to drastically simplifying the adversarial inverse reinforcement learning (IRL) framework, our methods perform on par with or better than previous approaches on all but one environment
- While Generative Adversarial Imitation Learning (GAIL) was originally proposed without a gradient penalty (GP) [13], we empirically found that GP prevents the discriminator from overfitting and enables reinforcement learning (RL) to exploit dense rewards, which greatly improves sample efficiency
- We propose an important simplification to the adversarial inverse reinforcement learning framework by removing the reinforcement learning optimisation loop altogether
- We evaluate our approach against prior works on many different benchmarking tasks and show that our method (ASAF) compares favorably to the predominant imitation learning algorithms
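The gradient penalty mentioned above is the standard WGAN-GP regularizer of Gulrajani et al [9]: penalize the norm of the discriminator's input gradient at random interpolates between expert and generated samples. The paper does not spell out an implementation, so the following is a minimal dependency-free sketch using a toy logistic discriminator whose input gradient has a closed form; all names and values are illustrative, not from the paper (real implementations compute the gradient with autograd):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_penalty(w, b, x_expert, x_gen, rng):
    """WGAN-GP-style penalty for a toy logistic discriminator
    D(x) = sigmoid(w @ x + b), evaluated at random interpolates
    between expert and generated samples."""
    eps = rng.uniform(size=(x_expert.shape[0], 1))
    x_hat = eps * x_expert + (1.0 - eps) * x_gen   # interpolated inputs
    d = sigmoid(x_hat @ w + b)                     # discriminator outputs
    # For D(x) = sigmoid(w @ x + b): grad_x D = D * (1 - D) * w
    grads = (d * (1.0 - d))[:, None] * w           # shape (batch, dim)
    norms = np.linalg.norm(grads, axis=1)
    return np.mean((norms - 1.0) ** 2)             # push gradient norms toward 1

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x_expert = rng.normal(size=(8, 3))
x_gen = rng.normal(size=(8, 3))
gp = gradient_penalty(w, 0.0, x_expert, x_gen, rng)
print(gp >= 0.0)  # the penalty is a non-negative scalar
```

In practice this term is added, scaled by a coefficient, to the discriminator's binary cross-entropy loss.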

Methods

**Experiments on classic control and Box2D tasks**

Figure 1 shows that ASAF and its approximate variations ASAF-1 and ASAF-w quickly converge to the expert's performance.
- For GAIL and AIRL, the less stable learning is likely due to the concurrent RL and IRL loops, whereas for SQIL, it has been noted that an effective reward decay can occur when accurately mimicking the expert [20]
- This instability is severe in the continuous control case.
- To scale up the evaluations in continuous control, the authors use the popular MuJoCo simulator.
- In this domain, the trajectory length is either fixed at a large value (1000 steps on HalfCheetah) or varies a lot across episodes (Hopper and Walker2d).

Results

**Results and discussion**

The authors evaluate the methods on a variety of discrete and continuous control tasks.
- Figure 4 shows that ASQF performs well on small-scale environments but struggles and eventually fails on more complicated environments
- It seems that ASQF does not scale well with the observation space size.
- For each state, several transitions with different actions are required in order to learn its partition function
- Approximating this partition function could lead to assigning too low a probability to expert-like actions and eventually failing to behave appropriately.
- ASAF, on the other hand, explicitly learns the probability of an action given the state – in other words, it explicitly learns the partition function – and is immune to that problem
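To see why a corrupted partition function hurts, consider a soft-Q policy π(a|s) = exp(Q(s,a)) / Σ_a' exp(Q(s,a')): if Q is overestimated at rarely visited actions, the normalizer is inflated and probability mass is stolen from the expert-like action. A toy illustration (the Q-values below are invented for illustration, not taken from the paper):

```python
import numpy as np

# True soft-Q values for one state over 4 actions; action 0 is expert-like.
q_true = np.array([5.0, 1.0, 0.5, 0.2])
pi_true = np.exp(q_true) / np.exp(q_true).sum()

# ASQF-style estimate: Q was only fit on a few visited actions, so the
# values for rarely visited actions (here 2 and 3) are badly overestimated.
q_est = np.array([5.0, 1.0, 6.0, 6.0])
pi_est = np.exp(q_est) / np.exp(q_est).sum()  # the partition sum is inflated

print(pi_true[0])  # expert-like action dominates under the true Q
print(pi_est[0])   # the same action is badly under-weighted
```

Directly parameterizing π(a|s) (as ASAF does) bakes the normalization into the model, so errors at unvisited actions cannot corrupt it this way.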

Conclusion

- The authors propose an important simplification to the adversarial inverse reinforcement learning framework by removing the reinforcement learning optimisation loop altogether.
- By using a particular form for the discriminator, the method recovers a policy that matches the expert’s trajectory distribution.
- The authors evaluate the approach against prior works on many different benchmarking tasks and show that the method (ASAF) compares favorably to the predominant imitation learning algorithms.
- The authors' approach still involves a reward learning module through its discriminator, and it would be interesting in future work to explore how ASAF can be used to learn robust rewards along the lines of Fu et al [6].
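The "particular form for the discriminator" can be sketched concretely: ASAF builds the discriminator from a learned policy, D = π_θ / (π_θ + π_G), so that minimizing the usual binary cross-entropy drives π_θ toward the expert's distribution with no RL loop. Below is a minimal single-state (bandit) sketch with an invented expert distribution and, for simplicity, a fixed uniform generator (in ASAF the generator side uses the previous policy's rollouts):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

p_expert = np.array([0.7, 0.2, 0.1])  # expert action distribution (invented)
pi_g = np.ones(3) / 3                 # fixed generator policy: uniform

def bce_loss(theta):
    pi = softmax(theta)
    d = pi / (pi + pi_g)              # structured discriminator D = pi / (pi + pi_g)
    # binary cross-entropy: expert samples labeled 1, generated samples labeled 0
    return -(p_expert @ np.log(d)) - (pi_g @ np.log(1.0 - d))

theta = np.zeros(3)
for _ in range(2000):
    # forward-difference gradient keeps the sketch dependency-free
    grad = np.array([
        (bce_loss(theta + 1e-5 * np.eye(3)[i]) - bce_loss(theta)) / 1e-5
        for i in range(3)
    ])
    theta -= 0.5 * grad

pi_theta = softmax(theta)
print(np.round(pi_theta, 2))  # close to the expert distribution
```

At the optimum D* = p_E / (p_E + π_G), which forces π_θ = p_E: the discriminator's policy component directly recovers the expert's behavior.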

Summary

## Objectives:

Since the aim is to propose efficient yet simple methods, the authors focus on the Generative Adversarial formulation and the MaxEnt IRL framework.

- Table1: Fixed Hyperparameters for classic control tasks
- Table2: Best found hyper-parameters for Cartpole
- Table3: Best found hyper-parameters for Mountaincar
- Table4: Best found hyper-parameters for Lunarlander
- Table5: Best found hyper-parameters for Pendulum
- Table6: Best found hyper-parameters for Mountaincar-c
- Table7: Best found hyper-parameters for Lunarlander-c
- Table8: Hyperparameters for MuJoCo environments
- Table9: Fixed Hyperparameters for Pommerman Random-Tag environment
- Table10: Figure 3 uses these configurations retrained on 10 seeds. Best found hyper-parameters for the Pommerman Random-Tag environment
- Table11: Expert demonstrations used for Imitation Learning

Related work

- Ziebart et al [29] first proposed MaxEnt IRL, the foundation of modern IL. Ziebart [28] further elaborated MaxEnt IRL and derived the optimal form of the MaxEnt policy at the core of our methods. Finn et al [4] proposed a GAN formulation of IRL that leveraged the energy-based models of Ziebart [28]. Finn et al [5]'s implementation of this method, however, relied on processing full trajectories with a Linear Quadratic Regulator and on optimizing with guided policy search to manage the high variance of trajectory costs. To retrieve robust rewards, Fu et al [6] proposed a straightforward transposition of [4] to state-action transitions. In doing so, however, they had to do away with a GAN objective during policy optimization, consequently minimizing the Kullback–Leibler divergence from the expert occupancy measure to the policy occupancy measure (instead of the Jensen–Shannon divergence) [7].

Funding

- We would also like to thank Fonds de Recherche Nature et Technologies (FRQNT), Ubisoft Montreal and Mitacs Accelerate Program for providing funding for this work as well as Compute Canada for providing the computing resources

Reference

- Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
- Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. In Proceedings of the 29th International Conference on Machine Learning, pages 179–186, 2012.
- Yiming Ding, Carlos Florensa, Pieter Abbeel, and Mariano Phielipp. Goal-conditioned imitation learning. In Advances in Neural Information Processing Systems, pages 15298–15309, 2019.
- Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.
- Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.
- Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
- Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Proceedings of the 3rd Conference on Robot Learning, 2019.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
- Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1352–1361. JMLR. org, 2017.
- Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
- Elad Hazan, Sham M Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2018.
- Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in neural information processing systems, pages 4565–4573, 2016.
- Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Hk4fpoA5Km.
- Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hyg-JC4FDr.
- Alex Kuefler, Jeremy Morton, Tim Wheeler, and Mykel Kochenderfer. Imitating driver behavior with generative adversarial networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 204–211. IEEE, 2017.
- Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-PCL: An offpolicy trust region method for continuous control. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyrCWeWCb.
- Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Efficient estimation of off-policy stationary distribution corrections. 2019.
- Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural computation, 3(1):88–97, 1991.
- Siddharth Reddy, Anca D. Dragan, and Sergey Levine. SQIL: Imitation learning via reinforcement learning with sparse rewards, 2019.
- Cinjon Resnick, Wes Eldridge, David Ha, Denny Britz, Jakob Foerster, Julian Togelius, Kyunghyun Cho, and Joan Bruna. Pommerman: A multi-agent playground. arXiv preprint arXiv:1809.07124, 2018.
- Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, pages 1530–1538, 2015.
- Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.
- Fumihiro Sasaki, Tetsuya Yohira, and Atsuo Kawaguchi. Sample efficient imitation learning for continuous control. 2018.
- John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Hongwei Zhou, Yichen Gong, Luvneesh Mugrai, Ahmed Khalifa, Andy Nealen, and Julian Togelius. A hybrid search agent in pommerman. In Proceedings of the 13th International Conference on the Foundations of Digital Games, pages 1–4, 2018.
- Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.
- Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438, 2008.
