# Error Bounds of Imitating Policies and Environments

NeurIPS 2020

Abstract

Imitation learning trains a policy by mimicking expert demonstrations. Various imitation methods were proposed and empirically evaluated, meanwhile, their theoretical understanding needs further studies. In this paper, we firstly analyze the value gap between the expert policy and imitated policies by two imitation methods, behavioral c...

Introduction

- Sequential decision-making under uncertainty is challenging due to the stochastic dynamics and delayed feedback [27, 8].
- Ho and Ermon [23] revealed that apprenticeship learning (AL) can be viewed as a state-action occupancy measure matching problem.
- From this connection, they proposed the algorithm generative adversarial imitation learning (GAIL).
- An infinite-horizon Markov decision process (MDP) [46, 38] is described by a tuple M = (S, A, M∗, R, γ, d0), where S is the state space, A is the action space, M∗ is the transition function, R is the reward function, γ is the discount factor, and d0 specifies the initial state distribution.
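The MDP tuple above can be sketched as a small container type. The following is an illustrative Python sketch (the class and field names are assumptions for exposition, not the paper's code):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A minimal container mirroring the tuple M = (S, A, M*, R, gamma, d0)
# from the text. Names are illustrative, not from the paper.
@dataclass
class MDP:
    states: List[str]                                  # S: state space
    actions: List[str]                                 # A: action space
    transition: Dict[Tuple[str, str], Dict[str, float]]  # M*: P(s' | s, a)
    reward: Callable[[str, str], float]                # R(s, a)
    gamma: float                                       # discount factor in [0, 1)
    d0: Dict[str, float]                               # initial state distribution

# Tiny two-state example: the agent alternates between s0 and s1.
mdp = MDP(
    states=["s0", "s1"],
    actions=["a"],
    transition={("s0", "a"): {"s1": 1.0}, ("s1", "a"): {"s0": 1.0}},
    reward=lambda s, a: 1.0 if s == "s1" else 0.0,
    gamma=0.99,
    d0={"s0": 1.0},
)
```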

Highlights

- Sequential decision-making under uncertainty is challenging due to the stochastic dynamics and delayed feedback [27, 8]
- The results indicate that the environment model learning through adversarial approaches enjoys a linear policy evaluation error w.r.t. the model-bias, which improves the previous quadratic results [31, 25] and suggests a promising application of generative adversarial imitation learning (GAIL) for model-based reinforcement learning
- We briefly introduce two popular methods considered in this paper, behavioral cloning (BC) [37] and generative adversarial imitation learning (GAIL) [23]
- This paper presents error bounds of BC and GAIL for imitating-policies and imitating-environments in the infinite horizon setting, mainly showing that GAIL can achieve a linear dependency on the effective horizon while BC has a quadratic dependency
- We would like to highlight that the result of the paper may shed some light for model-based reinforcement learning (MBRL)
- Our analysis suggests that the BC-like transition learner can be replaced by a GAIL-style learner to improve the generalization ability, which partially explains why the GAIL-style environment model learning approaches in [44, 43] work well
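The highlighted quadratic-versus-linear dependency on the effective horizon 1/(1 − γ) can be made concrete with a quick back-of-the-envelope calculation. The constants below are illustrative, not the paper's exact bounds:

```python
# Compare the order of the value gap for BC (quadratic in the effective
# horizon 1/(1 - gamma)) versus GAIL (linear), at a fixed imitation error.
eps = 0.01  # per-step imitation error (illustrative)

for gamma in (0.9, 0.99, 0.999):
    horizon = 1.0 / (1.0 - gamma)   # effective horizon
    bc_gap = eps * horizon ** 2     # O(eps / (1 - gamma)^2), BC-style bound
    gail_gap = eps * horizon        # O(eps / (1 - gamma)), GAIL-style bound
    print(f"gamma={gamma}: BC ~ {bc_gap:.2f}, GAIL ~ {gail_gap:.2f}")
```

At γ = 0.999 the BC-style bound is three orders of magnitude larger than the GAIL-style bound, which is the sense in which the effective horizon matters.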

Methods

- 6.1 Imitating Policies
- The authors evaluate imitation learning methods on three MuJoCo benchmark tasks in OpenAI Gym [10], where the agent aims to mimic locomotion skills.
- The authors consider the following approaches: BC [37], DAgger [40], GAIL [23], the maximum entropy IRL algorithm AIRL [17], and the apprenticeship learning algorithms FEM [1] and GTAL [47].
- FEM and GTAL are based on the improved versions proposed in [24].
- All experiments are run with 3 random seeds.
- Experiment details are given in Appendix E.1.
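Among the compared methods, BC is the simplest: it reduces imitation to supervised learning on expert (state, action) pairs. A minimal self-contained sketch with a logistic policy (this is an illustration of the idea, not the paper's implementation):

```python
import numpy as np

# Behavioral cloning as supervised learning: fit a logistic policy
# pi(a=1 | s) to expert (state, action) pairs by maximizing the
# log-likelihood of the expert's actions via gradient ascent.
rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))  # synthetic expert states
# Synthetic expert: a deterministic linear-threshold policy.
actions = (states @ np.array([1.0, -1.0, 0.5, 0.0]) > 0).astype(float)

w = np.zeros(4)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(states @ w)))          # pi(a=1 | s)
    w += 0.1 * states.T @ (actions - p) / len(states)  # log-likelihood gradient

acc = ((1.0 / (1.0 + np.exp(-(states @ w))) > 0.5) == actions).mean()
print(f"agreement with expert actions: {acc:.2f}")
```

The compounding-error issue analyzed in the paper arises precisely because this supervised objective only measures one-step agreement on the expert's state distribution.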

Conclusion

- This paper presents error bounds of BC and GAIL for imitating policies and imitating environments in the infinite-horizon setting, mainly showing that GAIL can achieve a linear dependency on the effective horizon while BC has a quadratic dependency.
- The authors would like to highlight that the results of the paper may shed some light on model-based reinforcement learning (MBRL).
- Previous MBRL methods mostly involve a BC-like transition learning component that can cause a high model-bias.
- The authors' analysis suggests that the BC-like transition learner can be replaced by a GAIL-style learner to improve the generalization ability, which partially explains why the GAIL-style environment model learning approaches in [44, 43] work well.
- The authors hope this work will inspire future research in this direction.


- Table 1: List of f-divergences
- Table 2: Information about tasks in imitating policies
- Table 3: Key parameters of Behavioral Cloning
- Table 4: Key parameters of DAgger
- Table 5: Key parameters of GAIL, AIRL, FEM and GTAL
- Table 6: Discounted returns of learned policies under γ = 0.9, 0.99, and 0.999, reported as mean ± standard deviation

Related Work

- In the domain of imitating policies, prior studies [39, 48, 40, 12] considered the finite-horizon setting and revealed that behavioral cloning [37] leads to compounding errors (i.e., an optimality gap of O(T^2), where T is the horizon length). DAgger [40] improved this optimality gap to O(T) at the cost of additional expert queries. Recently, based on the generative adversarial network (GAN) [20], generative adversarial imitation learning [23] was proposed and has achieved much empirical success [17, 28, 29, 11]. Though many theoretical results have been established for GAN [5, 54, 3, 26], the theoretical properties of GAIL are not well understood. To the best of our knowledge, only recently have studies emerged towards understanding the generalization and computation properties of GAIL [13, 55]. The closest work to ours is [13], where the authors considered the generalization ability of GAIL under a finite-horizon setting with complete expert trajectories. In particular, they analyzed the generalization ability of the proposed R-distance, but they did not provide a bound on the policy value gap, which is of interest in practice. On the other hand, the global convergence properties with neural network function approximation were further analyzed in [55].
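The O(T^2) compounding-error intuition can be illustrated with a standard back-of-the-envelope model (an illustration of the finite-horizon argument, not the paper's proof): suppose BC makes a mistake with probability eps at each step while still on the expert's distribution, and a first mistake at step t costs roughly the remaining T − t steps.

```python
# Expected optimality gap under the simple "first mistake at step t
# costs T - t" model: sum over t of P(first mistake at t) * (T - t).
# The naive worst-case-style bound eps * T^2 / 2 is tight when eps*T << 1.
eps = 0.01
for T in (10, 100, 1000):
    expected_gap = sum(eps * (1 - eps) ** t * (T - t) for t in range(T))
    print(f"T={T}: expected gap ~ {expected_gap:.1f}  (eps*T^2/2 = {eps * T * T / 2:.1f})")
```

The gap grows much faster than linearly in T, which is the behavior the O(T^2) bound for BC captures; DAgger's expert queries keep the learner on-distribution and recover O(T).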

Funding

- This work is supported by National Key R&D Program of China (2018AAA0101100), NSFC (61876077), and Collaborative Innovation Center of Novel Software Technology and Industrialization.

References

- Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML’04), pages 1–8, 2004.
- Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In Proceedings of the 5th International Conference on Learning Representations (ICLR’17), 2017.
- Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR’17), 2017.
- Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, (ICML’17), pages 214–223, 2017.
- Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning (ICML’17), pages 224–232, 2017.
- Kavosh Asadi, Dipendra Misra, and Michael L. Littman. Lipschitz continuity in model-based reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning (ICML’18), pages 264–273, 2018.
- Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.
- Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML’17), pages 263–272, 2017.
- Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems 30 (NeurIPS’17), pages 6240–6249, 2017.
- Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv, 1606.01540, 2016.
- Xin-Qiang Cai, Yao-Xiang Ding, Yuan Jiang, and Zhi-Hua Zhou. Imitation learning from pixel-level demonstrations by hashreward. arXiv, 1909.03773, 2020.
- Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to search better than your teacher. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15), pages 2058–2066, 2015.
- Minshuo Chen, Yizhou Wang, Tianyi Liu, Zhuoran Yang, Xingguo Li, Zhaoran Wang, and Tuo Zhao. On computation and generalization of generative adversarial imitation learning. In Proceedings of the 8th International Conference on Learning Representations (ICLR’20), 2020.
- Felipe Codevilla, Matthias Miiller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA’18), pages 1–9, 2018.
- Imre Csiszar and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
- Imre Csiszár and Paul C. Shields. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4), 2004.
- Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR’18), 2018.
- Seyed Kamyar Seyed Ghasemipour, Richard S. Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Proceedings of the 3rd Conference on Robot Learning (CoRL’19), 2019.
- Alessandro Giusti, Jerome Guzzi, Dan C. Ciresan, Fang-Lin He, Juan P. Rodriguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, Davide Scaramuzza, and Luca Maria Gambardella. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661–667, 2016.
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NeurIPS’14), pages 2672–2680, 2014.
- Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein GANs. In Advances in Neural Information Processing Systems 30 (NeurIPS’17), pages 5767–5777, 2017.
- Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML’18), pages 1856–1865, 2018.
- Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29 (NeurIPS’16), pages 4565–4573, 2016.
- Jonathan Ho, Jayesh K. Gupta, and Stefano Ermon. Model-free imitation learning with policy optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML’16), pages 2760–2769, 2016.
- Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems 32 (NeurIPS’19), pages 12498–12509, 2019.
- Haoming Jiang, Zhehui Chen, Minshuo Chen, Feng Liu, Dingding Wang, and Tuo Zhao. On computation and generalization of generative adversarial networks under spectrum control. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19), 2019.
- Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
- Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19), 2019.
- Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. In Proceedings of the 8th International Conference on Learning Representations (ICLR’20), 2020.
- F. Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transaction on Information Theory, 52(10):4394–4412, 2006.
- Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19), 2019.
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
- Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
- Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML’00), pages 663–670, 2000.
- Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29 (NeurIPS’16), pages 271–279, 2016.
- Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19), 2019.
- Dean Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
- Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS’10), pages 661–668, 2010.
- Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS’11), pages 627–635, 2011.
- John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15), pages 1889–1897, 2015.
- Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory To Algorithms. Cambridge University Press, 2014.
- Wenjie Shang, Yang Yu, Qingyang Li, Zhiwei Qin, Yiping Meng, and Jieping Ye. Environment reconstruction with hidden confounders for reinforcement learning based recommendation. In Proceedings of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’19), pages 566–576, 2019.
- Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and Anxiang Zeng. Virtual-taobao: Virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI’19), pages 4902–4909, 2019.
- David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
- Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20 (NeurIPS’07), pages 1449–1456, 2007.
- Umar Syed and Robert E. Schapire. A reduction from apprenticeship learning to classification. In Advances in Neural Information Processing Systems 23 (NeurIPS’10), pages 2253–2261, 2010.
- Umar Syed, Michael H. Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning (ICML’08), pages 1032–1039, 2008.
- Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18), pages 4950–4957, 2018.
- Arun Venkatraman, Martial Hebert, and J. Andrew Bagnell. Improving multi-step prediction of learned time series models. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI’15), pages 3024–3030, 2015.
- Yang Yu. Towards sample efficient reinforcement learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18), pages 5739–5743, 2018.
- Chao Zhang, Yang Yu, and Zhi-Hua Zhou. Learning environmental calibration actions for policy self-evolution. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18), pages 3061–3067, 2018.
- Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, and Xiaodong He. On the discrimination-generalization tradeoff in GANs. In Proceedings of the 6th International Conference on Learning Representations (ICLR’18), 2018.
- Yufeng Zhang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Generative adversarial imitation learning with neural networks: Global optimality and convergence rate. arXiv, 2003.03709, 2020.
