Error Bounds of Imitating Policies and Environments

NeurIPS 2020 (2020)


Abstract

Imitation learning trains a policy by mimicking expert demonstrations. Various imitation methods were proposed and empirically evaluated; meanwhile, their theoretical understanding needs further studies. In this paper, we first analyze the value gap between the expert policy and imitated policies by two imitation methods, behavioral c...

Introduction
  • Sequential decision-making under uncertainty is challenging due to the stochastic dynamics and delayed feedback [27, 8].
  • Ho and Ermon [23] revealed that apprenticeship learning (AL) can be viewed as a state-action occupancy measure matching problem.
  • From this connection, they proposed the generative adversarial imitation learning (GAIL) algorithm; a schematic of this occupancy-matching view is given after this list.
  • An infinite-horizon Markov decision process (MDP) [46, 38] is described by a tuple M = (S, A, M∗, R, γ, d0), where S is the state space, A is the action space, M∗ is the transition model, R is the reward function, γ ∈ [0, 1) is the discount factor, and d0 specifies the initial state distribution.
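For reference, the occupancy-measure view mentioned above can be written out as follows. This is the standard textbook formulation rather than text taken from the paper; in particular, the (1 − γ) normalization and the symbol ρ are conventions that vary between papers.

```latex
% Discounted state-action occupancy measure of a policy \pi in the MDP
% M = (S, A, M^*, R, \gamma, d_0); the (1-\gamma) factor is one common
% normalization choice, not necessarily the one used in the paper.
\rho_\pi(s, a) \;=\; (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \,
  \Pr\!\bigl( s_t = s,\, a_t = a \;\big|\; \pi, M^*, d_0 \bigr)

% Occupancy-measure matching view of apprenticeship learning / GAIL:
% pick the policy whose occupancy measure is closest to the expert's
% under some divergence D; GAIL estimates D with a GAN-style discriminator.
\min_{\pi} \; D\bigl( \rho_\pi,\, \rho_{\pi_E} \bigr)
```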
Highlights
  • Sequential decision-making under uncertainty is challenging due to the stochastic dynamics and delayed feedback [27, 8]
  • The results indicate that environment model learning through adversarial approaches enjoys a linear policy evaluation error w.r.t. the model bias, which improves the previous quadratic results [31, 25] and suggests a promising application of generative adversarial imitation learning (GAIL) to model-based reinforcement learning
  • We briefly introduce two popular methods considered in this paper, behavioral cloning (BC) [37] and generative adversarial imitation learning (GAIL) [23]
  • This paper presents error bounds of BC and GAIL for imitating policies and imitating environments in the infinite-horizon setting, mainly showing that GAIL can achieve a linear dependency on the effective horizon while BC has a quadratic dependency (spelled out schematically after this list)
  • We would like to highlight that the results of this paper may shed light on model-based reinforcement learning (MBRL)
  • Our analysis suggests that the BC-like transition learner can be replaced by a GAIL-style learner to improve the generalization ability, which partially explains why the GAIL-style environment model learning approach in [44, 43] works well
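The quadratic-versus-linear claim above can be written schematically as follows; constants are dropped and ε stands for the respective imitation error, so this should be read as a paraphrase of the highlighted result rather than the exact theorem statements.

```latex
% Effective horizon of a \gamma-discounted MDP:
H \;=\; \frac{1}{1 - \gamma}

% Schematic policy value gaps (constants omitted; \epsilon = imitation error,
% whose precise definition differs between the BC and GAIL analyses):
\text{BC:}\quad
  \bigl| V^{\pi_E} - V^{\pi_{\mathrm{BC}}} \bigr|
  \;\lesssim\; \frac{\epsilon}{(1 - \gamma)^2} \;=\; H^2 \epsilon
\qquad
\text{GAIL:}\quad
  \bigl| V^{\pi_E} - V^{\pi_{\mathrm{GAIL}}} \bigr|
  \;\lesssim\; \frac{\epsilon}{1 - \gamma} \;=\; H \epsilon
```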
Methods
  • 6.1 Imitating Policies

    The authors evaluate imitation learning methods on three MuJoCo benchmark tasks in OpenAI Gym [10], where the agent aims to mimic locomotion skills.
  • The authors consider the following approaches: BC [37], DAgger [40], GAIL [23], the maximum entropy IRL algorithm AIRL [17], and the apprenticeship learning algorithms FEM [1] and GTAL [47].
  • FEM and GTAL are based on the improved versions proposed in [24].
  • All experiments are run with 3 random seeds.
  • Experiment details are given in Appendix E.1; a minimal behavioral cloning sketch follows this list.
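As a concrete point of reference, a minimal behavioral cloning baseline of the kind evaluated above could be sketched as below (PyTorch-style). The MLPPolicy architecture, the MSE loss, and all hyperparameters here are illustrative placeholders rather than the authors' configuration, which is reported in Appendix E.1 and Table 3.

```python
import torch
import torch.nn as nn

# Minimal behavioral-cloning sketch: fit a deterministic policy to expert
# (state, action) pairs by least-squares regression. Architecture and
# hyperparameters are placeholders, not the paper's settings.
class MLPPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, states):
        return self.net(states)


def behavioral_cloning(expert_states, expert_actions, epochs=100, lr=3e-4):
    """expert_states: (N, state_dim) tensor; expert_actions: (N, action_dim) tensor."""
    policy = MLPPolicy(expert_states.shape[1], expert_actions.shape[1])
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        predicted = policy(expert_states)
        loss = ((predicted - expert_actions) ** 2).mean()  # supervised MSE
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```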
Conclusion
  • This paper presents error bounds of BC and GAIL for imitating policies and imitating environments in the infinite-horizon setting, mainly showing that GAIL can achieve a linear dependency on the effective horizon while BC has a quadratic dependency.
  • The authors would like to highlight that the results of the paper may shed light on model-based reinforcement learning (MBRL).
  • Previous MBRL methods mostly involve a BC-like transition learning component that can cause a high model-bias.
  • The authors' analysis suggests that the BC-like transition learner can be replaced by a GAIL-style learner to improve the generalization ability, which partially explains why the GAIL-style environment model learning approach in [44, 43] works well (a schematic of such a learner follows this list).
  • The authors hope this work will inspire future research in this direction.
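To make the BC-like versus GAIL-style distinction concrete, a single training step of an adversarial transition learner might look like the sketch below. This is an illustrative GAN-style step over assumed model and disc modules, not the implementation used in [44, 43].

```python
import torch
import torch.nn.functional as F

# Schematic GAIL-style transition learner: a discriminator tries to separate
# real transitions (s, a, s') from transitions produced by a learned model,
# and the model is updated to fool it. A BC-like learner would instead
# regress s' on (s, a) directly. Names and the plain GAN loss are assumptions
# for illustration only.
def adversarial_model_step(model, disc, opt_model, opt_disc, s, a, s_next):
    fake_next = model(s, a)                      # model's predicted next state
    real = torch.cat([s, a, s_next], dim=-1)     # real transition triple
    fake = torch.cat([s, a, fake_next], dim=-1)  # model transition triple
    ones = torch.ones(s.shape[0], 1)
    zeros = torch.zeros(s.shape[0], 1)

    # Discriminator step: push real transitions toward 1, model ones toward 0.
    d_loss = (F.binary_cross_entropy_with_logits(disc(real), ones)
              + F.binary_cross_entropy_with_logits(disc(fake.detach()), zeros))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # Model (generator) step: make model transitions look real to the critic.
    g_loss = F.binary_cross_entropy_with_logits(disc(fake), ones)
    opt_model.zero_grad()
    g_loss.backward()
    opt_model.step()
    return d_loss.item(), g_loss.item()
```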
Tables
  • Table 1: List of f-divergences
  • Table 2: Information about tasks in imitating policies
  • Table 3: Key parameters of Behavioral Cloning
  • Table 4: Key parameters of DAgger
  • Table 5: Key parameters of GAIL, AIRL, FEM and GTAL
  • Table 6: Discounted returns of learned policies under γ = 0.9, γ = 0.99, and γ = 0.999; ± denotes the standard deviation
Related Work
  • In the domain of imitating policies, prior studies [39, 48, 40, 12] considered the finite-horizon setting and revealed that behavioral cloning [37] leads to compounding errors (i.e., an optimality gap of O(T^2), where T is the horizon length). DAgger [40] improved this optimality gap to O(T) at the cost of additional expert queries. Recently, based on the generative adversarial network (GAN) [20], generative adversarial imitation learning [23] was proposed and has achieved much empirical success [17, 28, 29, 11]. Though many theoretical results have been established for GANs [5, 54, 3, 26], the theoretical properties of GAIL are not well understood. To the best of our knowledge, only recently have studies emerged towards understanding the generalization and computation properties of GAIL [13, 55]. The closest work to ours is [13], where the authors considered the generalization ability of GAIL under a finite-horizon setting with complete expert trajectories. In particular, they analyzed the generalization ability of the proposed R-distance, but they did not provide a bound for the policy value gap, which is of interest in practice. On the other hand, the global convergence properties with neural network function approximation were further analyzed in [55].
Funding
  • This work is supported by the National Key R&D Program of China (2018AAA0101100), NSFC (61876077), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
References
  • Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML’04), pages 1–8, 2004.
  • Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In Proceedings of the 5th International Conference on Learning Representations (ICLR’17), 2017.
  • Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR’17), 2017.
  • Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, (ICML’17), pages 214–223, 2017.
  • Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning (ICML’17), pages 224–232, 2017.
  • Kavosh Asadi, Dipendra Misra, and Michael L. Littman. Lipschitz continuity in model-based reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning (ICML’18), pages 264–273, 2018.
  • Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.
  • Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML’17), pages 263–272, 2017.
  • Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems 30 (NeurIPS’17), pages 6240–6249, 2017.
  • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv, 1606.01540, 2016.
  • Xin-Qiang Cai, Yao-Xiang Ding, Yuan Jiang, and Zhi-Hua Zhou. Imitation learning from pixel-level demonstrations by hashreward. arXiv, 1909.03773, 2020.
  • Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to search better than your teacher. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15), pages 2058–2066, 2015.
  • Minshuo Chen, Yizhou Wang, Tianyi Liu, Zhuoran Yang, Xingguo Li, Zhaoran Wang, and Tuo Zhao. On computation and generalization of generative adversarial imitation learning. In Proceedings of the 8th International Conference on Learning Representations (ICLR’20), 2020.
  • Felipe Codevilla, Matthias Miiller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA’18), pages 1–9, 2018.
  • Imre Csiszar and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
  • Imre Csiszár and Paul C. Shields. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4), 2004.
  • Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adverserial inverse reinforcement learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR’18), 2018.
  • Seyed Kamyar Seyed Ghasemipour, Richard S. Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Proceedings of the 3rd Conference on Robot Learning (CoRL’19), 2019.
  • Alessandro Giusti, Jerome Guzzi, Dan C. Ciresan, Fang-Lin He, Juan P. Rodriguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, Davide Scaramuzza, and Luca Maria Gambardella. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661–667, 2016.
  • Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NeurIPS’14), pages 2672–2680, 2014.
  • Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein GANs. In Advances in Neural Information Processing Systems 30 (NeurIPS’17), pages 5767–5777, 2017.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML’18), pages 1856–1865, 2018.
  • Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29 (NeurIPS’16), pages 4565–4573, 2016.
  • Jonathan Ho, Jayesh K. Gupta, and Stefano Ermon. Model-free imitation learning with policy optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML’16), pages 2760–2769, 2016.
  • Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems 32 (NeurIPS’19), pages 12498–12509, 2019.
  • Haoming Jiang, Zhehui Chen, Minshuo Chen, Feng Liu, Dingding Wang, and Tuo Zhao. On computation and generalization of generative adversarial networks under spectrum control. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19), 2019.
  • Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
  • Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19), 2019.
  • Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. In Proceedings of the 8th International Conference on Learning Representations (ICLR’20), 2020.
  • F. Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transaction on Information Theory, 52(10):4394–4412, 2006.
  • Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19), 2019.
  • Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
  • Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
  • Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML’00), pages 663–670, 2000.
  • Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29 (NeurIPS’16), pages 271–279, 2016.
  • Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19), 2019.
  • Dean Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
  • Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS’10), pages 661–668, 2010.
  • Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS’11), pages 627–635, 2011.
  • John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15), pages 1889–1897, 2015.
  • Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory To Algorithms. Cambridge University Press, 2014.
  • Wenjie Shang, Yang Yu, Qingyang Li, Zhiwei Qin, Yiping Meng, and Jieping Ye. Environment reconstruction with hidden confounders for reinforcement learning based recommendation. In Proceedings of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’19), pages 566–576, 2019.
  • Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and Anxiang Zeng. Virtual-taobao: Virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI’19), pages 4902–4909, 2019.
  • David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20 (NeurIPS’07), pages 1449–1456, 2007.
  • Umar Syed and Robert E. Schapire. A reduction from apprenticeship learning to classification. In Advances in Neural Information Processing Systems 23 (NeurIPS’10), pages 2253–2261, 2010.
  • Umar Syed, Michael H. Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning (ICML’08), pages 1032–1039, 2008.
  • Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18), pages 4950–4957, 2018.
  • Arun Venkatraman, Martial Hebert, and J. Andrew Bagnell. Improving multi-step prediction of learned time series models. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI’15), pages 3024–3030, 2015.
  • Yang Yu. Towards sample efficient reinforcement learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18), pages 5739–5743, 2018.
  • Chao Zhang, Yang Yu, and Zhi-Hua Zhou. Learning environmental calibration actions for policy self-evolution. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18), pages 3061–3067, 2018.
  • Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, and Xiaodong He. On the discrimination-generalization tradeoff in GANs. In Proceedings of the 6th International Conference on Learning Representations (ICLR’18), 2018.
  • Yufeng Zhang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Generative adversarial imitation learning with neural networks: Global optimality and convergence rate. arXiv, 2003.03709, 2020.
Authors
Tian Xu
Ziniu Li
Yang Yu