Causal Imitation Learning With Unobserved Confounders

NeurIPS 2020.

TL;DR: We investigate imitation learning in the semantics of structural causal models.

Abstract

One of the common ways children learn is by mimicking adults. Imitation learning focuses on learning policies with suitable performance from demonstrations generated by an expert, with an unspecified performance measure and an unobserved reward signal. Popular methods for imitation learning start by either directly mimicking the behavior policy... [truncated]

Introduction
  • A unifying theme of Artificial Intelligence is to learn a policy from observations in an unknown environment such that a suitable level of performance is achieved [33, Ch. 1.1].
  • Algorithms and estimation methods have been developed to solve this problem [29, 36, 3, 6, 35, 32, 44].
  • In many applications, it is not clear which performance measure the demonstrator is optimizing.
  • The reward signal is neither labeled nor accessible in the observed expert’s trajectories.
  • In such settings, the performance of candidate policies is not uniquely discernible from the observational data due to latent outcomes, even when infinitely many samples are gathered, which complicates efforts to learn a policy with satisfactory performance (see the toy sketch below).
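To make the last point concrete, the following toy simulation (constructed for this summary, not taken from the paper) builds two structural causal models that induce exactly the same observational distribution over the action X and the reward Y, yet assign different reward distributions to the same behavior-cloning policy once the unobserved confounder U no longer drives the action.

```python
# Toy illustration (not from the paper): two SCMs with identical observational
# distributions over (X, Y) but different rewards under an imitating policy.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def model_a(x=None):
    """U is an unobserved confounder; the expert sets X = U and Y rewards X == U."""
    u = rng.integers(0, 2, N)
    x = u if x is None else x            # expert behavior vs. imposed policy
    return x, (x == u).astype(int)

def model_b(x=None):
    """Observationally indistinguishable from model A, but Y ignores U entirely."""
    u = rng.integers(0, 2, N)
    x = u if x is None else x
    return x, np.ones(N, dtype=int)

# Observational regime: both models give P(X=1) = 0.5 and P(Y=1) = 1.
for name, model in [("A", model_a), ("B", model_b)]:
    x, y = model()
    print(f"model {name}, observational: P(X=1)={x.mean():.2f}, P(Y=1)={y.mean():.2f}")

# Behavior-cloning policy pi: draw X ~ Bernoulli(0.5), matching the observed marginal.
pi_x = rng.integers(0, 2, N)
for name, model in [("A", model_a), ("B", model_b)]:
    _, y = model(x=pi_x)
    print(f"model {name}, do(pi):        P(Y=1)={y.mean():.2f}")
# Model A drops to about 0.5 while model B stays at 1.0, so the imitator's
# performance cannot be read off the observational distribution alone.
```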
Highlights
  • A unifying theme of Artificial Intelligence is to learn a policy from observations in an unknown environment such that a suitable level of performance is achieved [33, Ch. 1.1]
  • We introduce a complete graphical criterion for determining the feasibility of imitation from demonstration data and qualitative knowledge about the data-generating process, represented as a causal graph (a minimal graphical check is sketched after this list).
  • We introduce optimization procedures to solve for an imitating policy at Step 5 of the IMITATE algorithm.
  • We investigate imitation learning within the semantics of structural causal models.
  • We provide a graphical criterion that is complete for determining the feasibility of learning an imitating policy that mimics the expert’s performance
  • An efficient algorithm is introduced that finds an imitating policy by exploiting quantitative knowledge contained in the observational data and the presence of surrogate endpoints.
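As a rough illustration of the graphical reasoning behind the criterion above, the sketch below tests whether a candidate covariate set Z blocks every backdoor path from the action X to the reward Y, i.e., whether X and Y are d-separated by Z once X’s outgoing edges are removed. The graph, the variable names, and the exact form of the condition are illustrative assumptions for this summary; the paper states the π-backdoor criterion and the IMITATE algorithm formally.

```python
# Minimal sketch of a backdoor-style admissibility check (illustrative only).
import networkx as nx

def d_separated(G: nx.DiGraph, xs, ys, zs) -> bool:
    """d-separation test via the moralized ancestral graph."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    # 1. Restrict to X, Y, Z and their ancestors.
    relevant = xs | ys | zs
    for v in xs | ys | zs:
        relevant |= nx.ancestors(G, v)
    H = G.subgraph(relevant)
    # 2. Moralize: marry co-parents, then drop edge directions.
    M = nx.Graph(H.to_undirected())
    for v in H.nodes:
        parents = list(H.predecessors(v))
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                M.add_edge(parents[i], parents[j])
    # 3. Remove Z; X and Y are d-separated iff they are now disconnected.
    M.remove_nodes_from(zs)
    return not any(nx.has_path(M, x, y)
                   for x in xs for y in ys if x in M and y in M)

def backdoor_admissible(G: nx.DiGraph, x, y, Z) -> bool:
    """Does Z block all backdoor paths from x to y? (Check in G minus x's outgoing edges.)"""
    G_bar = G.copy()
    G_bar.remove_edges_from(list(G.out_edges(x)))
    return d_separated(G_bar, {x}, {y}, set(Z))

# Example graph: unobserved U -> Z -> X -> Y with U -> Y.
G = nx.DiGraph([("U", "Z"), ("Z", "X"), ("X", "Y"), ("U", "Y")])
print(backdoor_admissible(G, "X", "Y", set()))   # False: X <- Z <- U -> Y is open
print(backdoor_admissible(G, "X", "Y", {"Z"}))   # True: conditioning on Z blocks it
```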
Methods
  • The authors demonstrate the algorithms on several synthetic datasets, including highD [18], which consists of naturalistic trajectories of human-driven vehicles, and on MNIST digits.
  • The authors test the causal imitation method: Thm. 2 is applied when there exists a π-backdoor admissible set; otherwise, Alg. 1 is used to leverage the observational distribution.
  • The authors found that the algorithms consistently imitate the distribution over the expert’s reward in imitable (p-imitable) cases, and that p-imitable instances commonly exist.
  • The authors obtain policies for the causal and naive imitators by training two separate GANs. The distributions P(y | do(π)) induced by all algorithms are reported in Fig. 4b (a simplified version of this comparison is sketched below).
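The sketch below mirrors this comparison on a hand-made discrete SCM, with plain frequency estimates standing in for the GAN training used in the paper (an assumption made for brevity). The causal imitator conditions on the backdoor-admissible covariate Z from the previous sketch, setting π(x | z) = P(x | z), while the naive imitator only matches the action marginal P(x).

```python
# Simplified causal-vs-naive imitation comparison (illustrative assumptions:
# a hand-made discrete SCM; frequency estimates instead of GANs).
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

def nature():
    """Exogenous state: unobserved U and an observed covariate Z influenced by U."""
    u = rng.integers(0, 2, N)
    z = (u ^ (rng.random(N) < 0.1)).astype(int)   # Z is a noisy copy of U
    return u, z

def reward(x, u):
    return (x == u).astype(int)                    # Y = 1 iff the action matches U

# Expert demonstrations: the expert acts on Z (X = Z); the reward depends on U.
u, z = nature()
x_expert = z
y_expert = reward(x_expert, u)

# Causal imitator: estimate P(X=1 | Z=z) from the demonstrations.
p_x1_given_z = np.array([x_expert[z == v].mean() for v in (0, 1)])
# Naive imitator: estimate only the marginal P(X=1).
p_x1 = x_expert.mean()

# Deploy both policies in a fresh environment and compare reward distributions.
u_new, z_new = nature()
x_causal = (rng.random(N) < p_x1_given_z[z_new]).astype(int)
x_naive = (rng.random(N) < p_x1).astype(int)

print(f"expert          P(Y=1) = {y_expert.mean():.3f}")
print(f"causal imitator P(Y=1) = {reward(x_causal, u_new).mean():.3f}")
print(f"naive imitator  P(Y=1) = {reward(x_naive, u_new).mean():.3f}")
# The causal imitator recovers the expert's reward rate (about 0.9), while the
# naive imitator, which ignores Z, only reaches about 0.5.
```

This sketch only covers the π-backdoor case; in p-imitable instances the paper’s procedure instead exploits the full observational distribution and surrogate endpoints.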
Conclusion
  • The goal is to find an imitating policy that mimics the expert’s behavior from combinations of demonstration data and qualitative knowledge about the data-generating process, represented as a causal diagram.
  • The authors provide a graphical criterion that is complete for determining the feasibility of learning an imitating policy that mimics the expert’s performance.
  • An efficient algorithm is introduced that finds an imitating policy by exploiting quantitative knowledge contained in the observational data and the presence of surrogate endpoints.
  • The authors propose a practical procedure for estimating such an imitating policy from observed trajectories of the expert’s demonstrations.
Summary
  • Introduction: A unifying theme of Artificial Intelligence is to learn a policy from observations in an unknown environment such that a suitable level of performance is achieved [33, Ch. 1.1].
  • Algorithms and estimation methods have been developed to solve this problem [29, 36, 3, 6, 35, 32, 44].
  • In many applications, it is not clear which performance measure the demonstrator is optimizing.
  • The reward signal is neither labeled nor accessible in the observed expert’s trajectories.
  • In such settings, the performance of candidate policies is not uniquely discernible from the observational data due to latent outcomes, even when infinitely many samples are gathered, which complicates efforts to learn a policy with satisfactory performance.
  • Objectives: The authors’ goal is to learn an efficient policy to decide the value of an action variable X ∈ O.
  • Methods: The authors demonstrate the algorithms on several synthetic datasets, including highD [18], which consists of naturalistic trajectories of human-driven vehicles, and on MNIST digits.
  • The authors test the causal imitation method: Thm. 2 is applied when there exists a π-backdoor admissible set; otherwise, Alg. 1 is used to leverage the observational distribution.
  • The authors found that the algorithms consistently imitate the distribution over the expert’s reward in imitable (p-imitable) cases, and that p-imitable instances commonly exist.
  • The authors obtain policies for the causal and naive imitators by training two separate GANs. The distributions P(y | do(π)) induced by all algorithms are reported in Fig. 4b.
  • Conclusion: The goal is to find an imitating policy that mimics the expert’s behavior from combinations of demonstration data and qualitative knowledge about the data-generating process, represented as a causal diagram.
  • The authors provide a graphical criterion that is complete for determining the feasibility of learning an imitating policy that mimics the expert’s performance.
  • An efficient algorithm is introduced that finds an imitating policy by exploiting quantitative knowledge contained in the observational data and the presence of surrogate endpoints.
  • The authors propose a practical procedure for estimating such an imitating policy from observed trajectories of the expert’s demonstrations.
Funding
  • The authors were partially supported by grants from NSF IIS-1704352 and IIS-1750807 (CAREER).
References
  • P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.
  • B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
  • E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113:7345–7352, 2016.
  • A. Billard, S. Calinon, R. Dillmann, and S. Schaal. Survey: Robot programming by demonstration. In Springer Handbook of Robotics, chapter 59, 2008.
  • J. Correa and E. Bareinboim. From statistical transportability to estimating the effect of stochastic interventions. In S. Kraus, editor, Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 1661–1667, Macao, China, 2019. International Joint Conferences on Artificial Intelligence Organization.
  • J. Correa and E. Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.
  • P. de Haan, D. Jayaraman, and S. Levine. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, pages 11693–11704, 2019.
  • J. Etesami and P. Geiger. Causal transfer for imitation learning and decision making under sensor-shift. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • O. Goudet, D. Kalainathan, P. Caillou, I. Guyon, D. Lopez-Paz, and M. Sebag. Causal generative neural networks. arXiv:1711.08936 [stat], Nov. 2017.
  • O. Goudet, D. Kalainathan, P. Caillou, D. Lopez-Paz, I. Guyon, M. Sebag, A. Tritas, and P. Tubaro. Learning functional causal models with generative neural networks. arXiv:1709.05321 [stat], Sept. 2017.
  • R. D. Hjelm, A. P. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio. Boundary-seeking generative adversarial networks. arXiv:1702.08431 [cs, stat], Feb. 2018.
  • Y. Huang and M. Valtorta. Pearl’s calculus of intervention is complete. In R. Dechter and T. Richardson, editors, Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 217–224. AUAI Press, Corvallis, OR, 2006.
  • A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.
  • J. Zhang, D. Kumor, and E. Bareinboim. Causal imitation learning with unobserved confounders. Technical Report R-66, Causal AI Lab, Columbia University, 2020.
  • M. Kocaoglu, C. Snyder, A. G. Dimakis, and S. Vishwanath. CausalGAN: Learning causal implicit generative models with adversarial training. arXiv:1709.02023 [cs, math, stat], Sept. 2017.
  • D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
  • R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein. The highD dataset: A drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2118–2125, 2018.
  • S. Lee and E. Bareinboim. Structural causal bandits with non-manipulable variables. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 4164–4172, Honolulu, Hawaii, 2019. AAAI Press.
  • S. Lee and E. Bareinboim. Causal effect identifiability under partial-observability. In Proceedings of the 37th International Conference on Machine Learning (ICML-20), 2020.
  • C. Louizos, U. Shalit, J. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal effect inference with deep latent-variable models. arXiv:1705.08821 [cs, stat], May 2017.
  • J. Mahler and K. Goldberg. Learning deep policies for robot bin picking by simulating robust grasping sequences. In Conference on Robot Learning, pages 515–524, 2017.
  • U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun. Off-road obstacle avoidance through end-to-end learning. In Advances in Neural Information Processing Systems, pages 739–746, 2006.
  • K. Mülling, J. Kober, O. Kroemer, and J. Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.
  • A. Y. Ng, S. J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, pages 663–670, 2000.
  • X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, Nov. 2010.
  • S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv:1606.00709 [cs, stat], June 2016.
  • T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, J. Peters, et al. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018.
  • J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000. 2nd edition, 2009.
  • J. Pearl. Remarks on the method of propensity scores. Statistics in Medicine, 28:1415–1416, 2009. http://ftp.cs.ucla.edu/pub/stat_ser/r345-sim.pdf.
  • D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.
  • P. Rosenbaum and D. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55, 1983.
  • S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 2002.
  • I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pages 1219–1226, 2006.
  • I. Shpitser and E. Sherman. Identification of personalized effects associated with causal pathways. In UAI, 2018.
  • P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.
  • R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems, pages 1449–1456, 2008.
  • J. Tian. Studies in Causal Reasoning and Learning. PhD thesis, Computer Science Department, University of California, Los Angeles, CA, November 2002.
  • J. Tian. Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 554–561, 2008.
  • J. Tian and J. Pearl. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 567–573, Menlo Park, CA, 2002. AAAI Press/The MIT Press.
  • J. Tian and J. Pearl. A general identification condition for causal effects. Technical Report R-290-A, Department of Computer Science, University of California, Los Angeles, CA, 2003.
  • B. van der Zander, M. Liskiewicz, and J. Textor. Constructing separators and adjustment sets in ancestral graphs. In Proceedings of the UAI 2014 Conference on Causal Inference: Learning and Prediction, volume 1274, pages 11–24, 2014.
  • C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
  • B. Widrow. Pattern-recognizing control systems. Computer and Information Sciences, 1964.
  • B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438, Chicago, IL, USA, 2008.