Causal Imitation Learning With Unobserved Confounders
NeurIPS 2020.
Abstract:
One of the common ways children learn is by mimicking adults. Imitation learning focuses on learning policies with suitable performance from demonstrations generated by an expert, with an unspecified performance measure and an unobserved reward signal. Popular methods for imitation learning start by either directly mimicking the behavior policy …
Code:
Data:
Introduction
- A unifying theme of Artificial Intelligence is to learn a policy from observations in an unknown environment such that a suitable level of performance is achieved [33, Ch. 1.1].
- Algorithms and estimation methods have been developed to solve this problem [29, 36, 3, 6, 35, 32, 44]
- In many applications, it is not clear which performance measure the demonstrator is optimizing.
- The reward signal is neither labeled nor accessible in the observed expert trajectories
- In such settings, the performance of candidate policies is not uniquely discernible from the observational data due to latent outcomes, even when infinitely many samples are gathered, complicating efforts to learn a policy with satisfactory performance (the toy simulation below illustrates this)
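To make the last point concrete, the following toy simulation (our own construction, not an example from the paper) shows how an expert who acts on an unobserved covariate U can match a naive behavioral-cloning imitator on every observable action statistic while achieving a very different latent reward:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Unobserved confounder U: visible to the expert, hidden from the imitator.
u = rng.integers(0, 2, size=n)

# Expert policy: acts on U directly (x = u); latent reward y = 1 if x == u.
x_expert = u
y_expert = (x_expert == u).astype(float)

# Naive imitator: matches only the observable marginal P(X), ignoring U.
p_x1 = x_expert.mean()
x_naive = rng.binomial(1, p_x1, size=n)
y_naive = (x_naive == u).astype(float)

print(f"P(X=1): expert {x_expert.mean():.3f}, imitator {x_naive.mean():.3f}")
print(f"E[Y]:   expert {y_expert.mean():.3f}, imitator {y_naive.mean():.3f}")
# Action marginals agree (~0.5), yet the expert earns reward 1.0 while the
# imitator earns ~0.5: the latent reward is not discernible from P(X) alone.
```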
Highlights
- A unifying theme of Artificial Intelligence is to learn a policy from observations in an unknown environment such that a suitable level of performance is achieved [33, Ch. 1.1]
- (1) We introduce a complete graphical criterion for determining the feasibility of imitation from demonstration data and qualitative knowledge about the data-generating process represented as a causal graph
- We introduce optimization procedures to solve for an imitating policy at Step 5 of the IMITATE algorithm
- We investigate imitation learning within the semantics of structural causal models
- We provide a graphical criterion that is complete for determining the feasibility of learning an imitating policy that mimics the expert’s performance
- An efficient algorithm is introduced that finds an imitating policy by exploiting quantitative knowledge contained in the observational data and the presence of surrogate endpoints (a worked sketch of the underlying adjustment identity follows this list)
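To spell out how such a criterion licenses imitation, here is a short derivation of the adjustment identity behind the π-backdoor case (Thm. 2 of the paper); the step-by-step layout and the exact assumptions stated in the comments are our reading, not a quotation of the paper:

```latex
% Sketch: imitation via a \pi-backdoor admissible set Z.
% Assumptions: Z is observable to the policy, contains no descendants of X,
% and blocks every backdoor path from X to the latent reward Y,
% i.e. (Y \perp X \mid Z) in G_{\underline{X}}, so P(y \mid do(x), z) = P(y \mid x, z).
\begin{align*}
P(y \mid do(\pi))
  &= \sum_{x,z} P(y \mid do(x), z)\, \pi(x \mid z)\, P(z)
     && \text{(policy reads only $Z$, which is unaffected by $X$)} \\
  &= \sum_{x,z} P(y \mid x, z)\, \pi(x \mid z)\, P(z)
     && \text{(backdoor adjustment)} \\
  &= \sum_{x,z} P(y \mid x, z)\, P(x \mid z)\, P(z)
     && \text{(choose $\pi(x \mid z) = P(x \mid z)$)} \\
  &= \sum_{x,z} P(y, x, z) \;=\; P(y),
\end{align*}
% so the imitator's reward distribution coincides with the expert's.
```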
Methods
- The authors demonstrate the algorithms on several synthetic datasets, including highD [18], which consists of naturalistic trajectories of human-driven vehicles, and on MNIST digits.
- The authors test the causal imitation method: Thm. 2 is applied when there exists a π-backdoor admissible set; otherwise, Alg. 1 is used to leverage the observational distribution.
- The authors found that the algorithms consistently imitate distributions over the expert's reward in imitable (and p-imitable) cases, and that p-imitable instances commonly exist.
- The authors obtain policies for the causal and naive imitators by training two separate GANs. The distributions P(y | do(π)) induced by all algorithms are reported in Fig. 4b (a minimal sketch of the π-backdoor policy estimate follows this list).
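As a minimal sketch of the π-backdoor case used above (ours, not the authors' code; the helper names fit_imitating_policy and act are hypothetical), the imitating policy can be read off the demonstration data as the empirical conditional P̂(x | z); the paper's GAN-based imitators replace this tabular estimate with learned conditional generators for high-dimensional settings:

```python
from collections import Counter, defaultdict
import random

def fit_imitating_policy(demos):
    """Estimate pi(x | z) = P_hat(x | z) from expert demonstrations, where z is an
    assignment to a pi-backdoor admissible set Z and x is the expert's action."""
    counts = defaultdict(Counter)
    for z, x in demos:
        counts[z][x] += 1
    return counts

def act(policy, z, rng=random):
    """Sample an action from pi(. | z); unseen contexts fall back to the marginal P_hat(x)."""
    counter = policy.get(z) or sum(policy.values(), Counter())
    actions, weights = zip(*counter.items())
    return rng.choices(actions, weights=weights, k=1)[0]

# Toy usage: Z in {0, 1}; the expert chooses x = z with probability 0.9.
rng = random.Random(0)
demos = [(z, z if rng.random() < 0.9 else 1 - z)
         for z in (rng.randint(0, 1) for _ in range(10_000))]
policy = fit_imitating_policy(demos)
print({z: c[1] / sum(c.values()) for z, c in policy.items()})  # approx. pi(X=1 | z)
print(act(policy, 1, rng))                                     # sample one action
```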
Conclusion
- The goal is to find an imitating policy that mimics the expert behaviors from combinations of demonstration data and qualitative knowledge about the data-generating process represented as a causal diagram.
- The authors provide a graphical criterion that is complete for determining the feasibility of learning an imitating policy that mimics the expert’s performance.
- An efficient algorithm is introduced that finds an imitating policy by exploiting quantitative knowledge contained in the observational data and the presence of surrogate endpoints.
- The authors propose a practical procedure for estimating such an imitating policy from observed trajectories of the expert's demonstrations.
Summary
Objectives:
- The authors' goal is to learn an efficient policy that decides the value of an action variable X ∈ O.
Funding
- The authors were partially supported by grants from NSF IIS-1704352 and IIS-1750807 (CAREER)
References
- P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1, 2004.
- B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
- E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113:7345–7352, 2016.
- A. Billard, S. Calinon, R. Dillmann, and S. Schaal. Survey: Robot programming by demonstration. Handbook of Robotics, 2008.
- J. Correa and E. Bareinboim. From statistical transportability to estimating the effect of stochastic interventions. In S. Kraus, editor, Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 1661–1667, Macao, China, 2019. International Joint Conferences on Artificial Intelligence Organization.
- J. Correa and E. Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.
- P. de Haan, D. Jayaraman, and S. Levine. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, pages 11693–11704, 2019.
- J. Etesami and P. Geiger. Causal transfer for imitation learning and decision making under sensor-shift. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- O. Goudet, D. Kalainathan, P. Caillou, I. Guyon, D. Lopez-Paz, and M. Sebag. Causal Generative Neural Networks. arXiv:1711.08936 [stat], Nov. 2017.
- O. Goudet, D. Kalainathan, P. Caillou, D. Lopez-Paz, I. Guyon, M. Sebag, A. Tritas, and P. Tubaro. Learning Functional Causal Models with Generative Neural Networks. arXiv:1709.05321 [stat], Sept. 2017.
- R. D. Hjelm, A. P. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio. Boundary-Seeking Generative Adversarial Networks. arXiv:1702.08431 [cs, stat], Feb. 2018.
- Y. Huang and M. Valtorta. Pearl’s calculus of intervention is complete. In R. Dechter and T. Richardson, editors, Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 217–224. AUAI Press, Corvallis, OR, 2006.
- A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.
- E. B. Junzhe Zhang, Daniel Kumor. Causal imitation learning with unobserved confounders. Technical Report R-66, Causal AI Lab, Columbia University., 2020.
- M. Kocaoglu, C. Snyder, A. G. Dimakis, and S. Vishwanath. CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training. arXiv:1709.02023 [cs, math, stat], Sept. 2017.
- D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
- R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein. The highd dataset: A drone dataset of naturalistic vehicle trajectories on german highways for validation of highly automated driving systems. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2118–2125, 2018.
- S. Lee and E. Bareinboim. Structural causal bandits with non-manipulable variables. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 4164–4172, Honolulu, Hawaii, 2019. AAAI Press.
- S. Lee and E. Bareinboim. Causal effect identifiability under partial-observability. In Proceedings of the 37th International Conference on Machine Learning (ICML-20), 2020.
- C. Louizos, U. Shalit, J. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal Effect Inference with Deep Latent-Variable Models. arXiv:1705.08821 [cs, stat], May 2017.
- J. Mahler and K. Goldberg. Learning deep policies for robot bin picking by simulating robust grasping sequences. In Conference on robot learning, pages 515–524, 2017.
- U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun. Off-road obstacle avoidance through end-to-end learning. In Advances in neural information processing systems, pages 739–746, 2006.
- K. Mülling, J. Kober, O. Kroemer, and J. Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.
- A. Y. Ng, S. J. Russell, et al. Algorithms for inverse reinforcement learning. In Icml, volume 1, pages 663–670, 2000.
- X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, Nov. 2010.
- S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv:1606.00709 [cs, stat], June 2016.
- T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, J. Peters, et al. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018.
- J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000. 2nd edition, 2009.
- J. Pearl. Remarks on the method of propensity scores. Statistics in Medicine, 28:1415–1416, 2009. <http://ftp.cs.ucla.edu/pub/stat_ser/r345-sim.pdf>.
- D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.
- P. Rosenbaum and D. Rubin. The central role of propensity score in observational studies for causal effects. Biometrika, 70:41–55, 1983.
- S. Russell and P. Norvig. Artificial intelligence: a modern approach. 2002.
- I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pages 1219–1226, 2006.
- I. Shpitser and E. Sherman. Identification of personalized effects associated with causal pathways. In UAI, 2018.
- P. Spirtes, C. N. Glymour, and R. Scheines. Causation, prediction, and search. MIT press, 2000.
- R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 1998.
- U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in neural information processing systems, pages 1449–1456, 2008.
- J. Tian. Studies in Causal Reasoning and Learning. PhD thesis, Computer Science Department, University of California, Los Angeles, CA, November 2002.
- J. Tian. Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 554–561, 2008.
- J. Tian and J. Pearl. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 567–573, Menlo Park, CA, 2002. AAAI Press/The MIT Press.
- J. Tian and J. Pearl. A general identification condition for causal effects. Technical Report R-290-A, Department of Computer Science, University of California, Los Angeles, CA, 2003.
- B. van der Zander, M. Liskiewicz, and J. Textor. Constructing separators and adjustment sets in ancestral graphs. In Proceedings of the UAI 2014 Conference on Causal Inference: Learning and Prediction-Volume 1274, pages 11–24, 2014.
- C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- B. Widrow. Pattern-recognizing control systems. Computer and Information Sciences, 1964.
- B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.