Fighting Copycat Agents in Behavioral Cloning from Observation Histories
NeurIPS 2020.
Abstract:
Imitation learning trains policies to map from input observations to the actions that an expert would choose. In this setting, distribution shift frequently exacerbates the effect of misattributing expert actions to nuisance correlates among the observed variables. We observe that a common instance of this causal confusion occurs in partially…
Introduction
- Imitation learning is a simple, yet powerful paradigm for learning complex behaviors from expert demonstrations, with many successful applications ranging from autonomous driving to natural language generation [35, 40, 26, 28, 8, 16, 52].
- Several prior works have reported that imitation from observation histories sometimes performs worse than imitation from a single frame alone [51, 26, 6].
- To illustrate why this happens, consider the sequence of actions in an expert driving demonstration as the car starts to move in response to a red traffic light turning green (Figure 1).
- The authors will sometimes refer to observation histories simply as observations, for brevity.
Highlights
- Imitation learning is a simple, yet powerful paradigm for learning complex behaviors from expert demonstrations, with many successful applications ranging from autonomous driving to natural language generation [35, 40, 26, 28, 8, 16, 52]
- We focus on a more specific, but widely prevalent form of causal confusion, where the previous action is the nuisance correlate in imitation learning from observation histories — the copycat problem
- Does our method improve performance over baseline approaches for behavioral cloning from observation histories?
- While the RNN hidden state can be thought of as implementing a natural information bottleneck, we find that it performs comparably to or worse than feedforward behavioral cloning from observation histories (BC-OH)
- While the Target-Conditioned Adversary (TCA) policy cannot predict the current action a_t as well as BC-OH, its performance is significantly better than the unconditional adversarial setting, indicating that target-conditioning effectively preserves more information about the current action (see the sketch after this list)
- We identify the copycat problem that commonly afflicts imitation policies learning from histories of observations
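To make the target-conditioning idea concrete, below is a minimal sketch (our assumption, not the authors' released code): a gradient-reversal adversary tries to recover the previous action a_{t-1} from the policy's features together with the target action a_t, so the encoder is pushed to discard only the a_{t-1} information that a_t does not already explain. All module sizes, the MSE losses, and the weight lam are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

H, OBS_DIM, ACT_DIM, FEAT_DIM = 4, 11, 3, 128   # hypothetical sizes

encoder = nn.Linear(H * OBS_DIM, FEAT_DIM)      # observation history -> features z
policy_head = nn.Linear(FEAT_DIM, ACT_DIM)      # z -> predicted current action a_t
adversary = nn.Sequential(                      # (z, a_t) -> predicted previous action a_{t-1}
    nn.Linear(FEAT_DIM + ACT_DIM, 64), nn.ReLU(),
    nn.Linear(64, ACT_DIM),
)

def training_loss(history, a_t, a_prev, lam=1.0):
    z = encoder(history)
    bc = nn.functional.mse_loss(policy_head(z), a_t)         # imitation (BC) loss
    z_rev = GradReverse.apply(z, lam)                        # reversed grads flow into encoder
    a_prev_hat = adversary(torch.cat([z_rev, a_t], dim=-1))  # adversary also sees the target a_t
    adv = nn.functional.mse_loss(a_prev_hat, a_prev)
    # One backward pass: the adversary minimizes adv, while gradient reversal
    # makes the encoder maximize it, i.e. hide a_{t-1} beyond what a_t explains.
    return bc + adv
```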
Methods
- The authors conduct experiments to evaluate the method against the following baselines.
- Behavioral cloning baselines: BC-SO (single observation), BC-OH (observation history), and an RNN variant.
- BC-SO is naive behavioral cloning (Sec 3) with H = 1, which does not suffer from the copycat problem, since it cannot infer the previous action.
- BC-OH allows the agent to access more of the state information necessary for optimal action selection, but it is prone to the copycat problem.
- The authors train BC-OH agents both with stacked inputs to a feedforward policy and with sequential inputs to an RNN policy (see the input-construction sketch below).
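As a concrete illustration of these baselines, here is a minimal sketch (our assumptions: PyTorch, MSE regression, and illustrative dimensions) of how BC-OH inputs can be built by stacking the H most recent observations; BC-SO is the same recipe with H = 1.

```python
import torch
import torch.nn as nn

H, OBS_DIM, ACT_DIM = 4, 11, 3   # hypothetical dimensions

def stack_history(observations, t, H=H):
    """Concatenate the H most recent observations o_{t-H+1}, ..., o_t
    (repeating the first frame at the start of an episode)."""
    frames = [observations[max(t - i, 0)] for i in reversed(range(H))]
    return torch.cat(frames, dim=-1)

# BC-OH: a feedforward policy over the stacked history; for BC-SO set H = 1.
policy = nn.Sequential(
    nn.Linear(H * OBS_DIM, 256), nn.ReLU(),
    nn.Linear(256, ACT_DIM),
)

def bc_loss(histories, expert_actions):
    """Plain behavioral cloning: regress the expert's action from the input."""
    return nn.functional.mse_loss(policy(histories), expert_actions)
```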
Results
- BC-OH, with observation histories, helps to varying extents in five of the six environments, but still performs much worse than the RL expert that generated the imitation trajectories (a sketch of the Table 1 copycat diagnostic follows this list).
- DropoutBC [6] was originally proposed and evaluated in a setting where the nuisance correlate corresponded to a single dimension of the input; this does not hold in our settings, where the nuisance variable is a function of the high-dimensional past observations.
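The copycat tendency reported in Tables 1 and 5 can be probed with a simple diagnostic; the sketch below is our assumption of such a protocol, not the authors' exact code: regress the policy's chosen action a_t on the previous action a_{t-1} and report the MSE, where lower error means the actions are more predictable from repetition alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def copycat_mse(prev_actions, actions):
    """MSE of predicting a_t from a_{t-1} alone; lower = more copycat-like."""
    probe = LinearRegression().fit(prev_actions, actions)
    residuals = probe.predict(prev_actions) - actions
    return float(np.mean(residuals ** 2))

# Usage: roll out each policy, log (a_{t-1}, a_t) pairs as arrays of shape
# (N, action_dim), then compare copycat_mse across policies
# (e.g. BC-OH versus the proposed method).
```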
Conclusion
- The authors identify the copycat problem that commonly afflicts imitation policies learning from histories of observations.
- The authors systematically study this phenomenon by carefully designing a set of diagnostic experiments, which shows the existence of this problem in multiple environments.
- Causal confusion in image-based control remains an open question, and the authors hope to address these more realistic scenarios in future work.
Tables
- Table 1: MSE for next-action prediction, conditioned on previous actions. The lower the error for a policy, the higher its tendency to generate actions that can be predicted from previous actions alone
- Table 2: Cumulative reward per episode in partially observed (PO) environments. The top half of the table shows results in our offline imitation setting. The lower half shows methods that additionally interact with the environment, including accessing reinforcement learning rewards and queryable experts. CCIL cannot run on Ant and Humanoid because of their high-dimensional observations
- Table 3: Mean squared error for predicting the previous action a_{t-1} from various features
- Table 4: BC mean squared error (for predicting the next action a_t) on test data
- Table 5: As in Table 1, the predictability of the next action conditioned on past actions, for our method and the BC-OH policy. Our method reduces the repetition of actions over time
Related work
- Imitation Learning. Imitation learning [30, 2], first proposed by Widrow and Smith in 1964 [53], is a powerful learning paradigm that enables the learning of complex behaviors from demonstrations. We focus on the widely used behavioral cloning paradigm [35, 40, 26, 28, 8, 16], which suffers from a well-known problem: small errors in the learned policy compound over time, leading quickly to states outside the training demonstration distribution, where performance deteriorates. One solution is to assume access to a queryable expert who prescribes actions in the new states encountered by the policy, as in the widely-used DAGGER algorithm [39] and others [46, 25, 47]. Another well-studied alternative is to refine the policy through environment interaction [21, 9].
de Haan et al. [12] explicitly connected distributional shift problems in imitation settings to nuisance correlations between input variables and expert actions, identifying the "causal confusion" problem. We isolate this causal confusion problem in its most frequently occurring form, the copycat problem motivated in Sec 1, encountered by ML practitioners within imitation learning [26, 6, 11, 51] and, elsewhere, in the guise of "feedback loops" [42, 4]. We demonstrate a scalable solution to the copycat problem.
References
- Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
- Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
- Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Hk4_qw5xe.
- Drew Bagnell. Feedback in machine learning, 2016. URL https://www.youtube.com/watch?v=XRSvz4UOpo4.
- Bram Bakker. Reinforcement learning by backpropagation through an LSTM model/critic. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 127–134. IEEE, 2007.
- Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018.
- Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations, 2020.
- Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016.
- Kiante Brantley, Wen Sun, and Mikael Henaff. Disagreement-Regularized Imitation Learning. International Conference in Learning Representations, pages 1–19, 2020.
- Peter Bühlmann. Invariance, causality and robustness. arXiv preprint arXiv:1812.08233, 2018.
- Felipe Codevilla, Eder Santana, Antonio M López, and Adrien Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 9329–9338, 2019.
- Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, pages 11693–11704, 2019.
- Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
- Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 1180–1189. JMLR.org, 2015.
- Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. arXiv preprint arXiv:2004.07780, 2020.
- Alessandro Giusti, Jérôme Guzzi, Dan C Ciresan, Fang-Lin He, Juan P Rodríguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661–667, 2015.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- Anirudh Goyal, Alex Lamb, Shagun Sodhani, Jordan Hoffmann, Sergey Levine, Yoshua Bengio, and Bernhard Scholkopf. Recurrent independent mechanisms, 2020. URL https://openreview.net/forum?id=BylaUTNtPS.
- Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.
- Christina Heinze-Deml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness. arXiv preprint arXiv:1710.11469, 2017.
- Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4565–4573. Curran Associates, Inc., 2016.
- Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
- Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
- Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, 2014.
- Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. Dart: Noise injection for robust imitation learning. In Conference on Robot Learning, pages 143–156, 2017.
- Yann LeCun, Urs Muller, Jan Ben, Eric Cosatto, and Beat Flepp. Off-road obstacle avoidance through end-to-end learning. Advances in Neural Information Processing Systems, pages 739–746, 2005. ISSN 10495258.
- Nicolai Meinshausen. Causality from a distributional robustness point of view. In 2018 IEEE Data Science Workshop (DSW), pages 6–10. IEEE, 2018.
- Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.
- Austin Nichols. Causal inference with observational data. The Stata Journal, 7(4):507–541, 2007.
- T Osa, J Pajarinen, G Neumann, JA Bagnell, P Abbeel, and J Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018.
- Vincent Pacelli and Anirudha Majumdar. Learning task-driven control policies via information bottlenecks. arXiv preprint arXiv:2002.01428, 2020.
- Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Schölkopf. Learning independent causal mechanisms. 35th International Conference on Machine Learning, ICML 2018, 9(1):6432–6442, 2018.
- Judea Pearl. Causality: Models, reasoning, and inference, second edition. Cambridge University Press, 2011. ISBN 9780511803161. doi: 10.1017/CBO9780511803161.
- Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, November 2017.
- Dean A Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.
- Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
- Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pages 5331–5340, 2019.
- Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286, Beijing, China, 22–24 Jun 2014. PMLR.
- Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. Journal of Machine Learning Research, 15:627–635, 2011. ISSN 15324435.
- Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.
- John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015.
- David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop). IEEE, 2014.
- Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. CoRR, abs/1610.04490, 2016. URL http://arxiv.org/abs/1610.04490.
- Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT press, 2000.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
- Wen Sun, Arun Venkatraman, Geoffrey J. Gordon, Byron Boots, and J. Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. 34th International Conference on Machine Learning, ICML 2017, 7:5090–5108, 2017.
- Wen Sun, J. Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Deep combination of reinforcement and imitation. In International Conference on Learning Representations, 2018.
- Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
- Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
- Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
- Dequan Wang, Coline Devin, Qi-Zhi Cai, Philipp Krähenbühl, and Trevor Darrell. Monocular plan view networks for autonomous driving. In IROS, 2019.
- Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho. Non-monotonic sequential text generation. arXiv preprint arXiv:1902.02192, 2019.
- Bernard Widrow and Fred W Smith. Pattern-recognizing control systems, 1964.