Fighting Copycat Agents in Behavioral Cloning from Observation Histories

Chuan Wen
Jierui Lin
Dinesh Jayaraman
Yang Gao

NeurIPS 2020.

Abstract:

Imitation learning trains policies to map from input observations to the actions that an expert would choose. In this setting, distribution shift frequently exacerbates the effect of misattributing expert actions to nuisance correlates among the observed variables. We observe that a common instance of this causal confusion occurs in partially observed settings when expert actions are strongly correlated over time: the imitator learns to cheat by predicting the expert's previous action, rather than the next action. To combat this "copycat problem", we propose an adversarial approach that learns a feature representation removing excess information about the previous expert action while retaining the information necessary to predict the next action. In our experiments, our approach improves performance significantly across a variety of partially observed imitation learning tasks.

Introduction
  • Imitation learning is a simple yet powerful paradigm for learning complex behaviors from expert demonstrations, with many successful applications ranging from autonomous driving to natural language generation [35, 40, 26, 28, 8, 16, 52].
  • Several prior works have reported that imitation from observation histories sometimes performs worse than imitation from a single frame alone [51, 26, 6].
  • To illustrate why this happens, consider the sequence of actions in an expert driving demonstration as the car starts to move in response to a red traffic light turning green (Figure 1); a toy illustration follows this list.
  • The authors will sometimes refer to observation histories o simply as observations, for brevity.
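To make the copycat incentive concrete, here is a toy illustration (our own minimal sketch, not from the paper): when expert actions are strongly correlated over time, a policy that merely repeats the previous action is almost always "correct" on the demonstrations, even though it ignores the scene entirely.

```python
import numpy as np

# Toy "red light" demonstration: the expert brakes (action 0.0) for 50
# steps while the light is red, then accelerates (action 1.0) for 50
# steps once it turns green.
actions = np.array([0.0] * 50 + [1.0] * 50)

# A copycat policy that simply repeats the previous action is wrong at
# only one timestep (the moment the light changes), so it achieves
# near-perfect training accuracy without looking at the light at all.
copycat_preds = actions[:-1]                      # predict a_t = a_{t-1}
n_wrong = int((copycat_preds != actions[1:]).sum())
print(f"{n_wrong} mistake(s) out of {len(copycat_preds)} steps")  # 1 / 99
```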
Highlights
  • Imitation learning is a simple yet powerful paradigm for learning complex behaviors from expert demonstrations, with many successful applications ranging from autonomous driving to natural language generation [35, 40, 26, 28, 8, 16, 52]
  • We focus on a more specific, but widely prevalent form of causal confusion, where the previous action is the nuisance correlate in imitation learning from observation histories: the copycat problem
  • Does our method improve performance over baseline approaches for behavioral cloning from observation histories?
  • While the RNN hidden state can be thought of as implementing a natural information bottleneck, we find it performs comparably to or worse than the feedforward BC-OH (behavioral cloning from observation histories) policies
  • While the Target-Conditioned Adversary (TCA) cannot predict the current action a_t as well as BC-OH, its performance is significantly better than the unconditional adversarial setting, indicating that the target-conditioning effectively preserves more information about the next action (see the sketch after this list)
  • We identify the copycat problem that commonly afflicts imitation policies learning from histories of observations
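At a high level, the target-conditioned adversarial objective trains an encoder whose features predict the next action a_t while an adversary, conditioned on a_t, tries and fails to recover the previous action a_{t-1}. Below is a minimal sketch of that idea using a gradient reversal layer in the style of Ganin and Lempitsky [14]; the network shapes and names (encoder, policy_head, adversary, training_step) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses and scales gradients."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

obs_dim, act_dim, hist_len, feat_dim = 11, 3, 4, 64  # toy sizes

# Encoder maps a flattened observation history to a feature vector.
encoder = nn.Sequential(nn.Linear(obs_dim * hist_len, 128), nn.ReLU(),
                        nn.Linear(128, feat_dim))
# Policy head predicts the next action a_t from the features.
policy_head = nn.Linear(feat_dim, act_dim)
# Adversary tries to recover the previous action a_{t-1} from the
# features, conditioned on the target a_t (the "target-conditioning").
adversary = nn.Sequential(nn.Linear(feat_dim + act_dim, 128), nn.ReLU(),
                          nn.Linear(128, act_dim))

opt = torch.optim.Adam([*encoder.parameters(), *policy_head.parameters(),
                        *adversary.parameters()], lr=1e-3)
mse = nn.MSELoss()

def training_step(obs_hist, a_t, a_prev, lam=0.5):
    """obs_hist: (B, hist_len*obs_dim); a_t, a_prev: (B, act_dim)."""
    feat = encoder(obs_hist)
    bc_loss = mse(policy_head(feat), a_t)  # imitate the expert's a_t
    # Gradient reversal: the adversary minimizes this loss, while the
    # encoder receives the negated gradient and so maximizes it,
    # scrubbing excess a_{t-1} information from the features.
    adv_in = torch.cat([grad_reverse(feat, lam), a_t], dim=-1)
    adv_loss = mse(adversary(adv_in), a_prev)
    opt.zero_grad()
    (bc_loss + adv_loss).backward()
    opt.step()
    return bc_loss.item(), adv_loss.item()
```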
Methods
  • The authors conduct experiments to evaluate the method against the following baselines.
  • Behavioral cloning (BC-SO, BC-OH, and RNN).
  • BC-SO is naive behavioral cloning (Sec 3) with H = 1, which does not suffer from the copycat problem, since it cannot infer the previous action.
  • BC-OH (H > 1) allows the agent to access more of the state information necessary for optimal action selection, but it is prone to the copycat problem.
  • The authors train BC-OH agents both with stacked inputs to a feedforward policy, and with sequential inputs to an RNN policy (see the input-construction sketch after this list).
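The following sketch shows how the stacked-input variants differ; the helper name stack_history and the trajectory shapes are our illustrative assumptions, not the paper's code. Setting H = 1 recovers single-observation BC-SO, while H > 1 gives the BC-OH inputs.

```python
import numpy as np

def stack_history(obs_seq, H):
    """Turn a length-T trajectory of obs_dim vectors into (T - H + 1)
    stacked inputs, each concatenating the H most recent observations."""
    T = len(obs_seq)
    return np.stack([np.concatenate(obs_seq[t - H + 1:t + 1])
                     for t in range(H - 1, T)])

traj = [np.random.randn(11) for _ in range(100)]  # toy trajectory
bc_so_inputs = stack_history(traj, H=1)   # shape (100, 11): BC-SO
bc_oh_inputs = stack_history(traj, H=4)   # shape (97, 44): BC-OH
```

The RNN variant instead consumes the raw sequence one observation at a time, letting the hidden state summarize the history.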
Results
  • BC-OH, which uses observation histories, helps to varying extents in five out of six environments, but still performs much worse than the RL expert that generated the imitation trajectories.
  • DropoutBC [6] was originally proposed and evaluated in a setting where the nuisance correlate corresponded to a single dimension in the input (a sketch of the input-dropout idea follows this list).
  • This is not true in our settings, where the nuisance variable is a function of the high-dimensional past observations.
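For reference, here is a sketch of the general input-dropout idea behind such baselines, assuming a (B, H, obs_dim) history tensor; it is a hypothetical reimplementation, not the exact formulation from [6]. When the nuisance is a function of the full high-dimensional history rather than a single input dimension, masking like this may be insufficient, as the text notes.

```python
import torch

def dropout_past(obs_hist, p=0.5, training=True):
    """Randomly zero out entire past frames so the policy cannot rely on
    them exclusively. obs_hist: (B, H, obs_dim); the current frame
    (index -1) is always kept."""
    if not training:
        return obs_hist
    B, H, _ = obs_hist.shape
    keep = (torch.rand(B, H, 1) > p).float()
    keep[:, -1] = 1.0  # never drop the current observation
    return obs_hist * keep
```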
Conclusion
  • The authors identify the copycat problem that commonly afflicts imitation policies learning from histories of observations.
  • The authors systematically study this phenomenon by carefully designing a set of diagnostic experiments, which show the existence of this problem in multiple environments.
  • Causal confusion in image-based control remains an open question, and the authors hope to address these more realistic scenarios in future work.
Tables
  • Table1: MSE for next action prediction, conditioned on previous actions. The lower the error for a policy, the higher its tendency to generate actions that can be predicted from previous actions alone
  • Table2: Cumulative rewards per episode in partially observed (PO) environments. The top half of the table shows results in our offline imitation setting. The lower half shows methods that additionally interact with the environment, including accessing reinforcement learning rewards and queryable experts. CCIL cannot run on Ant and Humanoid because of their high-dimensional observations
  • Table3: The mean squared error for predicting the previous action a_{t-1} from various features
  • Table4: The BC mean squared error (for predicting the next action a_t) on test data
  • Table5: Similar to Table 1: the predictability of the next action conditioned on past actions, for our method and the BC-OH policy. Our method reduces the repetition of actions over time (a diagnostic-probe sketch follows this list)
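Tables 1, 3, and 5 all measure how predictable actions are from the past. A hypothetical re-creation of such a diagnostic, assuming policy features are precomputed and detached (the probe architecture is our own choice):

```python
import torch
import torch.nn as nn

def previous_action_probe(features, a_prev, epochs=200, lr=1e-3):
    """Fit a small regressor from frozen policy features to the previous
    action a_{t-1} and return the final MSE. A low error suggests the
    features still carry the previous-action nuisance information.
    features: (N, feat_dim), detached; a_prev: (N, act_dim)."""
    probe = nn.Sequential(nn.Linear(features.shape[1], 64), nn.ReLU(),
                          nn.Linear(64, a_prev.shape[1]))
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(probe(features), a_prev)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```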
Related work
  • Imitation Learning. Imitation learning [30, 2], first proposed by Widrow and Smith in 1964 [53], is a powerful learning paradigm that enables the learning of complex behaviors from demonstrations. We focus on the widely used behavioral cloning paradigm [35, 40, 26, 28, 8, 16], which suffers from a well-known problem: small errors in the learned policy compound over time, leading quickly to states outside the training demonstration distribution, where performance deteriorates. One solution is to assume access to a queryable expert who prescribes actions in the new states encountered by the policy, as in the widely-used DAGGER algorithm [39] and others [46, 25, 47]. Another well-studied alternative is to refine the policy through environment interaction [21, 9].
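To illustrate the queryable-expert idea behind DAGGER mentioned above, here is a minimal sketch of its aggregate-and-retrain loop; env, expert, and train are hypothetical stand-ins (including the env.step interface), not the algorithm's reference implementation.

```python
def dagger(env, expert, train, n_iters=10, rollout_len=200):
    """Each iteration: the current policy acts, the queryable expert
    labels every visited state, and the policy is retrained by
    supervised learning on the aggregated dataset."""
    dataset, policy = [], None
    for _ in range(n_iters):
        obs = env.reset()
        for _ in range(rollout_len):
            action = expert(obs) if policy is None else policy(obs)
            dataset.append((obs, expert(obs)))  # expert relabels state
            obs, done = env.step(action)
            if done:
                obs = env.reset()
        policy = train(dataset)  # fit policy to all collected pairs
    return policy
```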

    de Haan et al [12] explicitly connected distributional shift problems in imitation settings to nuisance correlations between input variables and expert actions, identifying the “causal confusion” problem. We isolate this causal confusion problem in its most frequently occurring form, the copycat problem motivated in Sec 1, encountered by ML practitioners within imitation learning [26, 6, 11, 51] and elsewhere, as “feedback loops” [42, 4]. We demonstrate a scalable solution to the copycat problem.
Findings
  • In our experiments, our approach improves performance significantly across a variety of partially observed imitation learning tasks.
References
  • [1] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
  • [2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
  • [3] Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 2017. URL https://openreview.net/forum?id=Hk4_qw5xe.
  • [4] Drew Bagnell. Feedback in machine learning, 2016. URL https://www.youtube.com/watch?v=XRSvz4UOpo4.
  • [5] Bram Bakker. Reinforcement learning by backpropagation through an LSTM model/critic. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 127–134. IEEE, 2007.
  • [6] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018.
  • [7] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations, 2020.
  • [8] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016.
  • [9] Kiante Brantley, Wen Sun, and Mikael Henaff. Disagreement-regularized imitation learning. In International Conference on Learning Representations, 2020.
  • [10] Peter Bühlmann. Invariance, causality and robustness. arXiv preprint arXiv:1812.08233, 2018.
  • [11] Felipe Codevilla, Eder Santana, Antonio M López, and Adrien Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 9329–9338, 2019.
  • [12] Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, pages 11693–11704, 2019.
  • [13] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • [14] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 1180–1189. JMLR.org, 2015.
  • [15] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. arXiv preprint arXiv:2004.07780, 2020.
  • [16] Alessandro Giusti, Jérôme Guzzi, Dan C Ciresan, Fang-Lin He, Juan P Rodríguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661–667, 2015.
  • [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • [18] Anirudh Goyal, Alex Lamb, Shagun Sodhani, Jordan Hoffmann, Sergey Levine, Yoshua Bengio, and Bernhard Scholkopf. Recurrent independent mechanisms, 2020. URL https://openreview.net/forum?id=BylaUTNtPS.
  • [19] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.
  • [20] Christina Heinze-Deml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness. arXiv preprint arXiv:1710.11469, 2017.
  • [21] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, pages 4565–4573. Curran Associates, Inc., 2016.
  • [22] Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
  • [23] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  • [24] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
  • [25] Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. DART: Noise injection for robust imitation learning. In Conference on Robot Learning, pages 143–156, 2017.
  • [26] Yann LeCun, Urs Muller, Jan Ben, Eric Cosatto, and Beat Flepp. Off-road obstacle avoidance through end-to-end learning. In Advances in Neural Information Processing Systems, pages 739–746, 2005.
  • [27] Nicolai Meinshausen. Causality from a distributional robustness point of view. In 2018 IEEE Data Science Workshop (DSW), pages 6–10. IEEE, 2018.
  • [28] Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.
  • [29] Austin Nichols. Causal inference with observational data. The Stata Journal, 7(4):507–541, 2007.
  • [30] T Osa, J Pajarinen, G Neumann, JA Bagnell, P Abbeel, and J Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018.
  • [31] Vincent Pacelli and Anirudha Majumdar. Learning task-driven control policies via information bottlenecks. arXiv preprint arXiv:2002.01428, 2020.
  • [32] Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Schölkopf. Learning independent causal mechanisms. In 35th International Conference on Machine Learning (ICML 2018), pages 6432–6442, 2018.
  • [33] Judea Pearl. Causality: Models, Reasoning, and Inference, second edition. Cambridge University Press, 2011. doi: 10.1017/CBO9780511803161.
  • [34] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, November 2017.
  • [35] Dean A Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.
  • [36] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico, 2016.
  • [37] Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pages 5331–5340, 2019.
  • [38] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286. PMLR, 2014.
  • [39] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. Journal of Machine Learning Research, 15:627–635, 2011.
  • [40] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.
  • [41] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • [42] David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.
  • [43] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. CoRR, abs/1610.04490, 2016. URL http://arxiv.org/abs/1610.04490.
  • [44] Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, Prediction, and Search. MIT Press, 2000.
  • [45] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [46] Wen Sun, Arun Venkatraman, Geoffrey J. Gordon, Byron Boots, and J. Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In 34th International Conference on Machine Learning (ICML 2017), pages 5090–5108, 2017.
  • [47] Wen Sun, J. Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning and imitation learning. In International Conference on Learning Representations, 2018.
  • [48] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  • [49] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [50] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
  • [51] Dequan Wang, Coline Devin, Qi-Zhi Cai, Philipp Krähenbühl, and Trevor Darrell. Monocular plan view networks for autonomous driving. In IROS, 2019.
  • [52] Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho. Non-monotonic sequential text generation. arXiv preprint arXiv:1902.02192, 2019.
  • [53] Bernard Widrow and Fred W Smith. Pattern-recognizing control systems. 1964.