Domain Adaptive Imitation Learning

Kuno Kim, Yihong Gu

ICML, pp. 5286-5295, 2020.

Abstract:

We study the question of how to imitate tasks across domains with discrepancies such as embodiment, viewpoint, and dynamics mismatch. Many prior works require paired, aligned demonstrations and an additional RL step that requires environment interactions. However, paired, aligned demonstrations are seldom obtainable and RL procedures are ...

Introduction
  • Humans possess an astonishing ability to recognize latent structural similarities between behaviors in related but distinct domains, and learn new skills from cross domain demonstrations alone.
  • Previous work in neuroscience (Marshall & Meltzoff, 2015) and robotics (Kuniyoshi & Inoue, 1993; Kuniyoshi et al., 1994) has recognized the pitfalls of exact behavioral cloning in the presence of domain discrepancies and posited that the effectiveness of the human imitation learning mechanism hinges on the ability to learn structure-preserving domain correspondences
  • These correspondences enable the learner to internalize cross domain demonstrations and produce a reconstruction of the behavior in the self domain.
  • When the adult demonstrates running, the child is able to internalize the demonstration and reproduce the behavior
Highlights
  • Humans possess an astonishing ability to recognize latent structural similarities between behaviors in related but distinct domains, and learn new skills from cross domain demonstrations alone
  • To shed light on when Domain Adaptive Imitation Learning can be solved by alignment and adaptation, we introduce a theory of Markov Decision Process alignability
  • We propose an unsupervised Markov Decision Process alignment algorithm that succeeds at Domain Adaptive Imitation Learning from unpaired, unaligned demonstrations, removing the need for costly paired, aligned data (an illustrative sketch of this adversarial alignment step follows this list)
  • We propose a unifying theoretical framework for imitation learning across domains with dynamics, embodiment, and/or viewpoint mismatch
  • We have formalized Cross Domain Imitation Learning, which encompasses prior work in transfer learning across embodiment (Gupta et al., 2017) and viewpoint differences (Stadie et al., 2017; Liu et al., 2018), along with a practical algorithm that can be applied to both scenarios
  • While we have shown that Generative Adversarial MDP Alignment (GAMA) empirically works well even when the MDPs are not perfectly alignable, future work may explore relaxing the conditions for MDP alignability to develop a theory that covers an even wider range of real-world MDPs
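As a concrete illustration of the adversarial alignment idea referenced above, the sketch below shows a GAN-style MDP alignment loop in which a state map and an action map translate self-domain transitions into the target domain, while a discriminator tries to tell mapped transitions from real target-domain demonstration transitions. This is a minimal sketch, not the authors' implementation: the dimensions, architectures, and loss details are assumptions made for illustration.

    # Minimal GAN-style MDP alignment sketch (illustrative assumptions throughout).
    import torch
    import torch.nn as nn

    S_SELF, A_SELF, S_TGT, A_TGT = 8, 2, 10, 3  # hypothetical state/action dimensions

    state_map = nn.Sequential(nn.Linear(S_SELF, 64), nn.ReLU(), nn.Linear(64, S_TGT))
    action_map = nn.Sequential(nn.Linear(A_SELF, 64), nn.ReLU(), nn.Linear(64, A_TGT))
    disc = nn.Sequential(nn.Linear(2 * S_TGT + A_TGT, 64), nn.ReLU(), nn.Linear(64, 1))

    opt_maps = torch.optim.Adam(
        list(state_map.parameters()) + list(action_map.parameters()), lr=1e-4)
    opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    def alignment_step(self_batch, target_batch):
        """One adversarial update on unpaired (s, a, s') batches from each domain."""
        s, a, s_next = self_batch
        mapped = torch.cat([state_map(s), action_map(a), state_map(s_next)], dim=-1)
        real = torch.cat(target_batch, dim=-1)  # target-domain demonstration transitions
        # Discriminator: real target transitions -> 1, mapped self transitions -> 0.
        d_loss = (bce(disc(real), torch.ones(real.size(0), 1)) +
                  bce(disc(mapped.detach()), torch.zeros(mapped.size(0), 1)))
        opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()
        # Maps: update so mapped self transitions become indistinguishable from target ones.
        g_loss = bce(disc(mapped), torch.ones(mapped.size(0), 1))
        opt_maps.zero_grad(); g_loss.backward(); opt_maps.step()
        return d_loss.item(), g_loss.item()

Note that only unpaired, unaligned transition batches from each domain enter this loop, which is the property the highlight above emphasizes.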
Methods
  • The authors' experiments were designed to answer the following questions: (1) Can GAMA uncover MDP reductions? (2) What is its alignment complexity? (3) What is its adaptation complexity?
  • Alignment complexity is the number of MDP pairs, i.e., the number of tasks, in the alignment task set needed to learn alignments that enable zero-shot imitation, given ample cross domain demonstrations for the target tasks (a sketch of this zero-shot adaptation follows this list).
  • Adaptation complexity is the number of cross domain demonstrations for the target tasks needed to successfully imitate tasks in the self domain without querying the target task reward function, given a sufficiently large alignment task set.
  • Note that the authors include experiments with MDP pairs that are not perfectly alignable (D-R2P)
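The zero-shot imitation mentioned in the bullets above can be sketched as a simple composition: translate a self-domain state into the target domain with a learned state map, query the target-domain expert, and translate the chosen action back into the self domain with a learned reverse-direction action map. The function names, map directions, and stand-in networks below are hypothetical illustrations, not the paper's exact formulation.

    # Hypothetical zero-shot adaptation by composing learned maps with a target expert.
    import torch

    def adapted_self_policy(s_self, state_map, action_map_back, target_policy):
        """Act in the self domain using a target-domain expert, with no RL step."""
        s_tgt = state_map(s_self)          # self-domain state -> target-domain state
        a_tgt = target_policy(s_tgt)       # query the target-domain expert policy
        return action_map_back(a_tgt)      # target-domain action -> self-domain action

    # Toy usage with made-up linear stand-ins for the learned maps and the expert.
    state_map = torch.nn.Linear(8, 10)
    action_map_back = torch.nn.Linear(3, 2)
    target_policy = torch.nn.Linear(10, 3)
    a_self = adapted_self_policy(torch.zeros(8), state_map, action_map_back, target_policy)

Because acting reduces to this composition, no target task reward queries or additional environment interactions are needed at adaptation time, which is what the adaptation complexity above measures.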
Conclusion
  • The authors have formalized Cross Domain Imitation Learning, which encompasses prior work in transfer learning across embodiment (Gupta et al., 2017) and viewpoint differences (Stadie et al., 2017; Liu et al., 2018), along with a practical algorithm that can be applied to both scenarios.
  • The authors hope to see future works develop principled ways to design a minimal alignment task set, which is analogous to designing a minimal training set for supervised learning
Tables
  • Table 1: MDP Alignment Performance. Mean L2 loss between the learned state map predictions and the ground truth permutation. On average, GAMA has 17.3× lower loss than the best baseline. Results are averaged across 5 seeds
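As a rough illustration of the metric reported in Table 1, the snippet below computes a mean L2 loss between a state map's predictions and ground-truth corresponding states (here, a made-up coordinate permutation). The data, the identity stand-in for the state map, and the exact averaging are assumptions; the authors' evaluation protocol may differ in detail.

    # Hypothetical alignment evaluation: mean L2 error between the learned state
    # map's predictions and ground-truth corresponding states.
    import numpy as np

    def mean_l2_alignment_loss(self_states, true_target_states, state_map):
        """Average Euclidean distance between predicted and ground-truth target states."""
        preds = np.stack([state_map(s) for s in self_states])
        return float(np.mean(np.linalg.norm(preds - true_target_states, axis=-1)))

    # Toy example: the "true" correspondence permutes (reverses) the state coordinates.
    rng = np.random.default_rng(0)
    self_states = rng.normal(size=(100, 4))
    true_target_states = self_states[:, ::-1]
    print(mean_l2_alignment_loss(self_states, true_target_states, lambda s: s))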
Related work
  • Closely related to DAIL, the field of cross domain transfer learning in the context of RL has explored approaches that use state maps to exploit cross domain demonstrations in a pretraining procedure for a new target task for which a self domain reward function is available. Canonical Correlation Analysis (CCA) (Hotelling, 1936) finds invertible projections into a basis in which data from different domains are maximally correlated; these projections can then be composed to obtain a direct correspondence map between states (a toy sketch of this composition is given after this paragraph). Ammar et al. (2015) and Joshi & Chowdhary (2018) have utilized an unsupervised manifold alignment (UMA) algorithm which finds a linear map between states with similar local geometric properties; UMA assumes the existence of hand-crafted features along with a distance metric between them. This family of work commonly uses a linear state map to define a time-step-wise transfer reward and executes an RL step on the new task. Similar to our work, these works use an alignment task set of unpaired, unaligned trajectories to compute the state map. Unlike these works, we learn maps that preserve MDP structure, use deep neural network state and action maps, and achieve zero-shot transfer to the new task without an RL step. More recent work in transfer learning across embodiment (Gupta et al., 2017) and viewpoint (Liu et al., 2018; Sermanet et al., 2018) mismatch obtains state correspondences from an alignment task set comprising paired, time-aligned demonstrations and uses them to learn a state map or a state encoder to a domain invariant feature space. In contrast to this family of prior work, our approach learns both state and action maps from unpaired, unaligned demonstrations. Also, we remove the need for additional environment interactions and an expensive RL procedure on the target task by leveraging the action map for zero-shot imitation. Stadie et al. (2017) have shown promise in using a domain confusion loss and generative adversarial imitation learning (Ho & Ermon, 2016) for learning across small viewpoint mismatch without an alignment task set, but their approach fails in dealing with large viewpoint differences. Unlike Stadie et al. (2017), we leverage the alignment task set to succeed in imitating across larger viewpoint differences.
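To make the CCA-based correspondence concrete, here is the toy sketch referenced in the paragraph above: fit CCA on state samples from the two domains, then compose the learned projections through the shared canonical basis to map a new state from one domain to the other. The scikit-learn CCA class is real, but the data, dimensions, and the least-squares composition step are illustrative assumptions; note also that fitting CCA this way presumes paired samples, unlike the unpaired setting emphasized in this work.

    # Toy CCA-based state correspondence (assumes paired samples; all data made up).
    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 6))                # states from domain A
    Y = np.tanh(X @ rng.normal(size=(6, 4)))     # "corresponding" states from domain B

    cca = CCA(n_components=3)
    Xc, Yc = cca.fit_transform(X, Y)             # projections into the shared canonical basis

    # Compose projections: new domain-A state -> canonical space -> domain-B state,
    # where the canonical-to-B step is a least-squares fit from Yc back to Y.
    B_out, *_ = np.linalg.lstsq(Yc, Y, rcond=None)
    x_new = rng.normal(size=(1, 6))
    y_pred = cca.transform(x_new) @ B_out        # predicted corresponding domain-B state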
Funding
  • This research was supported by Sony, NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024)
References
  • Ammar, H. B., Eaton, E., Ruvolo, P., and Taylor, M. E. An automated measure of MDP similarity for transfer in reinforcement learning. 2014.
  • Ammar, H. B., Eaton, E., Ruvolo, P., and Taylor, M. E. Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment. 2015.
  • Billingsley, P. Convergence of Probability Measures. 1968.
  • Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Ferns, N., Panangaden, P., and Precup, D. Metrics for finite Markov decision processes. In UAI, 2004.
  • Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. International Conference on Robotics and Automation, 2015.
  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Gupta, A., Devin, C., Liu, Y. X., Abbeel, P., and Levine, S. Learning invariant feature spaces to transfer skills with reinforcement learning. International Conference on Learning Representations, 2017.
  • Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
  • Hotelling, H. Relations between two sets of variates. Biometrika, 28, 1936.
  • Jones, S. S. The development of imitation in infancy. Philosophical Transactions of the Royal Society B: Biological Sciences, 364:2325–2335, 2009.
  • Joshi, G. and Chowdhary, G. Cross-domain transfer in reinforcement learning using target apprentice. 2018.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kuniyoshi, Y. and Inoue, H. Qualitative recognition of ongoing human action sequences. International Joint Conference on Artificial Intelligence, 1993.
  • Kuniyoshi, Y., Inaba, M., and Inoue, H. Learning by watching: Extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation, 10:799–822, 1994.
  • Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Liu, F., Ling, Z., Mu, T., and Su, H. State alignment-based imitation learning. International Conference on Learning Representations, 2020.
  • Liu, Y., Gupta, A., Abbeel, P., and Levine, S. Imitation from observation: Learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374, 2018.
  • Marshall, P. J. and Meltzoff, A. N. Body maps in the infant brain. Trends in Cognitive Sciences, 19:499–505, 2015.
  • Muller, M. Dynamic time warping. Information Retrieval for Music and Motion, pp. 69–84, 2007.
  • Ortner, R. Combinations and mixtures of optimal policies in unichain Markov decision processes are optimal. arXiv preprint arXiv:0508319, 2005.
  • Pomerleau, D. A. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
  • Ravindran, B. and Barto, A. G. Model minimization in hierarchical reinforcement learning. In SARA, 2002.
  • Rizzolatti, G. and Craighero, L. The mirror neuron system. Annual Review of Neuroscience, 27:169–192, 2004.
  • Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., and Levine, S. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2018.
  • Stadie, B., Abbeel, P., and Sutskever, I. Third person imitation learning. In ICLR, 2017.
  • Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, pp. 1032–1039. ACM, 2008.
  • Wang, T., Liao, R., Ba, J., and Fidler, S. NerveNet: Learning structured policy with graph neural networks. International Conference on Learning Representations, 2018.