Time-Contrastive Networks: Self-Supervised Learning from Video

Corey Lynch
Jasmine Hsu
Eric Jang
Google Brain

ICRA, pp. 1134-1141, 2018.

Abstract:

We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses.

Introduction
  • While supervised learning has been successful on a range of tasks where labels can be specified by humans, such as object classification, many problems that arise in interactive applications like robotics are exceptionally difficult to supervise.
  • For robots in the real world to imitate directly from third-person video observations, they would need to be capable of two things: learning the relevant attributes of an object interaction task purely from observation, and understanding how human poses and object interactions can be mapped onto the robot.
  • By learning from multi-view videos, the learned representations effectively disentangle functional attributes such as pose while being viewpoint and agent invariant.
  • The authors show how the robot can learn to link this visual representation to a corresponding motor command using either reinforcement learning or direct regression, effectively learning new tasks by observing humans
Highlights
  • While supervised learning has been successful on a range of tasks where labels can be specified by humans, such as object classification, many problems that arise in interactive applications like robotics are exceptionally difficult to supervise
  • Robots in the real world would need to be capable of two things: learning the relevant attributes of an object interaction task purely from observation, and understanding how human poses and object interactions can be mapped onto the robot, in order to imitate directly from third-person video observations.
  • We show how the robot can learn to link this visual representation to a corresponding motor command using either reinforcement learning or direct regression, effectively learning new tasks by observing humans
  • The main contribution of our work is a representation learning algorithm that builds on top of existing semantically relevant features to produce a metric embedding that is sensitive to object interactions and pose, and insensitive to nuisance variables such as viewpoint and appearance. We demonstrate that this representation can be used to create a reward function for reinforcement learning of robotic skills, using only raw video demonstrations for supervision, and for direct imitation of human poses, without any explicit joint-level correspondence and again directly from raw video
  • The representation is learned by anchoring a temporally contrastive signal against co-occurring frames from other viewpoints, resulting in a representation that disambiguates temporal changes while providing invariance to viewpoint and other nuisance variables. We show that this representation can be used to provide a reward function within a reinforcement learning system for robotic object manipulation, and to provide mappings between human and robot poses to enable pose imitation directly from raw video (a sketch of this training signal follows this list).
  • The training process requires a dataset of multi-viewpoint videos, but once the TCN is trained, only a single raw video demonstration is needed for imitation.
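As a concrete illustration of the time-contrastive signal described above, the sketch below forms triplets from two synchronized viewpoints and computes a standard triplet hinge loss. This is a minimal sketch under stated assumptions: the function name, the margin alpha, and the negative-sampling window min_gap are illustrative choices rather than the authors' implementation, and the per-frame embeddings are assumed to be precomputed.

```python
import numpy as np

def time_contrastive_triplet_loss(view1_emb, view2_emb, alpha=0.2, min_gap=30):
    """Triplet hinge loss over time-contrastive triplets (illustrative sketch).

    view1_emb, view2_emb: (T, D) arrays of per-frame embeddings from two
    time-synchronized viewpoints. For each anchor frame t in view 1, the
    positive is the co-occurring frame t in view 2, and the negative is a
    frame from view 1 at least `min_gap` steps away in time.
    """
    T = view1_emb.shape[0]
    losses = []
    for t in range(T):
        anchor, positive = view1_emb[t], view2_emb[t]   # same time, other view
        # temporally distant frames from the same view serve as negatives
        candidates = [i for i in range(T) if abs(i - t) >= min_gap]
        if not candidates:
            continue
        negative = view1_emb[np.random.choice(candidates)]
        d_pos = np.sum((anchor - positive) ** 2)        # squared L2 distances
        d_neg = np.sum((anchor - negative) ** 2)
        losses.append(max(0.0, d_pos - d_neg + alpha))  # hinge with margin
    return float(np.mean(losses)) if losses else 0.0
```

In practice the embeddings come from a trainable network and the loss is minimized by gradient descent; the NumPy version only shows how anchors, positives, and negatives are paired.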
Methods
  • Quantitative Evaluation: The authors present two metrics in Table 1 to evaluate what the models are able to capture.
  • The alignment metric measures how well a model can semantically align two videos.
  • The classification metric measures how well a model can disentangle pouring-related attributes that are useful in a real robotic pouring task.
  • Given each frame of a video, each model has to pick the most semantically similar frame in another video.
  • The "Random" baseline returns a random frame from the second video (see the alignment-metric sketch after this list).
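One plausible way to compute the alignment metric and the "Random" baseline is sketched below, assuming equal-length, time-synchronized videos, nearest-neighbor matching in embedding space, and time offsets normalized by sequence length; the paper's exact definition may differ.

```python
import numpy as np

def alignment_error(emb_a, emb_b):
    """Average normalized time offset when aligning video A against video B.

    emb_a, emb_b: (T, D) per-frame embeddings of two time-synchronized videos.
    For frame i of A the correct match in B is frame i; the model's guess is
    the nearest neighbor of emb_a[i] among all frames of B.
    """
    T = emb_a.shape[0]
    # (T, T) matrix of squared distances between every frame of A and of B
    dists = np.sum((emb_a[:, None, :] - emb_b[None, :, :]) ** 2, axis=-1)
    nearest = np.argmin(dists, axis=1)
    return float(np.mean(np.abs(nearest - np.arange(T))) / T)

def random_baseline_error(T, rng=None):
    """The "Random" baseline picks a uniformly random frame of B instead."""
    if rng is None:
        rng = np.random.default_rng()
    guesses = rng.integers(0, T, size=T)
    return float(np.mean(np.abs(guesses - np.arange(T))) / T)
```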
Results
  • The authors offer multiple qualitative evaluations: k-Nearest Neighbors in Fig. 10, imitation strips in Fig. 21 and a t-SNE visualization in Fig. 14.
  • Imitation strips: In Fig. 21, the authors present an example of how the self-supervised model has learned to imitate the height level of humans by itself using the "torso" joint.
  • More kNN examples, imitation strips and t-SNE visualizations from different models are available in Sec. F
Conclusion
  • The authors introduced a self-supervised representation learning method (TCN) based on multi-view video.
  • The representation is learned by anchoring a temporally contrastive signal against co-occurring frames from other viewpoints, resulting in a representation that disambiguates temporal changes while providing invariance to viewpoint and other nuisance variables.
  • The authors show that this representation can be used to provide a reward function within a reinforcement learning system for robotic object manipulation, and to provide mappings between human and robot poses to enable pose imitation directly from raw video (a sketch of such a reward follows this list).
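As a minimal sketch of the reward-function idea (one simple instantiation; the paper's actual reward shaping may include additional terms), the reward at each time step can be the negative squared distance, in TCN embedding space, between the robot's current camera frame and the corresponding frame of the single demonstration video:

```python
import numpy as np

def tcn_reward(demo_emb, robot_emb_t, t):
    """Reward at time t: negative squared distance, in TCN embedding space,
    between the robot's current camera frame and the demonstration it imitates.

    demo_emb:    (T, D) embeddings of one raw human demonstration video.
    robot_emb_t: (D,) embedding of the robot's current camera frame.
    """
    return -float(np.sum((robot_emb_t - demo_emb[t]) ** 2))
```

A reinforcement-learning algorithm can then maximize the sum of these rewards so that the robot's trajectory, seen through the TCN, tracks the demonstration frame by frame.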
Summary
  • Introduction:

    While supervised learning has been successful on a range of tasks where labels can be specified by humans, such as object classification, many problems that arise in interactive applications like robotics are exceptionally difficult to supervise.
  • For robots in the real world to imitate directly from third-person video observations, they would need to be capable of two things: learning the relevant attributes of an object interaction task purely from observation, and understanding how human poses and object interactions can be mapped onto the robot.
  • By learning from multi-view videos, the learned representations effectively disentangle functional attributes such as pose while being viewpoint and agent invariant.
  • The authors show how the robot can learn to link this visual representation to a corresponding motor command using either reinforcement learning or direct regression, effectively learning new tasks by observing humans
  • Objectives:

    If the authors aim to endow robots with wide repertoires of behavioral skills, being able to acquire those skills directly from third-person videos of humans would be dramatically more scalable.
  • The authors aim to learn an embedding f such that ||f(x_i^a) − f(x_i^p)||_2^2 + α < ||f(x_i^a) − f(x_i^n)||_2^2, where x_i^a is an anchor frame, x_i^p is the co-occurring (positive) frame from another viewpoint, x_i^n is a temporally distant (negative) frame from the same viewpoint, and α is a margin.
  • In this work, the authors aim to compare against realistic, general off-the-shelf models that one might use without requiring new labels.
  • Methods:

    Quantitative Evaluation: The authors present two metrics in Table 1 to evaluate what the models are able to capture.
  • The alignment metric measures how well a model can semantically align two videos.
  • The classification metric measures how well a model can disentangle pouring-related attributes that are useful in a real robotic pouring task.
  • Given each frame of a video, each model has to pick the most semantically similar frame in another video.
  • The "Random" baseline returns a random frame from the second video.
  • Results:

    The authors offer multiple qualitative evaluations: k-Nearest Neighbors in Fig. 10, imitation strips in Fig. 21 and a t-SNE visualization in Fig. 14.
  • Imitation strips: In Fig. 21, the authors present an example of how the self-supervised model has learned to imitate the height level of humans by itself using the "torso" joint.
  • More kNN examples, imitation strips and t-SNE visualizations from different models are available in Sec. F
  • Conclusion:

    The authors introduced a self-supervised representation learning method (TCN) based on multi-view video.
  • The representation is learned by anchoring a temporally contrastive signal against co-occurring frames from other viewpoints, resulting in a representation that disambiguates temporal changes while providing invariance to viewpoint and other nuisance variables.
  • The authors show that this representation can be used to provide a reward function within a reinforcement learning system for robotic object manipulation, and to provide mappings between human and robot poses to enable pose imitation directly from raw video.
Tables
  • Table 1: Pouring alignment and classification errors. All models are selected at their lowest validation loss. The classification error considers 5 classes related to pouring, detailed in Table 2.
  • Table 2: Detailed attribute classification errors, for the model selected by validation loss.
  • Table 3: Pouring alignment and classification errors. These models are selected using the classification score on a small labeled validation set, then evaluated on the full test set. We observe that multi-view TCN outperforms other models with 15x shorter training time. The classification error considers 5 classes related to pouring: "hand contact with recipient", "within pouring distance", "container angle", "liquid is flowing", and "recipient fullness".
  • Table 4: Imitation error for different combinations of supervision signals. The error reported is the joint distance between prediction and ground truth. Note that perfect imitation is not possible (a sketch of such a joint-regression setup follows this list).
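To make the direct-regression route behind Table 4 concrete, the sketch below fits a simple map from frozen TCN embeddings to robot joint angles. The authors use a learned joints decoder network; the closed-form ridge regression and the function names here are stand-ins chosen only to keep the example self-contained.

```python
import numpy as np

def fit_joint_decoder(tcn_emb, joint_angles, l2=1e-3):
    """Ridge-regression map from TCN embeddings to robot joint angles.

    tcn_emb:      (N, D) embeddings of frames showing poses, e.g. collected
                  while the robot observes itself (self-supervision) or from
                  human-supervised image/pose pairs.
    joint_angles: (N, J) corresponding robot joint vectors.
    Returns a (D + 1, J) weight matrix whose last row is a bias.
    """
    X = np.hstack([tcn_emb, np.ones((tcn_emb.shape[0], 1))])  # append bias column
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ joint_angles)

def predict_joints(W, emb):
    """Predict a joint vector (J,) for a single embedding (D,)."""
    return np.append(emb, 1.0) @ W
```

The imitation error reported in Table 4 then corresponds to the average distance between predicted and ground-truth joint vectors.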
Related work
  • Imitation learning: Imitation learning [3] has been widely used for learning robotic skills from expert demonstrations [4, 5, 6, 7] and can be split into two areas: behavioral cloning and inverse reinforcement learning (IRL). Behavioral cloning considers a supervised learning problem, where examples of behaviors are provided as state-action pairs [8, 9]. IRL, on the other hand, uses expert demonstrations to learn a reward function that can be used to optimize an imitation policy with reinforcement learning [10]. Both types of imitation learning typically require the expert to provide demonstrations in the same context as the learner. In robotics, this might be accomplished by means of kinesthetic demonstrations [11] or teleoperation [12], but both methods require considerable operator expertise. If we aim to endow robots with wide repertoires of behavioral skills, being able to acquire those skills directly from third-person videos of humans would be dramatically more scalable. Recently, a range of works have studied the problem of imitating a demonstration observed in a different context, e.g. from a different viewpoint or an agent with a different embodiment, such as a human [13, 14, 15]. Liu et al. [16] proposed to translate demonstrations between the expert and the learner contexts to learn an imitation policy by minimizing the distance to the translated demonstrations. However, Liu et al. explicitly exclude from consideration any demonstrations with domain shift, where the demonstration is performed by a human and imitated by the robot with clear visual differences (e.g., human hands vs. robot grippers). In contrast, our TCN models are trained on a diverse range of demonstrations with different embodiments, objects, and backgrounds. This allows our TCN-based method to directly mimic human demonstrations, including demonstrations where a human pours liquid into a cup, and to mimic human poses without any explicit joint-level alignment. To our knowledge, our work is the first method for imitation of raw video demonstrations that can both mimic raw videos and handle the domain shift between human and robot embodiment.
  • Label-free training signals: Label-free learning of visual representations promises to enable visual understanding from unsupervised data, and has therefore been explored extensively in recent years. Prior work in this area has studied unsupervised learning as a way of enabling supervised learning from small labeled datasets [17], image retrieval [18], and a variety of other tasks [19, 20, 21, 22]. In this paper, we focus specifically on representation learning for the purpose of modeling interactions between objects, humans, and their environment, which requires implicit modeling of a broad range of factors, such as functional relationships, while being invariant to nuisance variables such as viewpoint and appearance. Our method makes use of simultaneously recorded signals from multiple viewpoints to construct an image embedding. A number of prior works have used multiple modalities and temporal or spatial coherence to extract embeddings and features. For example, [23, 24] used co-occurrence of sounds and visual cues in videos to learn meaningful visual features. [20] also propose a multimodal approach to self-supervision by training a network for cross-channel input reconstruction. [25, 26] use the spatial coherence in images as a self-supervision signal, and [27] use motion cues to self-supervise a segmentation task. These methods are more focused on spatial relationships, and the unsupervised signal they provide is complementary to the one explored in this work.
Reference
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
  • B. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
  • J.A. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In ICRA, 2002.
  • N.D. Ratliff, J.A. Bagnell, and S.S. Srinivasa. Imitation learning for locomotion and manipulation. In Humanoids, 2007.
  • K. Mulling, J. Kober, O. Kroemer, and J. Peters. Learning to select and generalize striking movements in robot table tennis. In AAAI Fall Symposium: Robots Learning Interactively from Human Teachers, volume FS-12-07, 2012.
  • Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017.
  • D.A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
  • S. Ross, G.J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, volume 15, pages 627–635, 2011.
  • P. Abbeel and A. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
  • S. Calinon, F. Guenter, and A. Billard. On learning, representing and generalizing a task in a humanoid robot. IEEE Trans. on Systems, Man and Cybernetics, Part B, 37(2):286–298, 2007.
  • P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal. Learning and generalization of motor skills by learning from demonstration. In ICRA, pages 763–768, 2009.
  • B.C. Stadie, P. Abbeel, and I. Sutskever. Third-person imitation learning. CoRR, abs/1703.01703, 2017.
  • A. Dragan and S. Srinivasa. Online customization of teleoperation interfaces. In RO-MAN, pages 919–924, 2012.
  • P. Sermanet, K. Xu, and S. Levine. Unsupervised perceptual rewards for imitation learning. CoRR, abs/1612.06699, 2016.
  • Y. Liu, A. Gupta, P. Abbeel, and S. Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. CoRR, abs/1707.03374, 2017.
  • V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. In ICLR, 2017.
  • M. Paulin, M. Douze, Z. Harchaoui, J. Mairal, F. Perronin, and C. Schmid. Local convolutional features with unsupervised training for image retrieval. In ICCV, pages 91–99, 2015.
  • X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. CoRR, abs/1505.00687, 2015.
  • R. Zhang, P. Isola, and A.A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. CoRR, abs/1611.09842, 2016.
  • P. Vincent, H. Larochelle, Y. Bengio, and P.A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
  • V. Kumar, G. Carneiro, and I.D. Reid. Learning local image descriptors with deep siamese and triplet convolutional networks by minimizing global loss functions. In CVPR, 2016.
  • A. Owens, P. Isola, J.H. McDermott, A. Torralba, E.H. Adelson, and W.T. Freeman. Visually indicated sounds. CoRR, abs/1512.08512, 2015.
  • Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. CoRR, abs/1610.09001, 2016.
  • C. Doersch, A. Gupta, and A.A. Efros. Unsupervised visual representation learning by context prediction. CoRR, abs/1505.05192, 2015.
  • S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, pages 4353–4361, 2015.
  • D. Pathak, R.B. Girshick, P. Dollar, T. Darrell, and B. Hariharan. Learning features by watching objects move. CoRR, abs/1612.06370, 2016.
  • L. Wiskott and T.J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
  • R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. In ICCV, 2015.
  • B. Fernando, H. Bilen, E. Gavves, and S. Gould. Self-supervised video representation learning with odd-one-out networks. CoRR, abs/1611.06646, 2016.
  • I. Misra, C.L. Zitnick, and M. Hebert. Unsupervised learning using sequential verification for action recognition. CoRR, abs/1603.08561, 2016.
  • K. Moo Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned invariant feature transform. CoRR, abs/1603.09114, 2016.
  • E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, 2015.
  • W.F. Whitney, M. Chang, T.D. Kulkarni, and J.B. Tenenbaum. Understanding visual concepts with continuation learning. CoRR, abs/1602.06822, 2016.
  • M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. CoRR, abs/1511.05440, 2015.
  • G. Mori, C. Pantofaru, N. Kothari, T. Leung, G. Toderici, A. Toshev, and W. Yang. Pose embeddings: A deep architecture for learning to match human poses. CoRR, abs/1507.00302, 2015.
  • R. Stewart and S. Ermon. Label-free supervision of neural networks with physics and domain knowledge. CoRR, abs/1609.05566, 2016.
  • V. Caggiano, L. Fogassi, G. Rizzolatti, J.K. Pomper, P. Thier, M.A. Giese, and A. Casile. View-based encoding of actions in mirror neurons of area F5 in macaque premotor cortex. Current Biology, 21(2):144–148, 2011.
  • G. Rizzolatti and L. Craighero. The mirror-neuron system. Annual Review of Neuroscience, 27:169–192, 2004.
  • F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. CoRR, abs/1503.03832, 2015.
  • Y. Chebotar, K. Hausman, M. Zhang, G. Sukhatme, S. Schaal, and S. Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In ICML, 2017.
  • C. Finn, X. Yu Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Learning visual feature spaces for robotic manipulation with deep spatial autoencoders. CoRR, abs/1509.06113, 2015.
  • K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, pages 1857–1865, 2016.
  • H.O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. CoRR, abs/1511.06452, 2015.
  • E. Coumans and Y. Bai. pybullet, a Python module for physics simulation in robotics, games and machine learning. http://pybullet.org/, 2016–2017.
  • S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In NIPS, 2014.
  • Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors. In IROS, 2012.
  • E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. JMLR, 11, 2010.
  • Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine. Path integral guided policy search. In ICRA, 2017.