Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation

Corey Lynch
Jasmine Hsu

CVPR Workshops, 2017.
∗ Equal contribution. † Google Brain Residency program (g.co/brainresidency).
Project page: sermanet.github.io/tcn · Full article: arxiv.org/abs/1704.06888


Abstract:

We propose a self-supervised approach for learning representations of relationships between humans and their environment, including object interactions, attributes, and body pose, entirely from unlabeled videos recorded from multiple viewpoints (Fig. 2). We train an embedding with a triplet loss that contrasts a pair of simultaneous frames […]
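As an illustration of this training objective, the following is a minimal NumPy sketch of one time-contrastive triplet: the anchor and positive are simultaneous frames from two views, and the negative is a temporally nearby frame from the anchor's own view. The function and parameter names (embed_fn, margin, neg_window) are assumptions for illustration, not the paper's code.

```python
# Sketch of a multi-view time-contrastive triplet loss (illustrative only).
import numpy as np

def tcn_triplet_loss(embed_fn, view1, view2, t, margin=0.2, neg_window=30, rng=None):
    """Triplet loss for one anchor frame.

    view1, view2: arrays of frames [T, H, W, C] from two synchronized cameras.
    t: anchor time index.
    Anchor and positive are simultaneous frames from different views; the
    negative is a temporally nearby frame from the anchor's own view.
    """
    rng = rng or np.random.default_rng()
    anchor = embed_fn(view1[t])      # frame from view 1 at time t
    positive = embed_fn(view2[t])    # same instant, seen from view 2
    # Draw a negative from the anchor's own view, offset in time by at least
    # one frame (random here; the paper also considers hard negatives).
    offset = int(rng.integers(1, neg_window)) * int(rng.choice([-1, 1]))
    n_idx = int(np.clip(t + offset, 0, len(view1) - 1))
    negative = embed_fn(view1[n_idx])

    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, float(d_pos - d_neg) + margin)
```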

Introduction
  • Unsupervised Object Interactions: The authors compare the multi-view TCN model against the Shuffle & Learn [1] approach, using the exact same architecture and changing only the loss and the last layer.
  • Both models are initialized with ImageNet classification weights and trained in a self-supervised manner on 15 minutes of multi-view pouring videos.
  • The authors test on 5 minutes of unseen pouring videos.
  • An off-the-shelf ImageNet-pretrained Inception model is used as a baseline.
  • The authors also propose a single-view TCN to compare with.
Highlights
  • Unsupervised Object Interactions: We compare our multi-view Time-Contrastive Networks (TCN) model against the Shuffle & Learn [1] approach, using the exact same architecture and changing only the loss and the last layer. Both models are initialized with ImageNet classification weights and trained in a self-supervised manner on 15 minutes of multi-view pouring videos.
  • Negatives are drawn either at random or as hard negatives from the anchor's temporal neighbors (see the sketch after this list).
  • We find that TCN outperforms all baselines on different quantitative metrics (Table 1), and that the multi-view model outperforms the single-view model. Both metrics use the nearest neighbor of a reference frame in the embedding of each method.
  • The attributes classification metric measures how well different attributes that are useful for performing a pouring task are modeled by different embeddings.
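The "random or hard negative" choice mentioned above can be sketched as follows, assuming per-frame embeddings for a sequence are already available; the helper name, window size, and exclusion margin are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of random vs. hard negative selection from temporal neighbors.
import numpy as np

def sample_negative(embeddings, t, pos_margin=5, neg_window=30, hard=False, rng=None):
    """Pick a negative index near anchor time t, excluding frames within
    pos_margin of t (those are treated as too similar to the anchor)."""
    rng = rng or np.random.default_rng()
    T = len(embeddings)
    lo, hi = max(0, t - neg_window), min(T, t + neg_window + 1)
    candidates = [i for i in range(lo, hi) if abs(i - t) > pos_margin]
    if not hard:
        return int(rng.choice(candidates))  # random temporal neighbor
    # Hard negative: the candidate whose embedding is currently closest to the
    # anchor, i.e. the one the triplet loss most needs to push away.
    dists = [np.sum((embeddings[t] - embeddings[i]) ** 2) for i in candidates]
    return candidates[int(np.argmin(dists))]
```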
Results
  • The authors find that TCN outperforms all baselines on different quantitative metrics (Table 1), and that the multi-view model outperforms the single-view model.
  • Negatives are drawn either at random or as hard negatives from the anchor's temporal neighbors.
  • Both metrics use the nearest neighbor of a reference frame in the embedding of each method (see the sketch after this list).
  • The alignment metric measures how well different sequences of the same demonstration can be semantically aligned using different embeddings.
  • The attributes classification metric measures how well different attributes that are useful for performing a pouring task are modeled by different embeddings.
  • End-to-end Self-Supervised Pose Imitation: The authors apply TCN to the problem of human pose imitation by a robot.
  • With an additional self-supervision signal (Fig. 4), the authors are able to produce end-to-end imitation without using any labels (Fig. 5).
  • The model is able to learn a complex human-to-robot mapping entirely self-supervised and is quantitatively better than human-supervised imitation (Table 2).
  • The combination of all signals performs best.
  • Methods compared (Table 1): Random, Inception-ImageNet, Shuffle & Learn [1], single-view TCN, multi-view TCN.
  • [Figure: pose-imitation pipeline in which a deep network trained with time-contrastive (TC) supervision, self-supervision, and human supervision feeds a joints decoder that the agent imitates.]
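As a rough illustration of the nearest-neighbor evaluation the bullets above describe, here is a hedged sketch of an alignment error between two demonstrations of the same task: each reference frame is matched to its nearest neighbor in the other sequence's embedding and compared by normalized time index. The exact definition used in the paper may differ; all names are illustrative.

```python
# Hedged sketch of a nearest-neighbor alignment error between two sequences.
import numpy as np

def alignment_error(emb_a, emb_b):
    """emb_a: [Ta, D] embeddings of one demonstration,
    emb_b: [Tb, D] embeddings of another demonstration of the same task."""
    Ta, Tb = len(emb_a), len(emb_b)
    errors = []
    for i, e in enumerate(emb_a):
        # Nearest neighbor of the reference frame in the other sequence.
        j = int(np.argmin(np.sum((emb_b - e) ** 2, axis=1)))
        # Compare positions in (normalized) time.
        errors.append(abs(i / max(Ta - 1, 1) - j / max(Tb - 1, 1)))
    return float(np.mean(errors))
```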
Conclusion
  • [Table residue: embeddings compared include ImageNet-Inception, Shuffle & Learn, and multi-view TCN; supervision combinations include Random, joints, Self, Human, Human + Self, TC + Self, TC + Human, and TC + Human + Self; the reported metric is the L2 robot joints error (%). See the sketch after this list.]
  • [Figure residue: reference frames and their nearest neighbors, shown in 1st-person and 3rd-person views.]
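The supervision combinations above feed a joints decoder on top of the TC embedding. As a minimal, hedged sketch of how such a decoder might be fit on (embedding, joint-vector) pairs gathered from robot self-observation and, optionally, human-supervised examples, here is an illustrative ridge regression; the regression choice and all names are assumptions, not the paper's implementation.

```python
# Illustrative joints decoder on top of a (frozen) TC embedding.
import numpy as np

def fit_joints_decoder(embeddings, joints, l2=1e-3):
    """embeddings: [N, D]; joints: [N, J]. Returns weights W [D, J] and bias b [J]."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])  # append bias column
    A = X.T @ X + l2 * np.eye(X.shape[1])
    W = np.linalg.solve(A, X.T @ joints)                        # closed-form ridge fit
    return W[:-1], W[-1]

def decode_joints(W, b, embedding):
    """Predict robot joint values for a single embedded frame."""
    return embedding @ W + b
```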
Tables
  • Table 1: Pouring alignment and classification errors: multi-view TCN outperforms all baselines on both metrics. The classification error considers 5 classes related to pouring, such as "hand contact with recipient", "container angle", and "liquid is flowing".
  • Table 2: Pose imitation error for different combinations of supervision signals. The error reported is the L2 robot joints distance between prediction and ground truth, as a percentage error normalized by the possible range of each joint; see the sketch below.
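A hedged reading of the Table 2 metric as a small sketch: each per-joint difference is normalized by that joint's possible range before taking the per-sample L2 distance, reported as a percentage. The exact normalization and the variable names here are assumptions.

```python
# Illustrative range-normalized L2 joint error, reported as a percentage.
import numpy as np

def joint_error_percent(pred, target, joint_min, joint_max):
    """pred, target: [N, J] joint values; joint_min, joint_max: [J] per-joint limits."""
    joint_range = joint_max - joint_min
    # Normalize each joint difference by its possible range, then take the
    # per-sample L2 distance; the paper's exact normalization may differ.
    diff = (pred - target) / joint_range
    return 100.0 * float(np.mean(np.linalg.norm(diff, axis=1)))
```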
Reference
  • [1] I. Misra, C. L. Zitnick, and M. Hebert. Unsupervised learning using sequential verification for action recognition. CoRR, abs/1603.08561, 2016.