Cycle-Contrast for Self-Supervised Video Representation Learning

NeurIPS 2020

Abstract

We present Cycle-Contrastive Learning (CCL), a novel self-supervised method for learning video representations. Motivated by the natural belonging and inclusion relation between a video and its frames, CCL is designed to find correspondences across frames and videos while considering contrastive representations within each of the two domains...

Introduction
  • Self-supervised learning has recently made notable progress in natural language processing and computer vision.
  • Most existing works use temporal sequence ordering [20, 13, 36] or future frame prediction [8, 28, 18] as pretext tasks for self-supervised video representation learning, assuming that correspondences across frames or clips can be generalized to represent a video.
  • While these methods yield effective representations and decent results on downstream tasks, the authors suggest that exploiting other natural characteristics of video can lead to different yet equally representative video representations.
  • To make use of this nature, the authors propose Cycle-Contrastive Learning (CCL), a self-supervised method based on both cycle-consistency between a video and its frames and contrastive representations within each domain, in order to learn representations that are close across the video and frame domains yet distant from all other videos and frames in the corresponding domain.
Highlights
  • Self-supervised learning has recently made notable progress in natural language processing and computer vision.
  • To make use of this nature, we propose Cycle-Contrastive Learning (CCL), a self-supervised method based on both cycle-consistency between a video and its frames and contrastive representations within each domain, in order to learn representations that are close across the video and frame domains yet distant from all other videos and frames in the corresponding domain (a minimal sketch of such a loss follows this list).
  • To verify that the proposed method learns a good representation that can be transferred to downstream tasks by fine-tuning, we show competitive results of CCL on two tasks, nearest neighbour retrieval and action recognition, demonstrating that CCL learns a general representation and significantly closes the gap between unsupervised and supervised video representations.
  • The contributions of this paper can be summarized as: (i) we argue that video representation is structured over two domains, video and frame, and that a good video representation should be close across both domains yet distant from all other videos and frames in the corresponding domain; (ii) we design a cycle-contrastive loss to learn video representations with these desired properties, and our experiments suggest that the learned representations lead to decent results on downstream tasks.
  • The property captured by CCL is still only one of the natural characteristics that a good video representation should possess.
  • Various machine-learning-based video understanding tasks, such as action recognition/detection, sports action scoring, and video recommendation, can benefit from a good video representation by fine-tuning the self-supervised model on different tasks with smaller amounts of annotated data.
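The cycle-contrastive idea above can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch formulation, not the authors' exact loss: it assumes one frame embedding sampled per video in the mini-batch, a soft-nearest-neighbour video→frame→video cycle, and simple in-batch contrastive terms; the function name, the positive/negative construction, and the illustrative values of the balance parameters w1, w2, w3 are assumptions.

```python
import torch
import torch.nn.functional as F

def cycle_contrastive_loss(video_emb, frame_emb, temperature=0.1):
    """Minimal sketch of a cycle-contrastive objective (assumptions noted above).

    video_emb: (B, D) one embedding per video in the mini-batch.
    frame_emb: (B, D) one embedding for a frame sampled from each video;
               frame i is assumed to belong to video i.
    """
    v = F.normalize(video_emb, dim=1)
    f = F.normalize(frame_emb, dim=1)
    labels = torch.arange(v.size(0), device=v.device)

    # Forward hop: soft nearest neighbour of each video among all frames.
    v2f = torch.softmax(v @ f.t() / temperature, dim=1)   # (B, B)
    # Backward hop: soft nearest neighbour of each frame among all videos.
    f2v = torch.softmax(f @ v.t() / temperature, dim=1)   # (B, B)
    # Cycle probability of landing back on the starting video.
    cycle = v2f @ f2v                                      # (B, B), rows sum to 1
    loss_cycle = F.nll_loss(torch.log(cycle + 1e-8), labels)

    # Contrastive terms: each video (frame) should be distant from all other
    # frames (videos) in the batch; its own frame (video) acts as the positive.
    loss_video = F.cross_entropy(v @ f.t() / temperature, labels)
    loss_frame = F.cross_entropy(f @ v.t() / temperature, labels)

    # w1, w2, w3 are the balance parameters mentioned in the paper;
    # the values here are purely illustrative.
    w1 = w2 = w3 = 1.0
    return w1 * loss_cycle + w2 * loss_video + w3 * loss_frame
```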
Methods
  • (Table fragment: Top-1/Top-5/Top-10 retrieval and MSE entries for COP ('19), SpeedNet ('20), and CCL (ours).)
  • The mini-batch size is set to 48 videos, and training uses the SGD optimizer with a learning rate of 0.0001.
  • The authors check how well cycle-consistency is satisfied by the learned representations via nearest neighbour retrieval between videos and their frames, designed as tasks (A) and (B).
  • The performance on task (C) is used to check how well the contrastive features are learned in the representations (a minimal retrieval-evaluation sketch follows this list).
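A minimal sketch of the nearest-neighbour retrieval evaluation, assuming cosine similarity between query and gallery embeddings and that a retrieval counts as correct when a top-k item shares the query's action class; the function name and the matching criterion are assumptions, not the paper's exact protocol. For task (A) the queries would be video embeddings and the gallery frame embeddings; task (B) swaps the roles.

```python
import torch
import torch.nn.functional as F

def topk_retrieval_accuracy(query_emb, query_labels,
                            gallery_emb, gallery_labels, ks=(1, 5, 10)):
    """Top-k retrieval accuracy under a same-class matching criterion."""
    q = F.normalize(query_emb, dim=1)
    g = F.normalize(gallery_emb, dim=1)
    sim = q @ g.t()                               # cosine similarity (Nq, Ng)
    topk_idx = sim.topk(max(ks), dim=1).indices   # indices of nearest gallery items
    hits = gallery_labels[topk_idx] == query_labels.unsqueeze(1)
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```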
Results
  • The authors report the Top-1 classification accuracy (%) on test split 1 of UCF101 and HMDB51, and compare with results of existing self-supervised methods that fine-tune all layers, in Table 4.
  • The network is initialized by CCL pre-trained on Kinetics and fine-tuned on MMAct. Table 6 shows the F-measure of this approach and of a VGG-16-based fully supervised method, TSN [31], as reported in [23].
  • The F-measure of CCL (FC), which fine-tunes only the FC layer while keeping the CCL-initialized network fixed, is still lower than that of random initialization.
  • The F-measure improves by +3.9 and +4.7 points compared with random initialization, showing that the representation learned by CCL is general and helpful for transfer.
Conclusion
  • The authors proposed a new self-supervised video representation learning method that exploits the belonging and inclusion relation between a video and its frames through a cycle-contrastive loss.
  • Various machine-learning-based video understanding tasks, such as action recognition/detection, sports action scoring, and video recommendation, can benefit from a good video representation by fine-tuning the self-supervised model on different tasks with smaller amounts of annotated data.
  • Using a transferable video representation frees practitioners from task-specific data accumulation and allows more focus on task design.
Tables
  • Table1: Network architecture considered in our experiments. Convolutional residual blocks are shown in brackets, next to the number of times each block is repeated in the stack. The dimensions of kernels are denoted by {T × H × W, C} for temporal, spatial height, width and channel sizes. The series of convolutions culminates with a global spatio-temporal pooling layer that yields a 512-dimensional feature vector as the video representation (a minimal stand-in backbone sketch follows this list)
  • Table2: Results of nearest neighbour retrieval on UCF101 of our proposal. F and V denote frame and video; the left side of ⇒ is used as the query
  • Table3: Retrieval of frame- and video-level results on UCF101 compared with other methods
  • Table4: Comparison with other self-supervised video representation learning methods by finetuning all layers
  • Table5: Comparison with other self-supervised video representation learning methods under linear classification protocol
  • Table6: Comparison results on MMAct
  • Table7: Ablation on our cycle-contrastive loss. Top-1 accuracy for each method
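Table 1 specifies the paper's exact backbone; as a hedged stand-in with the same output interface (stacked 3D residual blocks followed by global spatio-temporal pooling into a 512-dimensional video representation), torchvision's R3D-18 can be used. The class name below and the choice of R3D-18 are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class VideoEncoder(nn.Module):
    """Stand-in 3D-ResNet backbone yielding a 512-d pooled video feature."""
    def __init__(self):
        super().__init__()
        backbone = r3d_18()          # randomly initialized by default
        backbone.fc = nn.Identity()  # drop the classification head
        self.backbone = backbone

    def forward(self, clip):         # clip: (B, 3, T, H, W)
        return self.backbone(clip)   # (B, 512) video representation

encoder = VideoEncoder()
feat = encoder(torch.randn(2, 3, 16, 112, 112))
print(feat.shape)                    # torch.Size([2, 512])
```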
Related work
  • Self-supervised learning on video understanding. Self-supervised learning methods have been proposed to learn general video representations from unlabeled data in various works [4, 5, 3, 36]. Kim et al. [13] and Xu et al. [36] proposed to learn video representations from a temporal order prediction pretext task. Wang et al. [30] proposed to capture information from both motion and appearance statistics along spatial and temporal dimensions to learn representations from unlabeled video. Han et al. [8] proposed a pretext task of predicting the future representations of a clip based on the recent past. Benaim et al. [1] proposed a novel pretext task of predicting whether objects in a video move faster than, at, or slower than their natural speed. On the other hand, using modalities beyond vision is also a typical way to learn a robust video representation. Sun et al. [26] proposed to use video and text sequences for cross-modal learning in the self-training phase. Xiao et al. [34] proposed slow and fast visual pathways deeply integrated with a faster audio pathway to model vision and sound in a unified representation. Our work uses only the vision modality for self-supervised learning. Unlike existing vision-only works, which mostly focus on correspondences across frames or clips, our work finds correspondences between frames and videos to learn the representation. Tschannen et al. [27] proposed to apply different pretext tasks at the frame/shot level (augmentation consistency) and the video level (future shot prediction consistency), respectively. In contrast, CCL exploits the natural belonging and inclusion cycle-consistency relation between frames and videos together with contrastive representations.
Study subjects and analysis
datasets: 4
Here w1, w2 and w3 are the balance parameters of the overall loss. In this section, we evaluate the effectiveness of our representation learning approach on four datasets, Kinetics-400 [12], UCF101 [25], HMDB51 [15] and MMAct [23], under standard evaluation protocols. The learned network backbones are evaluated via two tasks: nearest neighbor retrieval and action recognition

distinct subjects: 20 (37 action classes, 4 fixed surveillance camera views)
Both of them are 25 fps videos. For further generalizability checking, we also evaluate our method on the MMAct dataset, which contains more than 36k video clips at 30 fps from 20 distinct subjects with 37 action classes under 4 fixed surveillance camera views. Kinetics-400 is a large-scale video action dataset containing 400 human action classes, with at least 400 video clips at 25 fps for each action

action recognition datasets: 3
The main goal of unsupervised learning is to train a model that can be transferred to other supervised tasks. We use action recognition as the downstream task for evaluating generalizability, and thus fine-tune our self-supervised network on three action recognition datasets: UCF101, HMDB51 and MMAct. (Compared methods include Shuffle&Learn [20] ('16), VGAN [29] ('16), Luo et al. [19] ('17), OPN [16] ('17), Büchler et al. [2] ('18), MAS [30] ('19), COP [36] ('19), ST-puzzle [13] ('19), DPC [8] ('19), SpeedNet [1] ('20), and ImageNet pre-training.)
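A minimal sketch of this fine-tuning stage, assuming the 512-d encoder sketched earlier and a single linear classification head; the function name, epoch count, and optimizer settings are illustrative assumptions rather than the paper's exact configuration. For the linear classification protocol (Table 5), the encoder parameters would instead be frozen.

```python
import torch
import torch.nn as nn

def finetune(encoder, loader, num_classes, epochs=30, lr=1e-4):
    """Fine-tune all layers of a pre-trained encoder for action recognition."""
    model = nn.Sequential(encoder, nn.Linear(512, num_classes))
    optimiser = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:      # clips: (B, 3, T, H, W)
            optimiser.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimiser.step()
    return model
```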

References
  • [1] Benaim, S., Ephrat, A., Lang, O., Mosseri, I., Freeman, W.T., Rubinstein, M., Irani, M., Dekel, T.: SpeedNet: Learning the speediness in videos. In: CVPR (2020)
  • [2] Büchler, U., Brattoli, B., Ommer, B.: Improving spatiotemporal self-supervision by deep reinforcement learning. In: ECCV (2018)
  • [3] Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., Zisserman, A.: Temporal cycle-consistency learning. In: CVPR (2019)
  • [4] Fernando, B., Bilen, H., Gavves, E., Gould, S.: Self-supervised video representation learning with odd-one-out networks. In: CVPR (2017)
  • [5] Gan, C., Gong, B., Liu, K., Su, H., Guibas, L.J.: Geometry guided convolutional neural networks for self-supervised video representation learning. In: CVPR (2018)
  • [6] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: ICLR (2018)
  • [7] Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: CVPR (2006)
  • [8] Han, T., Xie, W., Zisserman, A.: Video representation learning by dense predictive coding. In: ICCVW (2019)
  • [9] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.B.: Momentum contrast for unsupervised visual representation learning. arXiv:1911.05722 (2019)
  • [10] Jing, L., Tian, Y.: Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv:1811.11387 (2018)
  • [11] Kang, G., Jiang, L., Yang, Y., Hauptmann, A.G.: Contrastive adaptation network for unsupervised domain adaptation. In: CVPR (2019)
  • [12] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, A., Suleyman, M., Zisserman, A.: The Kinetics human action video dataset. arXiv:1705.06950 (2017)
  • [13] Kim, D., Cho, D., Kweon, I.S.: Self-supervised video representation learning with space-time cubic puzzles. In: AAAI (2019)
  • [14] Kim, Y., Yoo, B., Kwak, Y., Choi, C., Kim, J.: Deep generative-contrastive networks for facial expression recognition. arXiv:1703.07140 (2017)
  • [15] Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: A large video database for human motion recognition. In: ICCV (2011)
  • [16] Lee, H.Y., Huang, J.B., Singh, M.K., Yang, M.H.: Unsupervised representation learning by sorting sequences. In: ICCV (2017)
  • [17] Lin, Z., Feng, M., dos Santos, C.N., Yu, M., Xiang, B., Zhou, B., Bengio, Y.: A structured self-attentive sentence embedding. arXiv:1703.03130 (2017)
  • [18] Lotter, W., Kreiman, G., Cox, D.: Deep predictive coding networks for video prediction and unsupervised learning. In: ICLR (2017)
  • [19] Luo, Z., Peng, B., Huang, D.A., Alahi, A., Fei-Fei, L.: Unsupervised learning of long-term motion dynamics for videos. In: CVPR (2017)
  • [20] Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: Unsupervised learning using temporal order verification. In: ECCV (2016)
  • [21] Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)
  • [22] van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv:1807.03748 (2018)
  • [23] Kong, Q., Wu, Z., Deng, Z., Klinkigt, M., Tong, B., Murakami, T.: MMAct: A large-scale dataset for cross modal human action understanding. In: ICCV (2019)
  • [24] Sermanet, P., Lynch, C., Hsu, J., Levine, S.: Time-contrastive networks: Self-supervised learning from multi-view observation. In: CVPRW (2017)
  • [25] Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402 (2012)
  • [26] Sun, C., Baradel, F., Murphy, K., Schmid, C.: Learning video representations using contrastive bidirectional transformer. arXiv preprint (2019)
  • [27] Tschannen, M., Djolonga, J., Ritter, M., Mahendran, A., Houlsby, N., Gelly, S., Lucic, M.: Self-supervised learning of video-induced visual invariances. In: CVPR (2020)
  • [28] Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating visual representations from unlabeled video. In: CVPR (2016)
  • [29] Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: NIPS (2016)
  • [30] Wang, J., Jiao, J., Bao, L., He, S., Liu, Y., Liu, W.: Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In: CVPR (2019)
  • [31] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: ECCV (2016)
  • [32] Wang, X., Jabri, A., Efros, A.A.: Learning correspondence from the cycle-consistency of time. In: CVPR (2019)
  • [33] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR (2018)
  • [34] Xiao, F., Lee, Y., Grauman, K., Malik, J., Feichtenhofer, C.: Audiovisual SlowFast networks for video recognition. arXiv:2001.08740 (2020)
  • [35] Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., Zhuang, Y.: Self-supervised spatiotemporal learning via video clip order prediction. In: CVPR (2019)
  • [36] Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., Zhuang, Y.: Self-supervised spatiotemporal learning via video clip order prediction. In: CVPR (2019)
  • [37] Zhou, T., Krähenbühl, P., Aubry, M., Huang, Q.X., Efros, A.A.: Learning dense correspondence via 3D-guided cycle consistency. In: CVPR (2016)
  • [38] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)
Authors
Quan Kong
Wenpeng Wei
Ziwei Deng
Tomoaki Yoshinaga
Tomokazu Murakami