Self-supervised Co-Training for Video Representation Learning

NeurIPS 2020


Abstract

The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme, CoCLR, exploiting the complementary information from different views of the same data source; and (iii) we thoroughly evaluate the quality of the learnt representation on two downstream tasks, video action recognition and retrieval, on UCF101 and HMDB51.
Introduction
  • The recent progress in self-supervised representation learning for images and videos has demonstrated the benefits of using a discriminative contrastive loss on data samples [12, 13, 27, 28, 45, 59], such as NCE [24, 34].
  • The transformations can be artificial, such as those used in data augmentation [12], or natural, such as those arising in videos from temporal segments within the same clip
  • In essence, these pretext tasks focus on instance discrimination: each data sample is treated as a ‘class’, and the objective is to discriminate its own augmented version from a large number of other data samples or their augmented versions.
  • In the experiments, training with UberNCE outperforms the supervised model trained with cross-entropy, a phenomenon also observed in concurrent work [36] for image classification; both loss forms are sketched below.
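For orientation, here is a schematic of the two losses being compared; this is a minimal sketch with assumed notation (embeddings z, temperature τ, negative set N(i), oracle positive set P(i)), not the paper's exact equations:

```latex
% InfoNCE: only the augmented copy z_{i'} of sample i is treated as a positive.
\mathcal{L}_{\mathrm{InfoNCE}} =
  -\log \frac{\exp(z_i \cdot z_{i'}/\tau)}
             {\exp(z_i \cdot z_{i'}/\tau) + \sum_{j \in N(i)} \exp(z_i \cdot z_j/\tau)}

% UberNCE (oracle): every sample p sharing the semantic class label of i is a
% positive, i.e. a supervised contrastive loss averaged over the positive set.
\mathcal{L}_{\mathrm{UberNCE}} =
  -\frac{1}{|P(i)|} \sum_{p \in P(i)}
   \log \frac{\exp(z_i \cdot z_p/\tau)}
             {\exp(z_i \cdot z_p/\tau) + \sum_{j \in N(i)} \exp(z_i \cdot z_j/\tau)}
```

The linear-probe gap between these two settings (Section 4.4) is what motivates mining harder positives without labels.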
Highlights
  • The recent progress in self-supervised representation learning for images and videos has demonstrated the benefits of using a discriminative contrastive loss on data samples [12, 13, 27, 28, 45, 59], such as NCE [24, 34]
  • We target self-supervised video representation learning, and ask the question: is instance discrimination making the best use of data? We show that the answer is no, in two respects: first, hard positives are being neglected in the self-supervised training, and if these hard positives are included the quality of the learnt representation improves significantly; second, a complementary view of the same video (optical flow) can be used to bridge the gap between RGB clips of the same class and thereby recover such positives
  • We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or from both, and make the following contributions: (i) we show that an oracle with access to semantic class labels improves the performance of instance-based contrastive learning; (ii) we propose a novel self-supervised co-training scheme, CoCLR, to improve the training regime of the popular Info Noise Contrastive Estimation (InfoNCE), exploiting the complementary information from different views of the same data source (a mining sketch is given after this list); and (iii) we thoroughly evaluate the quality of the learnt representation on two downstream tasks, video action recognition and retrieval, on UCF101 and HMDB51
  • As will be demonstrated in Section 4.4, we evaluate the representation on a linear probe protocol, and observe a significant performance gap between training on InfoNCE and UberNCE, confirming that instance discrimination is not making the best use of data
  • When compared with our previous work that used InfoNCE for video self-supervision, DPC and MemDPC [25, 26], the proposed CoCLR incorporates learning from potentially harder positives, e.g. instances from the same class, rather than from only different augmentations of the same instance. CoCLR also differs from the oracle proposals of UberNCE, since both the CoCLR positive and negative sets may still contain ‘label’ noise, i.e. class-wise false positives and false negatives
  • We have shown that a complementary view of video can be used to bridge the gap between RGB video clip instances of the same class, and that using this to generate positive training sets substantially improves the performance over InfoNCE instance training for video representations
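To make the co-training scheme concrete, the following is a minimal sketch of the positive-mining step, not the authors' implementation; the names rgb_bank, flow_bank, k and tau are illustrative assumptions, with the flow network held fixed while the RGB network is trained (the two then swap roles, giving the alternation discussed with Table 1):

```python
# Minimal CoCLR-style mining sketch (assumptions: precomputed, L2-normalised
# embedding banks of shape (N, D) for each view; PyTorch).
import torch

def mine_positives(anchor_idx, flow_bank, k=5):
    """Use the fixed flow view to pick the top-k most similar clips; these
    act as class-level positives for the RGB view (cf. K = 5 in Table 1)."""
    sims = flow_bank @ flow_bank[anchor_idx]   # (N,) cosine similarities
    sims[anchor_idx] = float("-inf")           # never select the anchor itself
    return sims.topk(k).indices

def multi_positive_nce(rgb_query, rgb_bank, pos_idx, tau=0.07):
    """Multi-instance InfoNCE: every mined clip counts as a positive,
    the remaining entries of the bank act as negatives."""
    logits = (rgb_bank @ rgb_query) / tau      # (N,) similarity logits
    log_prob = logits - torch.logsumexp(logits, dim=0)
    return -log_prob[pos_idx].mean()
```

In practice the anchor's own augmented copy is also kept as a positive, and the mined positive set may contain class-wise false positives, which is exactly the ‘label’ noise mentioned above.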
Methods
  • The comparison tables cover OPN [42], ST-Puzzle [37], VCOP [66], DPC [25], CBT [55], SpeedNet [6], MemDPC [26], CVRL [51], MIL-NCE [44], XDC [2], ELO [50], and CoCLR, including the two-stream CoCLR† variant
Results
  • The authors investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or from both, and make the following contributions: (i) the authors show that an oracle with access to semantic class labels improves the performance of instance-based contrastive learning; (ii) the authors propose a novel self-supervised co-training scheme, CoCLR, to improve the training regime of the popular InfoNCE, exploiting the complementary information from different views of the same data source; and (iii) the authors thoroughly evaluate the quality of the learnt representation on two downstream tasks, video action recognition and retrieval, on UCF101 and HMDB51.
Conclusion
  • When compared with the previous work that used InfoNCE for video self-supervision, DPC and MemDPC [25, 26], the proposed CoCLR incorporates learning from potentially harder positives, e.g. instances from the same class, rather than from only different augmentations of the same instance. CoCLR also differs from the oracle proposals of UberNCE, since both the CoCLR positive and negative sets may still contain ‘label’ noise, i.e. class-wise false positives and false negatives.
  • CMC extends positives to include different views, RGB and flow, of the same video clip, but does not introduce positives between clips; CVRL uses InfoNCE contrastive learning with video clips as the instances.
  • The sound of a guitar can link together video clips with very different visual appearances, even if the audio network is relatively untrained.
  • This observation in part explains the success of audio-visual self-supervised learning.
  • The authors expect that the success of explicit positive mining in CoCLR will lead to applications of CoCLR to other data, e.g. images, other modalities and tasks where other views can be extracted to provide complementary information, and to other learning methods, such as BYOL [23]
Tables
  • Table1: Representations from InfoNCE, UberNCE and CoCLR are evaluated on downstream action classification and retrieval. ‘Left’ refers to the pre-training setting. CMC§ is our implementation for a fair comparison to CoCLR, i.e. S3D architecture, trained for 500 epochs. † refers to results from two-stream networks (RGB + Flow). Cross-Ent. is end-to-end training with softmax cross-entropy. In terms of the number of samples mined in Eq. 3 and Eq. 5, K = 5 is the optimal setting, i.e. the Top-5 most similar samples are used to train the target representation; other values, K = 1 and K = 50, are slightly worse. In terms of alternation granularity, we compare with the extreme case where the two representations are optimized simultaneously (CoCLR K=5, sim); again, this performs slightly worse than training one network with the other fixed to ‘act as’ an oracle. We conjecture that the inferior performance of simultaneous optimization is because the network weights are updated too fast; similar phenomena have been observed in other works [27, 56], and we leave further investigation of this to future work
  • Table2: Comparison with state-of-the-art approaches. In the left columns, we show the pre-training setting, e.g. dataset, resolution, architecture with encoder depth, modality. In the right columns, the top-1 accuracy is reported on the downstream action classification task for different datasets, e.g. UCF, HMDB, K400. The parenthesis after each dataset shows its total video duration (d for day, y for year). ‘Frozen ✗’ means the network is end-to-end finetuned from the pretrained representation, shown in the top half of the table; ‘Frozen ✓’ means the pretrained representation is fixed and classified with a linear layer, shown in the bottom half. For input, ‘V’ refers to visual only (colored blue), ‘A’ is audio, ‘T’ is text narration. CoCLR models with † refer to the two-stream networks, where the predictions from the RGB and Flow networks are averaged
  • Table3: Comparison with others on nearest-neighbour video retrieval on UCF101 and HMDB51. Testing set clips are used to retrieve training set videos and R@k is reported, where k ∈ {1, 5, 10, 20}. Note that all the models reported were pretrained only on UCF101 with self-supervised learning, except SpeedNet. CoCLR† refers to the two-stream network, where the feature similarity scores from the RGB and Flow networks are averaged
  • Table4: Feature encoder architecture at the pretraining stage. ‘FC-1024’ and ‘FC-128’ denote the output dimension of each fully-connected layer respectively
  • Table5: Classifier architecture for evaluating the representation on action classification tasks. ‘FC-num_class’ denotes a fully-connected layer whose output dimension is the number of action classes (both heads are sketched below)
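A minimal PyTorch sketch of the two heads described in Tables 4 and 5; the backbone feature dimension feat_dim and the non-linearity between the two FC layers are assumptions, not the authors' code:

```python
import torch.nn as nn

def projection_head(feat_dim: int) -> nn.Module:
    # Pre-training head (Table 4): FC-1024 followed by FC-128; the 128-d output
    # feeds the contrastive loss. The intermediate ReLU is an assumption.
    return nn.Sequential(
        nn.Linear(feat_dim, 1024),
        nn.ReLU(inplace=True),
        nn.Linear(1024, 128),
    )

def linear_classifier(feat_dim: int, num_class: int) -> nn.Module:
    # Evaluation head (Table 5): a single FC-num_class layer, i.e. the linear
    # probe placed on top of the frozen pretrained representation.
    return nn.Linear(feat_dim, num_class)
```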
Related Work
  • Visual-only Self-supervised Learning. Self-supervised visual representation learning has recently witnessed rapid progress in image classification. Early work in this area defined proxy tasks explicitly, for example, colorization, inpainting, and jigsaw solving [15, 16, 48, 68]. More recent approaches jointly optimize clustering and representation learning [5, 9, 10] or learn visual representation by discriminating instances from each other through contrastive learning [12, 27, 28, 29, 32, 45, 57, 59, 70]. Videos offer additional opportunities for learning representations, beyond those of images, by exploiting spatio-temporal information, for example, by ordering frames or clips [21, 42, 46, 64, 66], motion [1, 14, 31], co-occurrence [30], jigsaw [37], rotation [33], speed prediction [6, 17, 62], future prediction [25, 26, 60], or by temporal coherence [40, 41, 61, 63].

    Multi-modal Self-supervised Learning. This research area focuses on leveraging the interplay of different modalities, for instance, contrastive loss is used to learn the correspondence between frames and audio [2, 3, 4, 38, 49, 50], or video and narrations [44]; or, alternatively, an iterative clustering and re-labelling approach for video and audio has been used in [2].
Funding
  • Funding for this research is provided by a Google-DeepMind Graduate Scholarship, and by the EPSRC Programme Grant Seebibyte EP/M013774/1
Study Subjects and Analysis
Samples per GPU: 32
The learning rate is decayed by a factor of 10 twice, whenever the validation loss plateaus. Each experiment is trained on 4 GPUs, with a batch size of 32 samples per GPU (a scheduler sketch is given below).
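A minimal sketch of this schedule with PyTorch's ReduceLROnPlateau; the optimiser, patience value and placeholder model below are assumptions, since the summary does not show the authors' training script:

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(128, 101)                 # placeholder model (101 classes)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# factor=0.1 matches the "decayed by 1/10" rule; patience is an assumption.
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=5)

for epoch in range(100):
    val_loss = torch.rand(1).item()               # stand-in for the real validation loss
    scheduler.step(val_loss)                      # reduces the lr when the loss plateaus
```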

References
  • P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In Proc. ICCV, pages 37–45. IEEE, 2015. 3
  • H. Alwassel, D. Mahajan, L. Torresani, B. Ghanem, and D. Tran. Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667, 2019. 3, 8, 9
  • R. Arandjelović and A. Zisserman. Look, listen and learn. In Proc. ICCV, 2017. 3, 9
  • R. Arandjelović and A. Zisserman. Objects that sound. In Proc. ECCV, 2018. 3, 9
  • Y. M. Asano, C. Rupprecht, and A. Vedaldi. Self-labelling via simultaneous clustering and representation learning. In Proc. ICLR, 2020. 2
  • S. Benaim, A. Ephrat, O. Lang, I. Mosseri, W. T. Freeman, M. Rubinstein, M. Irani, and T. Dekel. SpeedNet: Learning the Speediness in Videos. In Proc. CVPR, 2020. 3, 8, 9
  • A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998. 3
  • U. Büchler, B. Brattoli, and B. Ommer. Improving spatiotemporal self-supervision by deep reinforcement learning. In Proc. ECCV, 2019
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In Proc. ECCV, 2018. 2
  • M. Caron, P. Bojanowski, J. Mairal, and A. Joulin. Unsupervised pre-training of image features on non-curated data. In Proc. ICCV, 2019. 2
  • J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proc. CVPR, 2017. 5
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020. 1, 2, 3, 5
  • X. Chen, H. Fan, R. Girshick, and K. He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020. 1, 5
  • A. Diba, V. Sharma, L. V. Gool, and R. Stiefelhagen. DynamoNet: Dynamic Action and Motion Network. In Proc. ICCV, 2019. 3, 8
  • C. Doersch, A. Gupta, and A. Efros. Unsupervised visual representation learning by context prediction. In Proc. ICCV, 2015. 2
  • C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. In Proc. ICCV, 2017. 2
  • D. Epstein, B. Chen, and C. Vondrick. Oops! predicting unintentional action in video. In Proc. CVPR, 2020. 3
  • C. Feichtenhofer. X3D: Expanding Architectures for Efficient Video Recognition. In Proc. CVPR, 2020. 3
  • C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast Networks for Video Recognition. In Proc. ICCV, 2019. 3, 7
  • C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Proc. CVPR, 2016. 3
  • B. Fernando, H. Bilen, E. Gavves, and S. Gould. Self-supervised video representation learning with odd-one-out networks. In Proc. ICCV, 2017. 2
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014. 3
  • J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020. 9
  • M. U. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010. 1
  • T. Han, W. Xie, and A. Zisserman. Video representation learning by dense predictive coding. In Workshop on Large Scale Holistic Video Understanding, ICCV, 2019. 3, 5, 6, 8
  • T. Han, W. Xie, and A. Zisserman. Memory-augmented dense predictive coding for video representation learning. In Proc. ECCV, 2020. 3, 5, 6, 8, 9
  • K. He, H. Fan, A. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proc. CVPR, 2020. 1, 2, 5, 7, 13
  • O. J. Hénaff, A. Razavi, C. Doersch, S. M. A. Eslami, and A. van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019. 1, 2
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. In Proc. ICLR, 2019. 2
  • P. Isola, D. Zoran, D. Krishnan, and E. H. Adelson. Learning visual groups from co-occurrences in space and time. In Proc. ICLR, 2015. 3
  • D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In Proc. ICCV, 2015. 3
  • X. Ji, J. F. Henriques, and A. Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In Proc. ICCV, pages 9865–9874, 2019. 2
  • L. Jing and Y. Tian. Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387, 2018. 3, 8
  • R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016. 1
  • W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. 5
  • P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020. 1
  • D. Kim, D. Cho, and I. S. Kweon. Self-supervised video representation learning with space-time cubic puzzles. In AAAI, 2019. 3, 8
  • B. Korbar, D. Tran, and L. Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018. 3, 8, 9
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In Proc. ICCV, pages 2556–2563, 2011. 5
  • Z. Lai, E. Lu, and W. Xie. MAST: A memory-augmented self-supervised tracker. In Proc. CVPR, 2020. 3
  • Z. Lai and W. Xie. Self-supervised learning for video correspondence flow. In Proc. BMVC, 2019. 3
  • H. Lee, J. Huang, M. Singh, and M. Yang. Unsupervised representation learning by sorting sequences. In Proc. ICCV, 2017. 2, 8, 9
  • D. Luo, C. Liu, Y. Zhou, D. Yang, C. Ma, Q. Ye, and W. Wang. Video cloze procedure for self-supervised spatio-temporal learning. In AAAI, 2020. 6, 9
  • A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In Proc. CVPR, 2020. 3, 4, 8, 9
  • I. Misra and L. van der Maaten. Self-supervised learning of pretext-invariant representations. In Proc. CVPR, 2020. 1, 2
  • I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In Proc. ECCV, 2016. 2
  • M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proc. ECCV, pages 69–84. Springer, 2016. 9
  • D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proc. CVPR, 2016. 2
  • M. Patrick, Y. M. Asano, R. Fong, J. F. Henriques, G. Zweig, and A. Vedaldi. Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298, 2020. 3, 8, 9
  • A. Piergiovanni, A. Angelova, and M. S. Ryoo. Evolving losses for unsupervised video representation learning. In Proc. CVPR, 2020. 3, 8
  • R. Qian, T. Meng, B. Gong, M.-H. Yang, H. Wang, S. Belongie, and Y. Cui. Spatiotemporal contrastive video representation learning. arXiv preprint arXiv:2008.03800, 2020. 5, 7, 8
  • K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014. 3
  • K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 5
  • J. C. Stroud, D. A. Ross, C. Sun, J. Deng, and R. Sukthankar. D3D: distilled 3d networks for video action recognition. arXiv preprint arXiv:1812.08249, 2018. 3
  • C. Sun, F. Baradel, K. Murphy, and C. Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019. 8
  • R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, 2000. 3, 7
  • Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. In Proc. ECCV, 2019. 2, 5, 6
  • D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proc. CVPR, 2018. 3
  • A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 1, 2
  • C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabelled video. In Proc. CVPR, 2016. 3
  • C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy. Tracking emerges by colorizing videos. In Proc. ECCV, 2018. 3
  • J. Wang, J. Jiao, and Y.-H. Liu. Self-supervised video representation learning by pace prediction. In Proc. ECCV, 2020.
  • X. Wang, A. Jabri, and A. A. Efros. Learning correspondence from the cycle-consistency of time. In Proc. CVPR, 2019.
  • D. Wei, J. Lim, A. Zisserman, and W. T. Freeman. Learning and using the arrow of time. In Proc. CVPR, 2018. 2
  • S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning for video understanding. In Proc. ECCV, 2018. 3, 5, 8
  • D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In Proc. CVPR, 2019. 2, 6, 8, 9
  • C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Pattern Recognition, 2007. 5
  • R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In Proc. ECCV, pages 649–666. Springer, 2016. 2
  • J. Zhao and C. Snoek. Dance with flow: Two-in-one stream action detection. In Proc. CVPR, 2019. 3
  • C. Zhuang, A. L. Zhai, and D. Yamins. Local aggregation for unsupervised learning of visual embeddings. In Proc. ICCV, 2019. 2