Self-Supervised MultiModal Versatile Networks

NeurIPS, 2020.


Abstract:

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose repre...

Introduction
  • From as far back as the crib, we perceive through multi-sensory systems: for instance, we watch the flames dancing in the fireplace, hear the sound of the crackling wood, and feel the heat coming off.
  • Through this multimodal synchronous perception, we learn to draw useful connections between modalities [66], which, in turn, enables us to form good representations of the world.
  • The authors seek to learn a multimodal versatile network, defined as a network that has the following four properties: (i) it should be able to take as input any of the three modalities; (ii) it should respect the specificity of modalities, in particular the fact that audio and vision are much more fine-grained than language; (iii) it should enable the different modalities to be compared even when they are never seen together during training; and (iv) it should be efficiently applicable to visual data coming in the form of dynamic videos or static images
Highlights
  • Our experience of the world is multimodal
  • We seek to learn a multimodal versatile network, defined as a network that has the following four properties: (i) it should be able to take as input any of the three modalities; (ii) it should respect the specificity of modalities, in particular the fact that audio and vision are much more fine-grained than language; (iii) it should enable the different modalities to be compared even when they are never seen together during training; and (iv) it should be efficiently applicable to visual data coming in the form of dynamic videos or static images
  • An important note is that, since the text modality is obtained directly from the audio track using Automatic Speech Recognition (ASR), we do not explicitly construct an audio-text space or a loss that aligns the two. This is because our goal is not to learn ASR but instead to associate a word, e.g. “car”, with the sound associated with that entity, e.g. the sound produced by the engine
  • To comply with property (iv) of the multimodal versatile network, we introduce a network deflation operation that transforms a video network into a network that can ingest a single image (a sketch follows this list)
  • The state-of-the-art self-supervised models trained on images (SimCLR [11]) outperform MMV because they do not have to bridge the video-image domain gap and have been trained on ImageNet images; the performance difference is much smaller on PASCAL
  • In this paper we have explored how to train versatile networks for vision, audio and language in a self-supervised manner
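  • Code sketch (illustrative, not the authors' code): one plausible reading of the “naive” deflation from the Table 3 caption, where a 3D convolution is collapsed into a 2D convolution by summing its kernel over the temporal dimension; the proposed deflation additionally retrains the batch-norm parameters γ and β. The helper name deflate_conv3d, the PyTorch framing and the toy shapes are assumptions.

    # Minimal sketch (assumed): "naive" deflation of a Conv3d into a Conv2d so the
    # network can ingest a single image instead of a clip.
    import torch
    import torch.nn as nn

    def deflate_conv3d(conv3d: nn.Conv3d) -> nn.Conv2d:
        """Collapse a Conv3d (out, in, T, kH, kW) into a Conv2d by summing over T."""
        out_c, in_c, t, kh, kw = conv3d.weight.shape
        conv2d = nn.Conv2d(in_c, out_c, kernel_size=(kh, kw),
                           stride=conv3d.stride[1:], padding=conv3d.padding[1:],
                           bias=conv3d.bias is not None)
        with torch.no_grad():
            conv2d.weight.copy_(conv3d.weight.sum(dim=2))  # sum the kernel over time
            if conv3d.bias is not None:
                conv2d.bias.copy_(conv3d.bias)
        return conv2d

    # Sanity check: with zero temporal padding, the deflated conv reproduces the
    # response of the 3D conv on a static clip made by repeating one frame.
    conv3d = nn.Conv3d(3, 8, kernel_size=(3, 3, 3), padding=(0, 1, 1))
    conv2d = deflate_conv3d(conv3d)
    frame = torch.randn(1, 3, 32, 32)
    clip = frame.unsqueeze(2).repeat(1, 1, 3, 1, 1)        # a 3-frame "static video"
    print(torch.allclose(conv3d(clip)[:, :, 0], conv2d(frame), atol=1e-5))  # True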
Methods
  • Table 3 compares MMV S3D-G variants for image handling (naive deflation, deflation, input-inflation), trained on AS+HT, against a supervised TSM baseline, with linear evaluation on PASCAL and ImageNet.
  • The image-pretrained baseline in Table 3 is SimCLR [11] with a ResNet50x4 backbone trained on ImageNet; zero-shot retrieval (Table 4) is reported as R@1, R@5, R@10 (higher is better) and median rank (MedR, lower is better).
  • Linear on PASCAL/ImageNet. The authors evaluate the deflated networks using a linear classifier on PASCAL and ImageNet (see the sketch after this list).
  • For TSM, the authors run the image through the backbone network without any channel shift.
  • The authors use both train and validation sets as training data.
  • The authors resize the images to a fixed minimum side length.
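  • Code sketch (illustrative): the frozen linear-evaluation protocol described above, with scikit-learn's logistic regression standing in for the linear classifier. The feature arrays are random placeholders for activations extracted offline from the deflated backbone, and the feature dimension and class count are assumptions.

    # Minimal sketch (assumed data): linear classification on frozen features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    train_feats, train_labels = rng.normal(size=(1000, 1024)), rng.integers(0, 10, size=1000)
    val_feats, val_labels = rng.normal(size=(200, 1024)), rng.integers(0, 10, size=200)
    test_feats, test_labels = rng.normal(size=(200, 1024)), rng.integers(0, 10, size=200)

    # Per the note above, both train and validation sets are used as training data.
    X = np.concatenate([train_feats, val_feats])
    y = np.concatenate([train_labels, val_labels])

    clf = LogisticRegression(max_iter=1000)                # the linear classifier
    clf.fit(X, y)
    print("top-1 accuracy:", clf.score(test_feats, test_labels))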
Results
  • Table 2 shows the visual and audio representations match or outperform the state-of-the-art on all downstream tasks and evaluation modes.
  • Table 3 shows that the deflated networks perform almost as well as the original video model applied on input-inflated 32-frame static videos.
  • The state-of-the-art self-supervised models trained on images (SimCLR [11]) outperform MMV because they do not have to bridge the video-image domain gap and have been trained on ImageNet images; the performance difference is much smaller on PASCAL.
  • The authors' approach is significantly better than pre-training in a fully supervised manner on Kinetics-700 [9]
Conclusion
  • In this paper the authors have explored how to train versatile networks for vision, audio and language in a self-supervised manner.
  • The authors' network can be used for zero-shot text-to-video retrieval (a sketch follows this list).
  • The authors' deflation process shows how to train on videos to obtain representations for still images.
  • Given the sheer number of videos available for self-supervised training on the web, the authors believe this is a more natural route to transfer which the authors hope will be pursued in the future
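  • Code sketch (illustrative): the zero-shot text-to-video retrieval mentioned above reduces to a dot-product ranking once text and videos are embedded in the shared space; the embedding dimension and the random stand-in vectors below are assumptions, not outputs of the trained network.

    # Minimal sketch (assumed embeddings): rank videos by similarity to a text query.
    import torch
    import torch.nn.functional as F

    def retrieve(text_embedding: torch.Tensor, video_embeddings: torch.Tensor, k: int = 10):
        """Return indices of the k videos closest to the query (cosine similarity)."""
        q = F.normalize(text_embedding, dim=-1)            # (d,)
        v = F.normalize(video_embeddings, dim=-1)          # (num_videos, d)
        return torch.topk(v @ q, k).indices

    video_bank = torch.randn(1000, 256)                    # stand-in for precomputed video embeddings
    query = torch.randn(256)                               # stand-in for the embedded text query
    print(retrieve(query, video_bank))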
Summary
  • Introduction:

    From as far back as the crib, we perceive through multi-sensory systems: for instance, we watch the flames dancing in the fireplace, hear the sound of the crackling wood, and feel the heat coming off.
  • Through this multimodal synchronous perception, we learn to draw useful connections between modalities [66], which, in turn, enables us to form good representations of the world.
  • The authors seek to learn a multimodal versatile network, defined as a network that has the following four properties: (i) it should be able to take as input any of the three modalities; (ii) it should respect the specificity of modalities, in particular the fact that audio and vision are much more fine-grained than language; (iii) it should enable the different modalities to be compared even when they are never seen together during training; and (iv) it should be efficiently applicable to visual data coming in the form of dynamic videos or static images
  • Objectives:

    The authors' objective is to learn representations from such multimodal experience in a self-supervised manner without resorting to any specific manual annotation.
  • The authors' goal is to learn a model that has the versatile properties described in Section 1.
  • Recall the goal is to be able to embed different modalities into a vector space where semantic comparisons can be made by simple dot products.
  • This is because the goal is not to learn ASR but instead to associate a word, e.g. “car”, with the sound associated with that entity, e.g. the sound produced by the engine (see the sketch at the end of this summary)
  • Methods:

    Table 3 compares MMV S3D-G variants for image handling (naive deflation, deflation, input-inflation), trained on AS+HT, against a supervised TSM baseline, with linear evaluation on PASCAL and ImageNet.
  • The image-pretrained baseline in Table 3 is SimCLR [11] with a ResNet50x4 backbone trained on ImageNet; zero-shot retrieval (Table 4) is reported as R@1, R@5, R@10 (higher is better) and median rank (MedR, lower is better).
  • Linear on PASCAL/ImageNet. The authors evaluate the deflated networks using a linear classifier on PASCAL and ImageNet.
  • For TSM, the authors run the image through the backbone network without any channel shift.
  • The authors use both train and validation sets as training data.
  • The authors resize the images to a fixed minimum side length.
  • Results:

    Table 2 shows the visual and audio representations match or outperform the state-of-the-art on all downstream tasks and evaluation modes.
  • Table 3 shows that the deflated networks perform almost as well as the original video model applied on input-inflated 32-frame static videos.
  • The state-of-the-art self-supervised models trained on images (SimCLR [11]) outperform MMV because they do not have to bridge the video-image domain gap and have been trained on ImageNet images; the performance difference is much smaller on PASCAL.
  • The authors' approach is significantly better than pre-training in a fully supervised manner on Kinetics-700 [9]
  • Conclusion:

    In this paper the authors have explored how to train versatile networks for vision, audio and language in a self-supervised manner.
  • The authors' network can be used for zero-shot text-to-video retrieval.
  • The authors' deflation process shows how to train on videos to obtain representations for still images.
  • Given the sheer number of videos available for self-supervised training on the web, the authors believe this is a more natural route to transfer which the authors hope will be pursued in the future
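  • Code sketch (illustrative): the objectives above amount to dot-product comparisons in shared spaces, trained with contrastive losses on vision-audio and vision-text pairs only. The generic InfoNCE-style loss, temperature and dimensions below are assumptions rather than the paper's exact NCE/MIL-NCE formulation; note that no audio-text loss appears, yet audio and text become comparable through the shared visual anchor.

    # Minimal sketch (assumed losses): pairwise contrastive alignment of modalities.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
        """Symmetric InfoNCE: matching rows of a and b are positives, all others negatives."""
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature                   # (B, B) dot-product similarities
        targets = torch.arange(a.size(0))                  # positives lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    B, d = 32, 256
    z_video, z_audio, z_text = torch.randn(B, d), torch.randn(B, d), torch.randn(B, d)

    loss = contrastive_loss(z_video, z_audio) + contrastive_loss(z_video, z_text)  # no audio-text term
    print(float(loss))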
Tables
  • Table1: Design explorations for multiple modalities (HT=HowTo100M, AS=AudioSet). The video networks use non-linear projection heads
  • Table2: Comparison of learnt representations versus the state-of-the-art. Results are averaged over all splits. The “Mod.” column shows which combinations of modalities are used by the methods, possibilities: Vision, Audio, Text, Flow. Training dataset abbreviations: AudioSet, HowTo100M, Instagram65M [20], SoundNet [6], 2M videos from YouTube8M [1]; their length in years is given in the “years” column. †[64] uses a non-linear classifier
  • Table3: Image classification results on PASCAL and ImageNet. “V)I” denotes the image handling strategy for the video networks: naive deflation (no training of γ and β), deflation (proposed), and input-inflation (video net ingesting 32-frame static videos)
  • Table4: Additional retrieval metrics for zero-shot text-to-video retrieval
  • Table5: Parameters for fine-tuning (FT) on downstream classification tasks
  • Table6: Effects of varying the visual backbone. All experiments use linear projection heads. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip
  • Table7: NCE vs logistic loss for Vision+Audio. All experiments use linear projection heads and the S3D-G network as the video backbone. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip
  • Table8: Effects of varying the projection heads. All experiments use the S3D-G network as the video backbone. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip. Best number is in bold. Second best is underlined
  • Table9: Effects of data augmentation for Vision+Audio. All experiments use linear projection heads and the S3D-G network as the video backbone. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip
Related work
  • Self-supervised learning from single modality. Self-supervised methods design pretext tasks that require no manual annotation but facilitate learning of useful representations of the data. A variety of pretext tasks have been developed for vision (i.e. single modality), such as predicting the relative position of patches [13, 49], colorization [83], predicting orientation [21] or invariance to transformation [15, 29]. In videos, works have also leveraged the temporal dimension [17, 37, 46, 79]. Recently, methods that maximise the similarity between multiple views (augmented versions) of the same image via contrastive losses [8, 11, 26, 27, 28, 50] stand out due to impressive results on the ImageNet benchmark; we draw inspiration from them (e.g. use a contrastive loss and nonlinear projection heads [11]). However, details of view generation are crucial and require careful design [71]. In contrast, we argue that using multiple modalities as different views is simpler and more natural [70].
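  • Code sketch (illustrative): the nonlinear projection heads borrowed from [11] and ablated in Table 8 are small MLPs applied to backbone features before the contrastive loss; the layer sizes below are arbitrary placeholders, and replacing the MLP with a single Linear layer gives the linear-head variant.

    # Minimal sketch (assumed sizes): a nonlinear projection head.
    import torch
    import torch.nn as nn

    class ProjectionHead(nn.Module):
        def __init__(self, in_dim: int = 1024, hidden_dim: int = 512, out_dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden_dim),
                nn.ReLU(inplace=True),                     # the nonlinearity
                nn.Linear(hidden_dim, out_dim),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    head = ProjectionHead()
    print(head(torch.randn(4, 1024)).shape)                # torch.Size([4, 256])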
Reference
  • S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016. 8
  • J.-B. Alayrac, P. Bojanowski, N. Agrawal, I. Laptev, J. Sivic, and S. Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016. 2
  • H. Alwassel, D. Mahajan, L. Torresani, B. Ghanem, and D. Tran. Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667, 2019. 2, 6, 8
  • R. Arandjelović and A. Zisserman. Look, listen and learn. In ICCV, 2017. 2, 4, 6, 16
  • R. Arandjelović and A. Zisserman. Objects that sound. In ECCV, 2018. 2
  • Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In NIPS, 2016. 8
  • Y. Aytar, C. Vondrick, and A. Torralba. See, hear, and read: Deep aligned representations. arXiv preprint arXiv:1706.00932, 2017. 2
  • P. Bachman, R. D. Hjelm, and W. Buchwalter. Learning representations by maximizing mutual information across views. In NeurIPS, 2019. 2
  • J. Carreira, E. Noland, C. Hillier, and A. Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019
  • J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017. 3, 5
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020. 2, 6, 7, 9, 14, 17
  • M. Chowdhury, P. Rameswar, E. Papalexakis, and A. Roy-Chowdhury. Webly supervised joint embedding for cross-modal image-text retrieval. In ACM MM, 2018. 2
  • C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015. 2
  • J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang. Dual encoding for zero-example video retrieval. In CVPR, 2019. 2, 3
  • A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, 2014. 2
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010. 6, 14
  • B. Fernando, H. Bilen, E. Gavves, and S. Gould. Self-supervised video representation learning with odd-one-out networks. In CVPR, 2017. 2
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In NIPS, 2013. 2
  • J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017. 6
  • D. Ghadiyaram, D. Tran, and D. Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019. 8
  • S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018. 2
  • R. Girdhar, D. Tran, L. Torresani, and D. Ramanan. Distinit: Learning video representations without a single labeled video. In ICCV, 2019. 3
  • Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 2014. 2
  • Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014. 2
  • D. Harwath, A. Recasens, D. Surís, G. Chuang, A. Torralba, and J. Glass. Jointly discovering visual objects and spoken words from raw sensory input. IJCV, pages 1–22, 2019. 2
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. 2, 6
  • O. J. Hénaff, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019. 2, 4
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018. 2
  • L. Jing and Y. Tian. Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387, 2018. 2
  • J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2019. 2
  • D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 8, 14
  • D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 6
  • B. Klein, G. Lev, G. Sadeh, and L. Wolf. Associating neural word embeddings with deep image representations using Fisher vectors. In CVPR, 2015. 2
  • A. Kolesnikov, X. Zhai, and L. Beyer. Revisiting self-supervised visual representation learning. In CVPR, 2019. 5
  • B. Korbar, D. Tran, and L. Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018. 2, 6, 8, 16
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011. 6
  • H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang. Unsupervised representation learning by sorting sequences. In ICCV, 2017. 2
  • J. Lin, C. Gan, and S. Han. TSM: Temporal shift module for efficient video understanding. In ICCV, 2019.
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014. 2
  • I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017. 6
  • J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich, and K. Murphy. What’s cookin’? Interpreting cooking videos using text, speech and vision. NAACL, 2015. 2
  • A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. In CVPR, 2020. 2, 3, 4, 5, 6, 7, 8, 14, 16
  • A. Miech, I. Laptev, and J. Sivic. Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv preprint arXiv:1804.02516, 2018. 2
  • A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic. Howto100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019. 3, 6, 8
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. 6
  • I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016. 2
  • N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In ICMR. ACM, 2018. 2
  • P. Morgado, N. Vasconcelos, and I. Misra. Audio-visual instance discrimination with cross-modal agreement. arXiv preprint arXiv:2004.12943, 2020. 8
  • M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016. 2
  • A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 2, 4
  • A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In ECCV, 2018. 2
  • A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In ECCV, 2016. 2
  • Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, 2016. 2
  • D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le. SpecAugment: A simple data augmentation method for automatic speech recognition. In InterSpeech, 2019. 7, 17
  • M. Patrick, Y. M. Asano, R. Fong, J. F. Henriques, G. Zweig, and A. Vedaldi. Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298, 2020. 6, 7, 8, 17
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. 13
  • K. J. Piczak. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, 2015. 6
  • A. Piergiovanni, A. Angelova, and M. S. Ryoo. Evolving losses for unsupervised video representation learning. In CVPR, 2020. 2, 8
  • B. A. Plummer, M. Brown, and S. Lazebnik. Enhancing video summarization via vision-language embedding. In CVPR, 2017. 2
  • B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015. 2
  • A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele. Movie description. IJCV, 2017. 2
  • A. Rouditchenko, A. Boggust, D. Harwath, D. Joshi, S. Thomas, K. Audhkhasi, R. Feris, B. Kingsbury, M. Picheny, A. Torralba, et al. AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. arXiv preprint arXiv:2006.09199, 2020. 2
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015. 6
  • H. B. Sailor, D. M. Agrawal, and H. A. Patil. Unsupervised filterbank learning using convolutional restricted boltzmann machine for environmental sound classification. In InterSpeech, 2017. 8
  • O. Sener, A. R. Zamir, S. Savarese, and A. Saxena. Unsupervised semantic parsing of video collections. In ICCV, December 2015. 2
  • L. Smith and M. Gasser. The development of embodied cognition: Six lessons from babies. Artificial life, 2005. 1
  • K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 6
  • C. Sun, F. Baradel, K. Murphy, and C. Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019. 2
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, 2019. 2
  • Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019. 2
  • Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020. 2
  • L. Wang, Y. Li, J. Huang, and S. Lazebnik. Learning two-branch neural networks for image-text matching tasks. PAMI, 2018. 2
  • L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016. 2
  • J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, 2011. 2
  • M. Wray, D. Larlus, G. Csurka, and D. Damen. Fine-grained action retrieval through multiple parts-ofspeech embeddings. In ICCV, 2019. 2
  • C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Sampling matters in deep embedding learning. ICCV, 2017. 2
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018. 6
  • S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speedaccuracy trade-offs in video classification. In ECCV, 2018. 5, 6, 8, 16
  • D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 2019. 2
  • J. Xu, T. Mei, T. Yao, and Y. Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016. 2, 6
  • R. Xu, C. Xiong, W. Chen, and J. J. Corso. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In AAAI, 2015. 2
  • S.-I. Yu, L. Jiang, and A. Hauptmann. Instructional videos for unsupervised harvesting and learning of action examples. In ACM, 2014. 2
  • R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016. 2
  • L. Zhou, C. Xu, and J. J. Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018. 2, 6