Self-Supervised MultiModal Versatile Networks
NeurIPS, 2020.
Abstract:
Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations …
Introduction
- The authors note that, from as far back as the crib, we perceive the world through multiple senses: we watch the flames dancing in the fireplace, hear the crackling wood, and feel the heat it gives off.
- Through this synchronous multimodal perception, we learn to draw useful connections between modalities [66], which in turn enables us to form good representations of the world.
- The authors seek to learn a multimodal versatile network, defined as a network that has the following four properties: (i) it should be able to take as input any of the three modalities; (ii) it should respect the specificity of modalities, in particular the fact that audio and vision are much more fine-grained than language; (iii) it should enable the different modalities to be compared even when they are never seen together during training; and (iv) it should be efficiently applicable to visual data coming in the form of dynamic videos or static images
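These four properties suggest a simple interface: one backbone per modality plus projection heads into shared embedding spaces, with a fine-grained vision-audio space and a coarser space that also contains text (the paper's fine-and-coarse design). The sketch below is a minimal PyTorch illustration of that interface, not the authors' implementation; the module name `VersatileNet` and the embedding dimensions are hypothetical placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class VersatileNet(nn.Module):
    """Minimal sketch of a multimodal versatile network: per-modality
    backbones feed projection heads into a fine-grained vision-audio (VA)
    space and a coarser vision-audio-text (VAT) space. Illustrative only."""

    def __init__(self, video_backbone, audio_backbone, text_backbone,
                 d_video=1024, d_audio=512, d_text=300, d_va=512, d_vat=256):
        super().__init__()
        self.video_backbone = video_backbone   # e.g. S3D-G or TSM features
        self.audio_backbone = audio_backbone   # e.g. a spectrogram CNN
        self.text_backbone = text_backbone     # e.g. pooled word embeddings
        self.video_to_va = nn.Linear(d_video, d_va)
        self.audio_to_va = nn.Linear(d_audio, d_va)
        self.va_to_vat = nn.Linear(d_va, d_vat)    # VA space -> coarser VAT space
        self.text_to_vat = nn.Linear(d_text, d_vat)

    def embed_video(self, frames):
        z = self.video_to_va(self.video_backbone(frames))
        return F.normalize(z, dim=-1), F.normalize(self.va_to_vat(z), dim=-1)

    def embed_audio(self, spectrogram):
        z = self.audio_to_va(self.audio_backbone(spectrogram))
        return F.normalize(z, dim=-1), F.normalize(self.va_to_vat(z), dim=-1)

    def embed_text(self, word_vectors):
        return F.normalize(self.text_to_vat(self.text_backbone(word_vectors)), dim=-1)
```

Because audio and video are both projected into the VA space before reaching the VAT space, text and audio can be compared by a dot product in the VAT space even though no audio-text loss is ever defined (property (iii)).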
Highlights
- Our experience of the world is multimodal
- We seek to learn a multimodal versatile network, defined as a network that has the following four properties: (i) it should be able to take as input any of the three modalities; (ii) it should respect the specificity of modalities, in particular the fact that audio and vision are much more fine-grained than language; (iii) it should enable the different modalities to be compared even when they are never seen together during training; and (iv) it should be efficiently applicable to visual data coming in the form of dynamic videos or static images
- An important note is that, since the text modality is obtained directly from the audio track using Automatic Speech Recognition (ASR), we do not explicitly construct an audio-text space or a loss that aligns audio and text. This is because our goal is not to learn ASR but to associate a word, e.g. “car”, with the sound associated with that entity, e.g. the sound produced by the engine (a sketch of the pairwise losses that are used follows this list)
- To comply with the property (iv) of the multimodal versatile network, we introduce a network deflation operation to transform a video network into a network that can ingest a single image
- The state-of-the-art self-supervised model trained on images (SimCLR [11]) outperforms MMV because it does not have to bridge the video-image domain gap and was trained directly on ImageNet images; the performance difference is much smaller on PASCAL
- In this paper we have explored how to train versatile networks for vision, audio and language in a self-supervised manner
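Below is a hedged sketch of the two pairwise self-supervised objectives this setup calls for: an NCE-style softmax loss for vision-audio pairs and a MIL-NCE-style loss for vision-text pairs (following the MIL-NCE idea of Miech et al., where each clip is paired with several candidate narrations). Batch construction, the temperature value, and the symmetric averaging are simplifying assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def nce_loss(video_emb, audio_emb, temperature=0.07):
    """Vision-audio NCE: co-occurring clips in the batch are positives,
    every other pairing acts as a negative. Inputs are L2-normalised."""
    logits = video_emb @ audio_emb.t() / temperature            # (B, B)
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def mil_nce_loss(video_emb, text_emb, temperature=0.07):
    """Vision-text MIL-NCE: each clip comes with K candidate narrations
    (noisy positives), whose scores are pooled with log-sum-exp.
    video_emb: (B, D), text_emb: (B, K, D), both L2-normalised."""
    B, K, D = text_emb.shape
    logits = video_emb @ text_emb.reshape(B * K, D).t() / temperature
    logits = logits.view(B, B, K)
    idx = torch.arange(B, device=video_emb.device)
    pos = torch.logsumexp(logits[idx, idx], dim=-1)             # (B,)
    denom = torch.logsumexp(logits.reshape(B, -1), dim=-1)      # (B,)
    return (denom - pos).mean()
```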
Methods
- Table 3 compares MMV S3D-G under naive deflation, deflation and input-inflation (trained on AS+HT) with supervised TSM, MMV TSM, and SimCLR [11] ResNet50x4 trained on ImageNet, evaluated on PASCAL and ImageNet.
- Table 4 reports the retrieval metrics R@1↑, R@5↑, R@10↑ and MedR↓.
- Linear on PASCAL/ImageNet: the authors evaluate the deflated networks using a linear classifier on PASCAL and ImageNet (a sketch of such a linear probe follows this list).
- For TSM, the authors run the image through the backbone network without any channel shift.
- The authors use both train and validation sets as training data.
- The authors resize the images so that the shorter side has a fixed length.
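The linear evaluation described in this list can be reproduced with a standard frozen-feature probe; the sketch below uses scikit-learn (which the authors also cite for their linear classifiers) on pre-extracted features. For multi-label PASCAL the metric would instead be mAP over per-class one-vs-rest classifiers; the single-label accuracy shown here is a simplification, and the feature dimensions and regularisation strength are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def linear_probe(train_feats, train_labels, test_feats, test_labels, C=1.0):
    """Fit a linear classifier on frozen features (train+val pooled as
    training data) and report top-1 accuracy on the test split."""
    scaler = StandardScaler().fit(train_feats)
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(scaler.transform(train_feats), train_labels)
    return clf.score(scaler.transform(test_feats), test_labels)

# Toy usage with random 512-d placeholder features and 10 classes.
rng = np.random.default_rng(0)
acc = linear_probe(rng.normal(size=(1000, 512)), rng.integers(0, 10, 1000),
                   rng.normal(size=(200, 512)), rng.integers(0, 10, 200))
```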
Results
- Table 2 shows the visual and audio representations match or outperform the state-of-the-art on all downstream tasks and evaluation modes.
- Table 3 shows that the deflated networks perform almost as well as the original video model applied to input-inflated 32-frame static videos (a sketch of the deflation operation follows this list).
- The state-of-the-art self-supervised model trained on images (SimCLR [11]) outperforms MMV because it does not have to bridge the video-image domain gap and was trained directly on ImageNet images; the performance difference is much smaller on PASCAL.
- The authors' approach is significantly better than pre-training in a fully supervised manner on Kinetics-700 [9]
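The deflation operation behind the result above can be sketched as follows for a 3D-convolutional backbone: each 3D kernel is summed over its temporal dimension to obtain a 2D kernel, and only the batch-norm affine parameters (γ and β) are then re-estimated so that image activations match what the video network produces on static, input-inflated clips; skipping that re-estimation corresponds to the "naive deflation" row of Table 3. The code below is a hedged sketch of the kernel-collapsing step, not the authors' exact procedure; layer traversal, padding corner cases and the γ/β re-estimation loop are omitted.

```python
import torch
import torch.nn as nn

def deflate_conv3d(conv3d: nn.Conv3d) -> nn.Conv2d:
    """Collapse a 3D convolution into a 2D one by summing its kernel over
    time, so a single image approximates the response the video network
    would give to a static clip."""
    conv2d = nn.Conv2d(conv3d.in_channels, conv3d.out_channels,
                       kernel_size=conv3d.kernel_size[1:],
                       stride=conv3d.stride[1:],
                       padding=conv3d.padding[1:],
                       groups=conv3d.groups,
                       bias=conv3d.bias is not None)
    with torch.no_grad():
        conv2d.weight.copy_(conv3d.weight.sum(dim=2))   # sum over the time axis
        if conv3d.bias is not None:
            conv2d.bias.copy_(conv3d.bias)
    return conv2d

# After replacing every Conv3d this way, all weights stay frozen except the
# BatchNorm gamma/beta, which are re-trained to align image activations with
# the video network's activations on 32-frame static videos.
```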
Conclusion
- In this paper the authors have explored how to train versatile networks for vision, audio and language in a self-supervised manner.
- The authors' network can be used for zero-shot text-to-video retrieval (a small retrieval sketch follows this list).
- The authors' deflation process shows how to train on videos and obtain representations for still images.
- Given the sheer number of videos available for self-supervised training on the web, the authors believe this is a more natural route to transfer, one they hope will be pursued further in the future
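Zero-shot text-to-video retrieval reduces to nearest-neighbour search by dot product in the shared vision-text space. The numpy sketch below shows the ranking step and the metrics reported in Table 4 (R@k and median rank); the embeddings are assumed to be L2-normalised outputs of a trained network, and the function names are illustrative.

```python
import numpy as np

def rank_videos(text_emb, video_emb):
    """For each text query, rank all videos by dot-product similarity
    in the shared vision-text space (inputs L2-normalised)."""
    return np.argsort(-(text_emb @ video_emb.T), axis=1)       # (Q, N)

def recall_at_k(rankings, gt_index, k):
    """Fraction of queries whose ground-truth video is ranked in the top k."""
    return float(np.mean([gt in rankings[i, :k] for i, gt in enumerate(gt_index)]))

def median_rank(rankings, gt_index):
    """Median 1-based rank of the ground-truth video over all queries."""
    ranks = [int(np.where(rankings[i] == gt)[0][0]) + 1
             for i, gt in enumerate(gt_index)]
    return float(np.median(ranks))
```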
Summary
Introduction:
The authors note that, from as far back as the crib, we perceive the world through multiple senses: we watch the flames dancing in the fireplace, hear the crackling wood, and feel the heat it gives off.
- Through this synchronous multimodal perception, we learn to draw useful connections between modalities [66], which in turn enables us to form good representations of the world.
- The authors seek to learn a multimodal versatile network, defined as a network that has the following four properties: (i) it should be able to take as input any of the three modalities; (ii) it should respect the specificity of modalities, in particular the fact that audio and vision are much more fine-grained than language; (iii) it should enable the different modalities to be compared even when they are never seen together during training; and (iv) it should be efficiently applicable to visual data coming in the form of dynamic videos or static images
Objectives:
The authors' objective is to learn representations from such multimodal experience in a self-supervised manner, without resorting to any specific manual annotation.
- Their goal is to learn a model that has the versatile properties described in Section 1.
- Recall the goal is to be able to embed different modalities into a vector space where semantic comparisons can be made by simple dot products.
- This is because the goal is not to learn ASR but instead to associate a word, e.g. “car”, with the sound associated with that entity, e.g. the sound produced by the engine
Methods:
Table 3 compares MMV S3D-G under naive deflation, deflation and input-inflation (trained on AS+HT) with supervised TSM, MMV TSM, and SimCLR [11] ResNet50x4 trained on ImageNet, evaluated on PASCAL and ImageNet.
- Table 4 reports the retrieval metrics R@1↑, R@5↑, R@10↑ and MedR↓.
- Linear on PASCAL/ImageNet: the authors evaluate the deflated networks using a linear classifier on PASCAL and ImageNet.
- For TSM, the authors run the image through the backbone network without any channel shift.
- The authors use both train and validation sets as training data.
- The authors resize the images so that the shorter side has a fixed length.
Results:
Table 2 shows that the visual and audio representations match or outperform the state-of-the-art on all downstream tasks and evaluation modes.
- Table 3 shows that the deflated networks perform almost as well as the original video model applied to input-inflated 32-frame static videos.
- The state-of-the-art self-supervised model trained on images (SimCLR [11]) outperforms MMV because it does not have to bridge the video-image domain gap and was trained directly on ImageNet images; the performance difference is much smaller on PASCAL.
- The authors' approach is significantly better than pre-training in a fully supervised manner on Kinetics-700 [9]
Conclusion:
In this paper the authors have explored how to train versatile networks for vision, audio and language in a self-supervised manner.
- Their network can be used for zero-shot text-to-video retrieval.
- Their deflation process shows how to train on videos and obtain representations for still images.
- Given the sheer number of videos available for self-supervised training on the web, the authors believe this is a more natural route to transfer, one they hope will be pursued further in the future
Tables
- Table1: Design explorations for multiple modalities (HT=HowTo100M, AS=AudioSet). The video networks use non-linear projection heads
- Table2: Comparison of learnt representations versus the state-of-the-art. Results are averaged over all splits. The “Mod.” column shows which combinations of modalities are used by the methods, possibilities: Vision, Audio, Text, Flow. Training dataset abbreviations: AudioSet, HowTo100M, Instagram65M [20], SoundNet [6], 2M videos from YouTube8M [1]; their length in years is given in the “years” column. †[64] uses a non-linear classifier
- Table3: Image classification results on PASCAL and ImageNet. “V)I” denotes the image handling strategy for the video networks: naive deflation (no training of γ and β), deflation (proposed), and input-inflation (video net ingesting 32-frame static videos)
- Table4: Additional retrieval metrics for zero-shot text-to-video retrieval
- Table5: Parameters for fine-tuning (FT) on downstream classification tasks
- Table6: Effects of varying the visual backbone. All experiments use linear projection heads. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip
- Table7: NCE vs logistic loss for Vision+Audio. All experiments use linear projection heads and the S3D-G network as the video backbone. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip
- Table8: Effects of varying the projection heads. All experiments use the S3D-G network as the video backbone. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip. Best number is in bold. Second best is underlined
- Table9: Effects of data augmentation for Vision+Audio. All experiments use linear projection heads and the S3D-G network as the video backbone. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip
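For the vision-audio ablation of Table 7, a hedged sketch of the binary logistic alternative to NCE is shown below (contrast with the NCE sketch given after the Highlights): matching pairs in the batch are labelled positive, mismatched pairs negative, and a per-pair sigmoid cross-entropy is averaged. Negative sampling and class weighting are simplifications of whatever the authors used.

```python
import torch
import torch.nn.functional as F

def logistic_va_loss(video_emb, audio_emb):
    """Binary logistic objective for vision-audio: matching (video, audio)
    pairs get label 1, all mismatched pairs in the batch get label 0."""
    scores = video_emb @ audio_emb.t()                          # (B, B)
    labels = torch.eye(scores.size(0), device=scores.device)
    return F.binary_cross_entropy_with_logits(scores, labels)
```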
Related work
- Self-supervised learning from single modality. Self-supervised methods design pretext tasks that require no manual annotation but facilitate learning of useful representations of the data. A variety of pretext tasks have been developed for vision (i.e. single modality), such as predicting the relative position of patches [13, 49], colorization [83], predicting orientation [21] or invariance to transformation [15, 29]. In videos, works have also leveraged the temporal dimension [17, 37, 46, 79]. Recently, methods that maximise the similarity between multiple views (augmented versions) of the same image via contrastive losses [8, 11, 26, 27, 28, 50] stand out due to impressive results on the ImageNet benchmark; we draw inspiration from them (e.g. use a contrastive loss and nonlinear projection heads [11]). However, details of view generation are crucial and require careful design [71]. In contrast, we argue that using multiple modalities as different views is simpler and more natural [70].
Reference
- S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016. 8
- J.-B. Alayrac, P. Bojanowski, N. Agrawal, I. Laptev, J. Sivic, and S. Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016. 2
- H. Alwassel, D. Mahajan, L. Torresani, B. Ghanem, and D. Tran. Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667, 2019. 2, 6, 8
- R. Arandjelović and A. Zisserman. Look, listen and learn. In ICCV, 2017. 2, 4, 6, 16
- R. Arandjelović and A. Zisserman. Objects that sound. In ECCV, 2018. 2
- Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In NIPS, 2016. 2, 8
- Y. Aytar, C. Vondrick, and A. Torralba. See, hear, and read: Deep aligned representations. arXiv preprint arXiv:1706.00932, 2017. 2
- P. Bachman, R. D. Hjelm, and W. Buchwalter. Learning representations by maximizing mutual information across views. In NeurIPS, 2019. 2
- J. Carreira, E. Noland, C. Hillier, and A. Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019
- J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017. 3, 5
- T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020. 2, 6, 7, 9, 14, 17
- M. Chowdhury, P. Rameswar, E. Papalexakis, and A. Roy-Chowdhury. Webly supervised joint embedding for cross-modal image-text retrieval. In ACM MM, 2018. 2
- C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015. 2
- J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang. Dual encoding for zero-example video retrieval. In CVPR, 2019. 2, 3
- A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, 2014. 2
- M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010. 6, 14
- B. Fernando, H. Bilen, E. Gavves, and S. Gould. Self-supervised video representation learning with odd-one-out networks. In CVPR, 2017. 2
- A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In NIPS, 2013. 2
- J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017. 6
- D. Ghadiyaram, D. Tran, and D. Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019. 8
- S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018. 2
- R. Girdhar, D. Tran, L. Torresani, and D. Ramanan. Distinit: Learning video representations without a single labeled video. In ICCV, 2019. 3
- Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 2014. 2
- Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014. 2
- D. Harwath, A. Recasens, D. Surís, G. Chuang, A. Torralba, and J. Glass. Jointly discovering visual objects and spoken words from raw sensory input. IJCV, pages 1–22, 2019. 2
- K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. 2, 6
- O. J. Hénaff, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019. 2, 4
- R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018. 2
- L. Jing and Y. Tian. Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387, 2018. 2
- J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2019. 2
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 8, 14
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 6
- B. Klein, G. Lev, G. Sadeh, and L. Wolf. Associating neural word embeddings with deep image representations using Fisher vectors. In CVPR, 2015. 2
- A. Kolesnikov, X. Zhai, and L. Beyer. Revisiting self-supervised visual representation learning. In CVPR, 2019. 5
- B. Korbar, D. Tran, and L. Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018. 2, 6, 8, 16
- H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011. 6
- H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang. Unsupervised representation learning by sorting sequences. In ICCV, 2017. 2
- J. Lin, C. Gan, and S. Han. TSM: Temporal shift module for efficient video understanding. In ICCV, 2019.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014. 2
- I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017. 6
- J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich, and K. Murphy. What’s cookin’? Interpreting cooking videos using text, speech and vision. NAACL, 2015. 2
- A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. In CVPR, 2020. 2, 3, 4, 5, 6, 7, 8, 14, 16
- A. Miech, I. Laptev, and J. Sivic. Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv preprint arXiv:1804.02516, 2018. 2
- A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic. Howto100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019. 3, 6, 8
- T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. 6
- I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016. 2
- N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In ICMR. ACM, 2018. 2
- P. Morgado, N. Vasconcelos, and I. Misra. Audio-visual instance discrimination with cross-modal agreement. arXiv preprint arXiv:2004.12943, 2020. 8
- M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016. 2
- A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 2, 4
- A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In ECCV, 2018. 2
- A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In ECCV, 2016. 2
- Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, 2016. 2
- D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le. SpecAugment: A simple data augmentation method for automatic speech recognition. In InterSpeech, 2019. 7, 17
- M. Patrick, Y. M. Asano, R. Fong, J. F. Henriques, G. Zweig, and A. Vedaldi. Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298, 2020. 6, 7, 8, 17
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. 13
- K. J. Piczak. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, 2015. 6
- A. Piergiovanni, A. Angelova, and M. S. Ryoo. Evolving losses for unsupervised video representation learning. In CVPR, 2020. 2, 8
- B. A. Plummer, M. Brown, and S. Lazebnik. Enhancing video summarization via vision-language embedding. In CVPR, 2017. 2
- B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015. 2
- A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele. Movie description. IJCV, 2017. 2
- A. Rouditchenko, A. Boggust, D. Harwath, D. Joshi, S. Thomas, K. Audhkhasi, R. Feris, B. Kingsbury, M. Picheny, A. Torralba, et al. AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. arXiv preprint arXiv:2006.09199, 2020. 2
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015. 6
- H. B. Sailor, D. M. Agrawal, and H. A. Patil. Unsupervised filterbank learning using convolutional restricted boltzmann machine for environmental sound classification. In InterSpeech, 2017. 8
- O. Sener, A. R. Zamir, S. Savarese, and A. Saxena. Unsupervised semantic parsing of video collections. In ICCV, December 2015. 2
- L. Smith and M. Gasser. The development of embodied cognition: Six lessons from babies. Artificial life, 2005. 1
- K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 6
- C. Sun, F. Baradel, K. Murphy, and C. Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019. 2
- C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, 2019. 2
- Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019. 2
- Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020. 2
- L. Wang, Y. Li, J. Huang, and S. Lazebnik. Learning two-branch neural networks for image-text matching tasks. PAMI, 2018. 2
- L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016. 2
- J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, 2011. 2
- M. Wray, D. Larlus, G. Csurka, and D. Damen. Fine-grained action retrieval through multiple parts-of-speech embeddings. In ICCV, 2019. 2
- C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Sampling matters in deep embedding learning. ICCV, 2017. 2
- Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018. 6
- S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018. 5, 6, 8, 16
- D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 2019. 2
- J. Xu, T. Mei, T. Yao, and Y. Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016. 2, 6
- R. Xu, C. Xiong, W. Chen, and J. J. Corso. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In AAAI, 2015. 2
- S.-I. Yu, L. Jiang, and A. Hauptmann. Instructional videos for unsupervised harvesting and learning of action examples. In ACM MM, 2014. 2
- R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016. 2
- L. Zhou, C. Xu, and J. J. Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018. 2, 6