Exploring Relations in Untrimmed Videos for Self-Supervised Learning

ACM Transactions on Multimedia Computing, Communications, and Applications (2022)

Abstract
Existing video self-supervised learning methods mainly rely on trimmed videos for model training. They are applied to, and validated on, trimmed video datasets such as UCF101 and Kinetics-400. However, trimmed datasets are manually annotated from untrimmed videos, so these methods are not truly unsupervised. In this article, we propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV), which can be applied directly to untrimmed (truly unlabeled) videos to learn spatio-temporal features. ERUV first generates single-shot videos by shot change detection. Several designed sampling strategies are then used to model relations between video clips, and these strategies serve as our self-supervision signals. Finally, the network learns representations by predicting the category of relation between the video clips. ERUV is thus able to compare the differences and similarities of video clips, which is also an essential procedure for video-related tasks. We validate our learned models on action recognition, video retrieval, and action similarity labeling tasks with four kinds of 3D convolutional neural networks. Experimental results show that ERUV is able to learn richer representations from untrimmed videos, and that it outperforms state-of-the-art self-supervised methods by significant margins.
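The pipeline described in the abstract (shot change detection, relation-based clip sampling, relation labels as the supervision signal) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not enumerate the relation categories or the shot detector, so the three categories (`SAME_SHOT`, `DIFF_SHOT`, `DIFF_VIDEO`), the threshold-based `detect_shots`, and all function names here are hypothetical stand-ins chosen only to show the idea.

```python
import random

# Hypothetical relation categories (assumption): the abstract does not list
# the actual ones used by ERUV.
SAME_SHOT, DIFF_SHOT, DIFF_VIDEO = 0, 1, 2


def detect_shots(frame_signatures, threshold):
    """Toy shot change detector (assumption, not the paper's detector).

    frame_signatures holds one number per frame, e.g. mean intensity.
    A new shot starts wherever consecutive signatures differ by more
    than threshold. Returns (start, end) frame ranges, end exclusive.
    """
    shots, start = [], 0
    for i in range(1, len(frame_signatures)):
        if abs(frame_signatures[i] - frame_signatures[i - 1]) > threshold:
            shots.append((start, i))
            start = i
    shots.append((start, len(frame_signatures)))
    return shots


def sample_clip(shot, clip_len, rng):
    """Pick a clip of clip_len frames lying entirely inside one shot."""
    start, end = shot
    first = rng.randrange(start, end - clip_len + 1)
    return (first, first + clip_len)


def sample_pair(videos_shots, relation, clip_len, rng):
    """Sample two clips whose relation matches the requested category.

    videos_shots maps a video id to its list of shots. The returned
    relation is the label a network would be trained to predict.
    """
    # Keep only shots long enough to hold a full clip.
    usable = {}
    for video, shots in videos_shots.items():
        long_enough = [s for s in shots if s[1] - s[0] >= clip_len]
        if long_enough:
            usable[video] = long_enough

    if relation == SAME_SHOT:
        video = rng.choice(sorted(usable))
        shot = rng.choice(usable[video])
        clip_a = sample_clip(shot, clip_len, rng)
        clip_b = sample_clip(shot, clip_len, rng)
        return (video, clip_a), (video, clip_b), relation
    if relation == DIFF_SHOT:
        candidates = sorted(v for v, s in usable.items() if len(s) >= 2)
        video = rng.choice(candidates)
        shot_a, shot_b = rng.sample(usable[video], 2)
        clip_a = sample_clip(shot_a, clip_len, rng)
        clip_b = sample_clip(shot_b, clip_len, rng)
        return (video, clip_a), (video, clip_b), relation
    if relation == DIFF_VIDEO:
        video_a, video_b = rng.sample(sorted(usable), 2)
        clip_a = sample_clip(rng.choice(usable[video_a]), clip_len, rng)
        clip_b = sample_clip(rng.choice(usable[video_b]), clip_len, rng)
        return (video_a, clip_a), (video_b, clip_b), relation
    raise ValueError(f"unknown relation: {relation}")
```

In a training loop under these assumptions, each sampled pair would be fed to a 3D convolutional network and the relation label used as the classification target, so no human annotation is needed at any step.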
Keywords
Self-supervised learning, action recognition, action detection