Explore Video Clip Order With Self-Supervised and Curriculum Learning for Video Applications

Jun Xiao, Lin Li, Dejing Xu, Chengjiang Long, Jian Shao, Shifeng Zhang, Shiliang Pu, Yueting Zhuang

IEEE TRANSACTIONS ON MULTIMEDIA (2021)

Abstract

We present a self-supervised spatiotemporal learning approach that exploits the temporal coherence of videos. The chronological order of shuffled clips from a video serves as the supervisory signal that guides 3D Convolutional Neural Networks (CNNs) to learn meaningful visual knowledge. Unlike existing approaches, which use frames, we use dynamic video clips to reduce the uncertainty of the order. We test three representative types of 3D CNNs, all of which benefit from the proposed approach. The learned 3D CNNs can be used either as feature extractors or as pre-trained models for fine-tuning on downstream tasks. We also propose two curriculum learning strategies that make the 3D CNNs easier to train and that achieve state-of-the-art results in nearest neighbor retrieval and action recognition compared with other self-supervised learning methods. The approach is further extended to visual question answering, where it achieves promising results. Finally, we provide extensive experimental results and analyses to help readers better understand how video clip order can be explored with self-supervised and curriculum learning for video applications.
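The pretext task described in the abstract can be illustrated with a minimal sketch of how one training sample might be built: pick a few non-overlapping clips from a video, shuffle them, and use the index of the applied permutation as the label. Everything specific here is an assumption for illustration and is not taken from the abstract: the function name make_clip_order_sample, the defaults of 3 clips, 16 frames per clip and an 8-frame gap, and the choice to treat order prediction as classification over all permutations. Frames are represented by their indices only, so no video library is needed.

```python
import itertools
import random


def make_clip_order_sample(num_frames, num_clips=3, clip_len=16, interval=8, rng=None):
    """Build one hypothetical clip-order sample from a video of num_frames frames.

    Returns (shuffled_clips, label, perms), where each clip is a list of frame
    indices and label is the index into perms of the permutation that was applied.
    All default values are illustrative assumptions.
    """
    rng = rng or random.Random()

    # Total number of frames covered by the clips plus the gaps between them.
    span = num_clips * clip_len + (num_clips - 1) * interval
    if num_frames < span:
        raise ValueError("video too short for the requested clips")

    # Pick a random starting frame, then lay out the clips in chronological order.
    start = rng.randrange(num_frames - span + 1)
    clips = []
    for i in range(num_clips):
        first = start + i * (clip_len + interval)
        clips.append(list(range(first, first + clip_len)))

    # Every possible ordering of the clips is one class of the pretext task.
    perms = list(itertools.permutations(range(num_clips)))
    label = rng.randrange(len(perms))

    # Position j of the shuffled sequence holds chronological clip perms[label][j].
    shuffled = [clips[i] for i in perms[label]]
    return shuffled, label, perms


if __name__ == "__main__":
    shuffled, label, perms = make_clip_order_sample(100, rng=random.Random(0))
    print("permutation:", perms[label])
    print("first frame of each shuffled clip:", [clip[0] for clip in shuffled])
```

In a full pipeline each shuffled clip would be passed through a shared 3D CNN and the combined features classified into one of the permutations. Under this sketch, keeping a gap between clips is one way to stop a network from solving the task by matching frames at clip boundaries.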

Keywords

Task analysis, Three-dimensional displays, Feature extraction, Two dimensional displays, Training, Convolution, Knowledge discovery, Action recognition, curriculum learning, nearest neighbor retrieval, self-supervised learning, video question answering