Disentangling Motion, Foreground and Background Features in Videos

arXiv: Computer Vision and Pattern Recognition (2017)

Abstract
This paper introduces an unsupervised framework to extract semantically rich features for video representation. Inspired by how the human visual system groups objects based on motion cues, we propose a deep convolutional neural network that disentangles motion, foreground and background information. The proposed architecture consists of a 3D convolutional feature encoder for blocks of 16 frames, which is trained for reconstruction tasks over the first and last frames of the sequence. The model is trained with a fraction of videos from the UCF-101 dataset, taking as ground truth the bounding boxes around the activity regions. Qualitative results indicate that the network can successfully update the foreground appearance based on pure-motion features. The benefits of these learned features are shown in a discriminative classification task when compared with a random initialization of the network weights, providing an accuracy gain above 10%.
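The abstract describes a 3D convolutional encoder over 16-frame blocks whose latent code is partitioned into motion, foreground and background components. The following is a minimal NumPy sketch of that idea only; the layer shapes, the equal three-way split of the code, and all function names are illustrative assumptions, not the paper's actual architecture or training setup.

```python
import numpy as np

def conv3d(x, w, stride=2):
    """Valid (no padding) strided 3D convolution followed by ReLU.
    x: (C_in, T, H, W) video block; w: (C_out, C_in, kt, kh, kw) filters."""
    Cin, T, H, W = x.shape
    Cout, _, kt, kh, kw = w.shape
    To = (T - kt) // stride + 1
    Ho = (H - kh) // stride + 1
    Wo = (W - kw) // stride + 1
    out = np.zeros((Cout, To, Ho, Wo))
    for t in range(To):
        for i in range(Ho):
            for j in range(Wo):
                patch = x[:, t*stride:t*stride+kt,
                             i*stride:i*stride+kh,
                             j*stride:j*stride+kw]
                # Contract filters against the patch over (C_in, kt, kh, kw)
                out[:, t, i, j] = np.tensordot(
                    w, patch, axes=([1, 2, 3, 4], [0, 1, 2, 3]))
    return np.maximum(out, 0.0)  # ReLU

def encode_and_split(clip, weights):
    """Encode a clip with stacked 3D convs, then split the flattened code
    into three equal parts (assumed motion / foreground / background)."""
    feats = clip
    for w in weights:
        feats = conv3d(feats, w)
    feats = feats.reshape(-1)
    k = feats.size // 3  # assumption: equal-sized disentangled parts
    return feats[:k], feats[k:2*k], feats[2*k:3*k]

# Demo on a random 16-frame RGB block (sizes chosen for illustration)
rng = np.random.default_rng(0)
clip = rng.normal(size=(3, 16, 32, 32))
weights = [rng.normal(size=(8, 3, 3, 3, 3)) * 0.1,
           rng.normal(size=(16, 8, 3, 3, 3)) * 0.1]
motion, fg, bg = encode_and_split(clip, weights)
print(motion.shape, fg.shape, bg.shape)  # three equal-length code vectors
```

In the paper, the motion component alone drives an update of the foreground appearance during reconstruction of the first and last frames; here the split merely illustrates the partitioned latent code.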