Learning Spatiotemporal Features with 3D Convolutional Networks.

International Conference on Computer Vision (ICCV), pp. 4489-4497, 2015

Cited by: 5246

Abstract

We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.
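For a concrete picture of the homogeneous 3x3x3 design mentioned in the abstract, the sketch below is a minimal PyTorch reconstruction of a C3D-like network. The layer widths (64-128-256-256-512-512-512-512), the 1x2x2 first pooling, the 16x112x112 input clips, and the 487 Sports-1M classes follow the paper's description; the padding of the final pooling layer and other implementation details are assumptions typical of public re-implementations, not the authors' released Caffe model.

```python
import torch
import torch.nn as nn

class C3DSketch(nn.Module):
    """C3D-style 3D ConvNet: every convolution is 3x3x3 with stride 1, padding 1."""

    def __init__(self, num_classes=487):  # 487 Sports-1M classes
        super().__init__()

        def conv(cin, cout):
            return nn.Sequential(nn.Conv3d(cin, cout, kernel_size=3, padding=1),
                                 nn.ReLU(inplace=True))

        self.features = nn.Sequential(
            conv(3, 64), nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),     # pool1 keeps temporal length
            conv(64, 128), nn.MaxPool3d(2, stride=2),                   # pool2
            conv(128, 256), conv(256, 256), nn.MaxPool3d(2, stride=2),  # pool3
            conv(256, 512), conv(512, 512), nn.MaxPool3d(2, stride=2),  # pool4
            conv(512, 512), conv(512, 512),
            nn.MaxPool3d(2, stride=2, padding=(0, 1, 1)),               # pool5 (padding assumed)
        )
        self.fc6 = nn.Linear(512 * 1 * 4 * 4, 4096)  # fc6 is used as the C3D video feature
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, num_classes)
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout(0.5)

    def forward(self, x):                 # x: (batch, 3, 16, 112, 112)
        x = self.features(x).flatten(1)
        fc6 = self.relu(self.fc6(x))      # fc6 activations -> C3D feature
        x = self.drop(self.relu(self.fc7(self.drop(fc6))))
        return self.fc8(x), fc6


if __name__ == "__main__":
    net = C3DSketch()
    logits, feat = net(torch.randn(2, 3, 16, 112, 112))
    print(logits.shape, feat.shape)       # torch.Size([2, 487]) torch.Size([2, 4096])
```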

Introduction
  • Multimedia on the Internet is growing rapidly, resulting in an increasing number of videos being shared every minute.
  • There is still a growing need for a generic video descriptor that helps in solving large-scale video tasks in a homogeneous way.
  • Internet videos can be of landscapes, natural scenes, sports, TV shows, movies, pets, food, and so on. The descriptor needs to be compact: since the authors work with millions of videos, a compact descriptor makes processing, storing, and retrieving tasks much more scalable. It also needs to be efficient to compute, as thousands of videos are expected to be processed every minute in real-world systems, and it must be simple to implement.
  • Instead of relying on complicated feature encoding methods and classifiers, a good descriptor should work well even with a simple model such as a linear classifier.
Highlights
  • Multimedia on the Internet is growing rapidly, resulting in an increasing number of videos being shared every minute
  • Table 3 presents action recognition accuracy of C3D compared with the two baselines and current best methods
  • C3D combined with improved Dense Trajectories further improves the accuracy to 90.4%, whereas combining it with Imagenet yields only a 0.6% improvement
  • In this work we try to address the problem of learning spatiotemporal features for videos using 3D convolutional networks (3D ConvNets) trained on large-scale video datasets
  • We showed that C3D can model appearance and motion information simultaneously and outperforms the 2D convolutional network features on various video analysis tasks
  • We demonstrated that C3D features with a linear classifier can outperform or approach current best methods on different video analysis benchmarks
Methods
  • To extract C3D features, a video is split into 16-frame-long clips with an 8-frame overlap between two consecutive clips.
  • These clips are passed to the C3D network to extract fc6 activations.
  • These clip fc6 activations are averaged to form a 4096-dim video descriptor, which is then L2-normalized (a minimal sketch of this pipeline appears after this list)
  • The authors refer to this representation as C3D video descriptor/feature in all experiments, unless the authors clearly specify the difference.
  • Current best methods compared against include the spatial stream network [36] and LRCN [6]
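To make the extraction recipe above concrete, the following is a minimal sketch of the clip splitting, fc6 averaging, and L2-normalization steps. The function c3d_fc6 standing in for a forward pass of a trained C3D network is a hypothetical placeholder; only the windowing and pooling logic follows the description in this section.

```python
import numpy as np

def c3d_fc6(clip):
    """Hypothetical placeholder for a forward pass of a trained C3D network;
    returns the 4096-dim fc6 activation for one 16-frame clip."""
    return np.random.randn(4096).astype(np.float32)

def c3d_video_descriptor(frames, clip_len=16, stride=8):
    """Split a video into 16-frame clips with an 8-frame overlap, average the
    per-clip fc6 activations, and L2-normalize the averaged vector."""
    # frames: array of shape (num_frames, height, width, 3)
    starts = range(0, len(frames) - clip_len + 1, stride)
    clips = [frames[s:s + clip_len] for s in starts]
    fc6 = np.stack([c3d_fc6(c) for c in clips])    # (num_clips, 4096)
    desc = fc6.mean(axis=0)                        # average over clips
    return desc / (np.linalg.norm(desc) + 1e-10)   # L2-normalization

if __name__ == "__main__":
    video = np.zeros((64, 112, 112, 3), dtype=np.float32)  # dummy 64-frame video
    print(c3d_video_descriptor(video).shape)                # -> (4096,)
```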
Results
  • Table 3 presents action recognition accuracy of C3D compared with the two baselines and current best methods.
  • The Imagenet baseline performs reasonably well, just 1.2% below the state-of-the-art method [45], but it is 10.8% worse than C3D due to its lack of motion modeling.
  • C3D obtains 22.3% accuracy and outperforms [32] by 10.3% using only a linear SVM (a generic classifier sketch follows this list), whereas the compared method uses an RBF kernel on strong SIFT-RANSAC feature matching.
  • Since C3D is trained only on Sports-1M videos without any fine-tuning, while Imagenet is fully trained on 1000 object categories, the authors did not expect C3D to perform as well on this task.
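The "simple features with linear SVM" protocol referenced above amounts to training a linear classifier directly on the video descriptors. Below is a generic scikit-learn sketch; the random arrays stand in for pre-extracted C3D descriptors and labels, and the C value is an arbitrary illustration rather than a setting reported in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# X_*: L2-normalized 4096-dim C3D video descriptors; y_*: action labels.
# Random arrays stand in for descriptors that would be loaded from disk.
rng = np.random.default_rng(0)
X_train, y_train = rng.standard_normal((200, 4096)), rng.integers(0, 101, 200)
X_test, y_test = rng.standard_normal((50, 4096)), rng.integers(0, 101, 50)

clf = LinearSVC(C=1.0)  # C chosen arbitrarily for illustration
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```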
Conclusion
  • In this work the authors try to address the problem of learning spatiotemporal features for videos using 3D ConvNets trained on large-scale video datasets.
  • The authors showed that C3D can model appearance and motion information simultaneously and outperforms the 2D ConvNet features on various video analysis tasks.
  • The authors demonstrated that C3D features with a linear classifier can outperform or approach current best methods on different video analysis benchmarks.
  • C3D source code and pre-trained model are available at http://vlg.cs.dartmouth.edu/c3d
  • The proposed C3D features are efficient, compact, and extremely simple to use.
Tables
  • Table1: C3D compared to best published results. C3D outperforms all previous best reported methods on a range of benchmarks except for Sports-1M and UCF101. On UCF101, we report accuracy for two groups of methods. The first set of methods use only RGB frame inputs while the second set of methods (in parentheses) use all possible features (e.g. optical flow, improved Dense Trajectory)
  • Table2: Sports-1M classification result. C3D outperforms [18] by 5% on top-5 video-level accuracy. (*) We note that the method of [29] uses long clips, thus its clip-level accuracy is not directly comparable to that of C3D and DeepVideo
  • Table3: Action recognition results on UCF101. C3D compared with baselines and current state-of-the-art methods. Top: simple features with linear SVM; Middle: methods taking only RGB frames as inputs; Bottom: methods using multiple feature combinations
  • Table4: Action similarity labeling result on ASLAN. C3D significantly outperforms current state-of-the-art methods
  • Table5: Scene recognition accuracy. C3D using a simple linear SVM outperforms current methods on Maryland and YUPENN
  • Table6: Runtime analysis on UCF101. C3D is 91x faster than improved dense trajectories [44] and 274x faster than Brox's GPU implementation of optical flow
Related work
  • Videos have been studied by the computer vision community for decades. Over the years, various problems like action recognition [26], anomaly detection [2], video retrieval [1], and event and action detection [30, 17], among many others, have been proposed. A considerable portion of these works are about video representations. Laptev and Lindeberg [26] proposed spatio-temporal interest points (STIPs) by extending Harris corner detectors to 3D. SIFT and HOG were also extended into SIFT-3D [34] and HOG3D [19] for action recognition. Dollár et al. proposed Cuboids features for behavior recognition [5]. Sadanand and Corso built ActionBank for action recognition [33]. Recently, Wang et al. proposed improved Dense Trajectories (iDT) [44], which is currently the state-of-the-art hand-crafted feature. The iDT descriptor is an interesting example showing that temporal signals can be handled differently from spatial signals: instead of extending the Harris corner detector into 3D, it starts with densely sampled feature points in video frames and uses optical flow to track them. For each tracked point, different hand-crafted features are extracted along the trajectory. Despite its good performance, this method is computationally intensive and becomes intractable on large-scale datasets.
Reference
  • M. Bendersky, L. Garcia-Pueyo, J. Harmsen, V. Josifovski, and D. Lepikhin. Up next: retrieval methods for large scale related video suggestion. In ACM SIGKDD, pages 1769–1778, 2014.
  • O. Boiman and M. Irani. Detecting irregularities in images and in video. IJCV, 2007.
  • T. Brox and J. Malik. Large displacement optical flow: Descriptor matching in variational motion estimation. IEEE TPAMI, 33(3):500–513, 2011.
  • K. Derpanis, M. Lecce, K. Daniilidis, and R. Wildes. Dynamic scene understanding: The role of orientation features in space and time in scene classification. In CVPR, 2012.
  • P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In Proc. ICCV VS-PETS, 2005.
  • J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389, 2014.
  • J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2013.
  • C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spacetime forests with complementary features for dynamic scene recognition. In BMVC, 2013.
  • C. Feichtenhofer, A. Pinz, and R. P. Wildes. Bags of spacetime energies for dynamic scene recognition. In CVPR, 2014.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.
  • Y. Hanani, N. Levy, and L. Wolf. Evaluating new variants of motion interchange patterns. In CVPR workshop, 2013.
  • A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. In ICLR, 2014.
  • A. Jain, J. Tompson, Y. LeCun, and C. Bregler. MoDeep: A deep learning framework using motion features for human pose estimation. In ACCV, 2014.
  • V. Jain, B. Bollmann, M. Richardson, D. Berger, M. Helmstaedter, K. Briggman, W. Denk, J. Bowden, J. Mendenhall, W. Abraham, K. Harris, N. Kasthuri, K. Hayworth, R. Schalek, J. Tapia, J. Lichtman, and H. Seung. Boundary learning by optimization with topological constraints. In CVPR, 2010.
  • S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE TPAMI, 35(1):221–231, 2013.
  • Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  • Y. Jiang, J. Liu, A. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes, 2014.
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • A. Kläser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008.
  • O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, 2012.
  • O. Kliper-Gross, T. Hassner, and L. Wolf. The action similarity labeling challenge. TPAMI, 2012.
  • O. Kliper-Gross, T. Hassner, and L. Wolf. The one shot similarity metric learning for action recognition. In Workshop on SIMBAD, 2011.
  • K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
  • A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond gaussian pyramid: Multi-skip feature stacking for action recognition. CoRR, abs/1411.6660, 2014.
  • I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, 2003.
  • Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
  • Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time-series. Brain Theory and Neural Networks, 1995.
  • J. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders, W. Kraaij, A. Smeaton, and G. Quenot. TRECVID'14 – an overview of the goals, tasks, data, evaluation and metrics. In TRECVID, 2014.
  • X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. CoRR, abs/1405.4506, 2014.
  • X. Ren and M. Philipose. Egocentric recognition of handled objects: Benchmark and analysis. In Egocentric Vision workshop, 2009.
  • S. Sadanand and J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012.
  • P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In ACM MM, 2007.
  • N. Shroff, P. K. Turaga, and R. Chellappa. Moving vistas: Exploiting motion for describing scenes. In CVPR, 2010.
  • K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01, 2012.
  • N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
  • G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV, pages 140–153. Springer, 2010.
  • C. Theriault, N. Thome, and M. Cord. Dynamic scene classification: Learning motion descriptors with slow features analysis. In CVPR, 2013.
  • S. Turaga, J. Murray, V. Jain, F. Roth, M. Helmstaedter, K. Briggman, W. Denk, and S. Seung. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation, 2010.
  • L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.
  • H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
  • X. Peng, Y. Qiao, Q. Peng, and Q. Wang. Large margin dimensionality reduction for action similarity labeling. IEEE Signal Processing Letters, 2014.
  • M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
  • N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In CVPR, 2014.
  • B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.