Learning Spatiotemporal Features with 3D Convolutional Networks.
International Conference on Computer Vision (ICCV), pp. 4489–4497, 2015
- Multimedia on the Internet is growing rapidly, resulting in an increasing number of videos being shared every minute.
- There is still a growing need for a generic video descriptor that helps in solving large-scale video tasks in a homogeneous way.
- Internet videos can show landscapes, natural scenes, sports, TV shows, movies, pets, food, and so on. A good video descriptor therefore needs to be compact, since the authors work with millions of videos and a compact descriptor makes processing, storing, and retrieval far more scalable; efficient to compute, as real-world systems are expected to process thousands of videos every minute; and simple to implement.
- Instead of requiring complicated feature encoding methods and classifiers, a good descriptor should work well even with a simple model.
- Table 3 presents action recognition accuracy of C3D compared with the two baselines and current best methods
- Combining C3D with improved Dense Trajectories further improves accuracy to 90.4%, whereas combining it with Imagenet features yields only a 0.6% improvement.
- In this work we address the problem of learning spatiotemporal features for videos using 3D convolutional networks (3D ConvNets) trained on large-scale video datasets.
- We showed that C3D can model appearance and motion information simultaneously and outperforms the 2D convolutional network features on various video analysis tasks
- We demonstrated that C3D features with a linear classifier can outperform or approach current best methods on different video analysis benchmarks
- To extract C3D features, a video is split into 16-frame-long clips with an 8-frame overlap between consecutive clips.
- These clips are passed to the C3D network to extract fc6 activations.
- The clip fc6 activations are averaged to form a 4096-dim video descriptor, which is then L2-normalized.
- The authors refer to this representation as the C3D video descriptor/feature in all experiments, unless otherwise specified.
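The extraction pipeline above (16-frame clips, 8-frame overlap, averaged fc6 activations, L2 normalization) can be sketched as follows. Here `extract_fc6` is a hypothetical stand-in for a forward pass through the C3D network that returns the 4096-dim fc6 activations of one clip:

```python
import numpy as np

CLIP_LEN = 16  # frames per clip, as described above
STRIDE = 8     # step of 8 frames gives an 8-frame overlap between clips

def video_to_descriptor(frames, extract_fc6):
    """Turn a sequence of video frames into a single 4096-dim C3D descriptor.

    `extract_fc6` is a placeholder for the C3D forward pass that maps one
    16-frame clip to its fc6 activation vector.
    """
    # Split the video into 16-frame clips with an 8-frame overlap.
    clips = [frames[i:i + CLIP_LEN]
             for i in range(0, len(frames) - CLIP_LEN + 1, STRIDE)]
    # Extract fc6 activations per clip and average them over the video.
    feats = np.stack([extract_fc6(c) for c in clips])
    video_feat = feats.mean(axis=0)
    # L2-normalize the averaged descriptor.
    return video_feat / np.linalg.norm(video_feat)
```

This mirrors the description above at the array level only; the actual network weights and preprocessing come from the released C3D model.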
- The Imagenet baseline performs reasonably well, just 1.2% below the state-of-the-art method, but 10.8% worse than C3D due to its lack of motion modeling.
- C3D obtains 22.3% accuracy and outperforms the comparing method by 10.3% using only a linear SVM, whereas the comparing method uses an RBF kernel on strong SIFT-RANSAC feature matching.
- Since C3D is trained only on Sports-1M videos without any fine-tuning, while Imagenet is fully trained on 1000 object categories, the authors did not expect C3D to perform as well as it does on this task.
- C3D source code and pre-trained model are available at http://vlg.cs.dartmouth.edu/c3d
- The proposed C3D features are efficient, compact, and extremely simple to use.
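The "simple to use" claim amounts to feeding the single 4096-dim descriptor per video to an off-the-shelf linear classifier. A minimal sketch with scikit-learn's `LinearSVC`, using synthetic vectors in place of real pre-extracted C3D descriptors:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-ins for pre-extracted 4096-dim C3D video descriptors;
# in practice these come from averaging fc6 activations per video.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4096)).astype(np.float32)
y_train = rng.integers(0, 5, size=100)  # 5 hypothetical action classes

# A plain linear SVM on fixed features, with no network fine-tuning;
# C=1.0 is an illustrative default, not a value from the paper.
clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)
pred = clf.predict(X_train[:10])
```

The C3D results reported with a linear model follow this pattern: frozen features, a linear classifier, and essentially one regularization parameter to tune.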
- Table1: C3D compared to best published results. C3D outperforms all previous best reported methods on a range of benchmarks except for Sports-1M and UCF101. On UCF101, we report accuracy for two groups of methods. The first set of methods use only RGB frame inputs while the second set of methods (in parentheses) use all possible features (e.g. optical flow, improved Dense Trajectory)
- Table2: Sports-1M classification result. C3D outperforms [18] by 5% on top-5 video-level accuracy. (*) We note that the method of [29] uses long clips, thus its clip-level accuracy is not directly comparable to that of C3D and DeepVideo
- Table3: Action recognition results on UCF101. C3D compared with baselines and current state-of-the-art methods. Top: simple features with linear SVM; Middle: methods taking only RGB frames as input
- Table4: Action similarity labeling result on ASLAN. C3D significantly outperforms the current state-of-the-art method
- Table5: Scene recognition accuracy. C3D using a simple linear SVM outperforms current methods on Maryland and YUPENN
- Table6: Runtime analysis on UCF101. C3D is 91x faster than improved dense trajectories [44] and 274x faster than Brox's GPU-based optical flow
- Videos have been studied by the computer vision community for decades. Over the years, various problems like action recognition, anomaly detection, video retrieval, event and action detection [30, 17], and many more have been proposed. A considerable portion of these works concern video representations. Laptev and Lindeberg proposed spatio-temporal interest points (STIPs) by extending Harris corner detectors to 3D. SIFT and HOG were also extended into SIFT-3D and HOG3D for action recognition. Dollár et al. proposed Cuboids features for behavior recognition. Sadanand and Corso built ActionBank for action recognition. More recently, Wang et al. proposed improved Dense Trajectories (iDT), currently the state-of-the-art hand-crafted feature. The iDT descriptor is an interesting example showing that temporal signals can be handled differently from spatial signals: instead of extending the Harris corner detector into 3D, it starts with densely-sampled feature points in video frames and uses optical flow to track them. For each tracked point, different hand-crafted features are extracted along the trajectory. Despite its good performance, this method is computationally intensive and becomes intractable on large-scale datasets.
- M. Bendersky, L. Garcia-Pueyo, J. Harmsen, V. Josifovski, and D. Lepikhin. Up next: retrieval methods for large scale related video suggestion. In ACM SIGKDD, pages 1769–1778, 2014.
- O. Boiman and M. Irani. Detecting irregularities in images and in video. IJCV, 2007.
- T. Brox and J. Malik. Large displacement optical flow: Descriptor matching in variational motion estimation. IEEE TPAMI, 33(3):500–513, 2011.
- K. Derpanis, M. Lecce, K. Daniilidis, and R. Wildes. Dynamic scene understanding: The role of orientation features in space and time in scene classification. In CVPR, 2012.
- P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In Proc. ICCV VS-PETS, 2005.
- J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389, 2014.
- J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2013.
- C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spacetime forests with complementary features for dynamic scene recognition. In BMVC, 2013.
- C. Feichtenhofer, A. Pinz, and R. P. Wildes. Bags of spacetime energies for dynamic scene recognition. In CVPR, 2014.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.
- Y. Hanani, N. Levy, and L. Wolf. Evaluating new variants of motion interchange patterns. In CVPR workshop, 2013.
- A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. In ICLR, 2014.
- A. Jain, J. Tompson, Y. LeCun, and C. Bregler. MoDeep: A deep learning framework using motion features for human pose estimation. In ACCV, 2014.
- V. Jain, B. Bollmann, M. Richardson, D. Berger, M. Helmstaedter, K. Briggman, W. Denk, J. Bowden, J. Mendenhall, W. Abraham, K. Harris, N. Kasthuri, K. Hayworth, R. Schalek, J. Tapia, J. Lichtman, and H. Seung. Boundary learning by optimization with topological constraints. In CVPR, 2010.
- S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE TPAMI, 35(1):221–231, 2013.
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
- Y. Jiang, J. Liu, A. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes, 2014.
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008.
- O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, 2012.
- O. Kliper-Gross, T. Hassner, and L. Wolf. The action similarity labeling challenge. TPAMI, 2012.
- O. Kliper-Gross, T. Hassner, and L. Wolf. The one shot similarity metric learning for action recognition. In Workshop on SIMBAD, 2011.
- K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
- A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond gaussian pyramid: Multi-skip feature stacking for action recognition. CoRR, abs/1411.6660, 2014.
- I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, 2003.
- Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
- Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time-series. Brain Theory and Neural Networks, 1995.
- J. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
- P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders, W. Kraaij, A. Smeaton, and G. Quenot. TRECVID'14 – an overview of the goals, tasks, data, evaluation and metrics. In TRECVID, 2014.
- X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. CoRR, abs/1405.4506, 2014.
- X. Ren and M. Philipose. Egocentric recognition of handled objects: Benchmark and analysis. In Egocentric Vision workshop, 2009.
- S. Sadanand and J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012.
- P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In ACM MM, 2007.
- N. Shroff, P. K. Turaga, and R. Chellappa. Moving vistas: Exploiting motion for describing scenes. In CVPR, 2010.
- K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. In CRCV-TR-12-01, 2012.
- N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
- G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV, pages 140–153. Springer, 2010.
- C. Theriault, N. Thome, and M. Cord. Dynamic scene classification: Learning motion descriptors with slow features analysis. In CVPR, 2013.
- S. Turaga, J. Murray, V. Jain, F. Roth, M. Helmstaedter, K. Briggman, W. Denk, and S. Seung. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Comp., 2010.
- L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.
- H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
- X. Peng, Y. Qiao, Q. Peng, and Q. Wang. Large margin dimensionality reduction for action similarity labeling. IEEE Signal Processing Letters, 2014.
- M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
- N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In CVPR, 2014.
- B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.