Large-Scale Video Classification with Convolutional Neural Networks
Computer Vision and Pattern Recognition (CVPR), pp. 1725–1732, 2014
Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information.
- Images and videos have become ubiquitous on the internet, which has encouraged the development of algorithms that can analyze their semantic content for various applications, including search and summarization.
- The key enabling factors behind these results were techniques for scaling up the networks to tens of millions of parameters and massive labeled datasets that can support the learning process.
- Under these conditions, CNNs have been shown to learn powerful and interpretable image features.
- There are several challenges to extending and applying CNNs in this setting.
- We are interested in answering the following questions: what temporal connectivity pattern in a Convolutional Neural Network (CNN) architecture is best at taking advantage of local motion information present in the video? How does the additional motion information influence the predictions of a CNN, and how much does it improve performance overall? We examine these questions empirically by evaluating multiple CNN architectures that each take a different approach to combining information across the time domain.
- We studied the performance of convolutional neural networks in large-scale video classification.
- We found that CNN architectures are capable of learning powerful features from weakly-labeled data that far surpass feature-based methods in performance, and that these benefits are surprisingly robust to details of the connectivity of the architectures in time.
- An alternative theory is that more careful treatment of camera motion may be necessary, but this requires significant changes to a CNN architecture that we leave for future work.
- We identified mixed-resolution architectures that consist of a low-resolution context stream and a high-resolution fovea stream as an effective way of speeding up CNNs without sacrificing accuracy.
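The temporal connectivity patterns compared above differ mainly in how frames enter the network. A minimal NumPy sketch of the input side of each variant (clip length, frame size, and window/stride values are illustrative stand-ins, not necessarily the paper's exact configuration):

```python
import numpy as np

# Hypothetical clip: T frames of H x W x 3 pixels (sizes are illustrative).
T, H, W, C = 10, 170, 170, 3
clip = np.random.rand(T, H, W, C)

def single_frame(clip):
    # Baseline: one frame, no temporal information at all.
    return clip[T // 2]                                    # (H, W, C)

def early_fusion(clip, window=10):
    # Fuse at the first layer: stack a window of frames along channels,
    # so the first-layer filters see motion directly.
    return np.concatenate(clip[:window], axis=-1)          # (H, W, window*C)

def late_fusion(clip, gap=9):
    # Two single-frame towers, `gap` frames apart; their outputs are
    # only merged by the later, fully connected layers.
    return clip[0], clip[gap]

def slow_fusion(clip, window=4, stride=2):
    # Several overlapping short windows; the temporal extent of what
    # higher layers see grows gradually with depth.
    starts = range(0, T - window + 1, stride)
    return [np.concatenate(clip[s:s + window], axis=-1) for s in starts]
```

Early fusion exposes motion to the first layer, late fusion compares distant frames only at the top, and slow fusion interpolates between the two by growing temporal extent with depth.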
- The Sports-1M dataset consists of 1 million YouTube videos annotated with 487 classes.
- The authors' dataset contains 6 different types of bowling, 7 different types of American football and 23 types of billiards.
- The annotations are produced automatically by analyzing the text metadata surrounding the videos.
- A video tagged as soccer may contain several shots of the scoreboard, interviews, news anchors, the crowd, etc.
- The authors first present results on the Sports-1M dataset and qualitatively analyze the learned features and network predictions.
- Flattened table header: Group | mAP from scratch | mAP fine-tune top 3
- For all groups, these were not available, and the authors cannot guarantee that the Sports-1M dataset has no overlap with UCF-101.
- These concerns are somewhat mitigated as the authors only use a few sampled clips from every video.
- The authors use the Slow Fusion network in the UCF-101 experiments as it provides the best performance on Sports-1M.
- Training the entire network from scratch consistently leads to massive overfitting and dismal performance.
- The authors studied the performance of convolutional neural networks in large-scale video classification.
- The authors found that CNN architectures are capable of learning powerful features from weakly-labeled data that far surpass feature-based methods in performance, and that these benefits are surprisingly robust to details of the connectivity of the architectures in time.
- The authors' results indicate that while the performance is not sensitive to the architectural details of the connectivity in time, a Slow Fusion model consistently performs better than the early and late fusion alternatives.
- The authors identified mixed-resolution architectures that consist of a low-resolution context stream and a high-resolution fovea stream as an effective way of speeding up CNNs without sacrificing accuracy.
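The two-stream input can be sketched in a few lines, assuming a square frame and crude stride-based downsampling (the stream size `s` is illustrative; the actual model would use proper image resampling):

```python
import numpy as np

def two_streams(frame, s=89):
    # Context stream: the whole frame downsampled to s x s (here by
    # simple striding; a real pipeline would low-pass filter first).
    h, w, _ = frame.shape
    context = frame[::h // s, ::w // s][:s, :s]
    # Fovea stream: an s x s center crop at full resolution, exploiting
    # the camera bias that tends to put the subject near the center.
    top, left = (h - s) // 2, (w - s) // 2
    fovea = frame[top:top + s, left:left + s]
    return context, fovea

frame = np.random.rand(178, 178, 3)
context, fovea = two_streams(frame)
# Each stream is only s x s, so each processes a fraction of the
# original pixel count, which is where the speedup comes from.
```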
- Table 1: Results on the 200,000 videos of the Sports-1M test set. Hit@k values indicate the fraction of test samples that contained at least one of the ground truth labels in the top k predictions.
- Table 2: Classes for which a (motion-aware) Slow Fusion
- Table 3: Results on UCF-101 for various transfer learning approaches using the Slow Fusion network.
- Table 4: Mean Average Precision of the Slow Fusion network on UCF-101 classes broken down by category groups.
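The Hit@k metric defined in the Table 1 caption is simple to compute; a small sketch with toy scores and (possibly multiple) ground-truth labels per video:

```python
import numpy as np

def hit_at_k(scores, labels, k):
    """Fraction of samples whose top-k predictions contain at least one
    ground-truth label (the Hit@k metric from the Table 1 caption)."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [len(set(row) & truth) > 0 for row, truth in zip(topk, labels)]
    return float(np.mean(hits))

# Toy example: 3 samples, 4 classes, label sets per video.
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.4, 0.3, 0.2, 0.1],
                   [0.1, 0.2, 0.3, 0.4]])
labels = [{1}, {2}, {0}]
```

A per-class mAP, as in Table 4, would instead rank all test samples by each class score and average the precision over classes.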
- The standard approach to video classification [26, 16, 21, 17] involves three major stages: First, local visual features that describe a region of the video are extracted either densely or at a sparse set of interest points [12, 8]. Next, the features get combined into a fixed-sized video-level description. One popular approach is to quantize all features using a learned k-means dictionary and accumulate the visual words over the duration of the video into histograms of varying spatio-temporal positions and extents. Lastly, a classifier (such as an SVM) is trained on the resulting "bag of words" representation to distinguish among the visual classes of interest.
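The quantize-and-accumulate stages of that pipeline can be sketched compactly; the nearest-centroid quantizer below stands in for a learned k-means dictionary, and the descriptors and dictionary are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(descriptors, dictionary):
    # Stage 2a: assign each local descriptor to its nearest visual word.
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def bag_of_words(descriptors, dictionary):
    # Stage 2b: accumulate word counts over the whole video into a
    # fixed-size, normalized histogram.
    words = quantize(descriptors, dictionary)
    hist = np.bincount(words, minlength=len(dictionary)).astype(float)
    return hist / hist.sum()

dictionary = rng.normal(size=(50, 64))   # 50 visual words, 64-D descriptors
video = rng.normal(size=(300, 64))       # 300 local descriptors from one video
h = bag_of_words(video, dictionary)
# Stage 3 would train an SVM on one such histogram per video.
```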
Convolutional Neural Networks are a biologically-inspired class of deep learning models that replace all three stages with a single neural network that is trained end to end from raw pixel values to classifier outputs. The spatial structure of images is explicitly taken advantage of for regularization through restricted connectivity between layers (local filters), parameter sharing (convolutions) and special local invariance-building neurons (max pooling). Thus, these architectures effectively shift the required engineering from feature design and accumulation strategies to design of the network connectivity structure and hyperparameter choices. Due to computational constraints, CNNs have until recently been applied to relatively small-scale image recognition problems (on datasets such as MNIST, CIFAR-10/100, NORB, and Caltech-101/256), but improvements on GPU hardware have enabled CNNs to scale to networks of millions of parameters, which has in turn led to significant improvements in image classification, object detection [20, 9], scene labeling, indoor segmentation and house number digit classification. Additionally, features learned by large networks trained on ImageNet have been shown to yield state-of-the-art performance across many standard image recognition datasets when classified with an SVM, even with no fine-tuning.
- Provides an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes
- Studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up the training
- Studies the generalization performance of the best model by retraining the top layers on the UCF-101 Action Recognition dataset and observes significant performance improvements compared to the UCF-101 baseline model
- Studies the performance of CNNs in large-scale video classification, where the networks have access to not only the appearance information present in single, static images, but also their complex temporal evolution
- Is interested in answering the following questions: what temporal connectivity pattern in a CNN architecture is best at taking advantage of local motion information present in the video? How does the additional motion information influence the predictions of a CNN, and how much does it improve performance overall? Examines these questions empirically by evaluating multiple CNN architectures that each take a different approach to combining information across the time domain
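The transfer-learning recipe the summary describes (keep the pretrained lower layers fixed, retrain only the top layers on the target dataset, since training from scratch overfits) can be sketched with a stand-in feature extractor; the frozen random projection, layer sizes, and training loop below are illustrative, not the paper's network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained lower layers: a frozen random projection with a
# ReLU (the paper instead transfers conv layers trained on Sports-1M).
W_frozen = rng.normal(size=(32, 8)) / np.sqrt(32)

def features(x):
    return np.maximum(x @ W_frozen, 0.0)

def train_top_layer(X, y, classes=3, lr=0.5, steps=200):
    """Retrain only a softmax top layer on the new dataset, keeping the
    transferred layers fixed, which limits overfitting on small data."""
    F = features(X)
    W = np.zeros((F.shape[1], classes))
    onehot = np.eye(classes)[y]
    for _ in range(steps):
        logits = F @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # Gradient of softmax cross-entropy w.r.t. the top-layer weights.
        W -= lr * F.T @ (p - onehot) / len(X)
    return W

# Toy data whose labels are a learnable function of the frozen features.
X = rng.normal(size=(90, 32))
y = np.argmax(features(X) @ rng.normal(size=(8, 3)), axis=1)
W_top = train_top_layer(X, y)
acc = np.mean(np.argmax(features(X) @ W_top, axis=1) == y)
```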
- M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, pages 29–39. Springer, 2012.
- D. Ciresan, A. Giusti, J. Schmidhuber, et al. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, 2012.
- C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 35(8), 2013.
- C. Couprie, C. Farabet, L. Najman, and Y. LeCun. Indoor semantic segmentation using depth information. International Conference on Learning Representations, 2013.
- N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, 2005.
- J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013.
- A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- I. Laptev. On space-time interest points. IJCV, 64(2-3):107–123, 2005.
- I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
- Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". In CVPR, 2009.
- J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, pages 392–405. Springer, 2010.
- A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.
- P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In ICPR, 2012.
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
- J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
- K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV. Springer, 2010.
- M. Varma and A. Zisserman. A statistical approach to texture classification from single images. IJCV, 62(1-2):61–81, 2005.
- H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR. IEEE, 2011.
- H. Wang, M. M. Ullah, A. Klaser, I. Laptev, C. Schmid, et al. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
- W. Yang and G. Toderici. Discriminative tag learning on YouTube videos with latent sub-tags. In CVPR, 2011.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.