Large-Scale Video Classification with Convolutional Neural Networks

Computer Vision and Pattern Recognition (CVPR), pp. 1725-1732, 2014

Cited 5997 times
Abstract

Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model.

Introduction
  • Images and videos have become ubiquitous on the internet, which has encouraged the development of algorithms that can analyze their semantic content for various applications, including search and summarization.
  • The key enabling factors behind these image recognition results were techniques for scaling up the networks to tens of millions of parameters, together with massive labeled datasets that can support the learning process.
  • Under these conditions, CNNs have been shown to learn powerful and interpretable image features [28].
  • There are several challenges, however, to extending and applying CNNs in the video setting.
Highlights
  • Images and videos have become ubiquitous on the internet, which has encouraged the development of algorithms that can analyze their semantic content for various applications, including search and summarization
  • We are interested in answering the following questions: what temporal connectivity pattern in a CNN architecture is best at taking advantage of local motion information present in the video? How does the additional motion information influence the predictions of a CNN, and how much does it improve performance overall? We examine these questions empirically by evaluating multiple CNN architectures that each take a different approach to combining information across the time domain (sketched in code after this list)
  • We studied the performance of convolutional neural networks in large-scale video classification
  • We found that CNN architectures are capable of learning powerful features from weakly-labeled data that far surpass feature-based methods in performance, and that these benefits are surprisingly robust to details of the connectivity of the architectures in time
  • An alternative theory is that more careful treatment of camera motion may be necessary, but this requires significant changes to a CNN architecture that we leave for future work
  • We identified mixed-resolution architectures that consist of a low-resolution context stream and a high-resolution fovea stream as an effective way of speeding up CNNs without sacrificing accuracy
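The three connectivity patterns compared in the paper (Early, Late, and Slow Fusion) differ only in where information is merged across time. Below is a minimal PyTorch sketch of that idea; the layer sizes and frame count are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

T = 10  # frames per clip (illustrative)

class EarlyFusion(nn.Module):
    """Fuse time immediately: stack all T RGB frames along the channel axis."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3 * T, 96, kernel_size=11, stride=3)

    def forward(self, clip):                    # clip: (B, T, 3, H, W)
        return self.conv1(clip.flatten(1, 2))   # fold time into channels

class LateFusion(nn.Module):
    """Shared single-frame tower on two frames far apart; merge at the top."""
    def __init__(self):
        super().__init__()
        self.tower = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Linear(2 * 96, 487)        # 487 Sports-1M classes

    def forward(self, clip):
        a = self.tower(clip[:, 0])              # first frame
        b = self.tower(clip[:, -1])             # last frame
        return self.fc(torch.cat([a, b], dim=1))

class SlowFusion(nn.Module):
    """Grow the temporal extent gradually with 3D convolutions."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv3d(3, 96, kernel_size=(4, 11, 11), stride=(2, 3, 3))
        self.conv2 = nn.Conv3d(96, 256, kernel_size=(2, 5, 5))

    def forward(self, clip):                    # clip: (B, T, 3, H, W)
        x = clip.permute(0, 2, 1, 3, 4)         # (B, 3, T, H, W) for Conv3d
        return self.conv2(torch.relu(self.conv1(x)))

clip = torch.rand(2, T, 3, 178, 178)
print(EarlyFusion()(clip).shape, LateFusion()(clip).shape, SlowFusion()(clip).shape)
```

The design trade-off is visible in the code: Early Fusion commits to a fixed temporal window at the first layer, Late Fusion defers all temporal reasoning to the classifier, and Slow Fusion mixes information across time progressively.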
Methods
  • The Sports-1M dataset consists of 1 million YouTube videos annotated with 487 classes.
  • The authors' dataset contains 6 different types of bowling, 7 different types of American football and 23 types of billiards.
  • The annotations are produced automatically by analyzing the text metadata surrounding the videos (see the sketch after this list).
  • A video tagged as soccer may therefore contain several shots of the scoreboard, interviews, news anchors, the crowd, etc.
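The paper does not publish its annotation pipeline, so the following is only a hypothetical sketch of keyword-based weak labeling from metadata; the class list and matching rule are illustrative assumptions.

```python
# Hypothetical sketch of weak labeling from video text metadata.
# The actual Sports-1M pipeline is not published; the class list and
# matching rule below are illustrative assumptions only.
SPORT_CLASSES = ["soccer", "cricket", "ten-pin bowling", "eight-ball"]

def weak_labels(title: str, description: str) -> list[str]:
    """Tag a video with every class name that appears in its metadata."""
    text = f"{title} {description}".lower()
    return [c for c in SPORT_CLASSES if c in text]

# A video tagged "soccer" this way may still contain scoreboards,
# interviews, or crowd shots -- hence the labels are only weak.
print(weak_labels("Amazing soccer goals 2013", "best goals compilation"))
```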
Results
  • The authors first present results on the Sports-1M dataset and qualitatively analyze the learned features and network predictions.
  • Per-group mAP is reported both for training from scratch and for fine-tuning the top 3 layers (Table 4).
  • The identities of the UCF-101 source videos were not available, and the authors cannot guarantee that the Sports-1M dataset has no overlap with UCF-101.
  • These concerns are somewhat mitigated as the authors only use a few sampled clips from every video.
  • The authors use the Slow Fusion network in the UCF-101 experiments, as it provides the best performance on Sports-1M.
  • Training the entire network from scratch consistently leads to massive overfitting and dismal performance; retraining only the top layers works markedly better (a sketch follows this list).
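A minimal PyTorch sketch of the "retrain the top layers" transfer setup compared in Table 3. The ImageNet-pretrained resnet18 here is only a stand-in backbone, not the paper's Slow Fusion weights.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # stand-in backbone, not Slow Fusion

# Freeze the pretrained feature extractor and train only a fresh
# classifier head on the 101 UCF-101 action classes.
model = resnet18(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                        # freeze lower layers
model.fc = nn.Linear(model.fc.in_features, 101)    # new trainable head

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3, momentum=0.9)
```

Freezing the lower layers is what counters the overfitting observed when training from scratch on a small dataset: only the head's parameters are exposed to the limited UCF-101 labels.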
Conclusion
  • The authors studied the performance of convolutional neural networks in large-scale video classification.
  • The authors found that CNN architectures are capable of learning powerful features from weakly-labeled data that far surpass feature-based methods in performance, and that these benefits are surprisingly robust to details of the connectivity of the architectures in time.
  • The authors' results indicate that while the performance is not sensitive to the architectural details of the connectivity in time, a Slow Fusion model consistently performs better than the early and late fusion alternatives.
  • The authors identified mixed-resolution architectures that consist of a low-resolution context stream and a high-resolution fovea stream as an effective way of speeding up CNNs without sacrificing accuracy (a sketch of the two streams follows this list).
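A minimal sketch of how a frame could be split into the two streams. The 178x178 input and the two 89x89 stream sizes follow the paper's multiresolution design; the helper function itself is our illustration.

```python
import torch
import torch.nn.functional as F

def context_and_fovea(frames: torch.Tensor, out: int = 89):
    """Split frames into the two multiresolution streams: a downsampled
    full-frame 'context' and a full-resolution center-crop 'fovea'."""
    b, c, h, w = frames.shape
    context = F.interpolate(frames, size=(out, out), mode="bilinear",
                            align_corners=False)    # whole frame, half res
    top, left = (h - out) // 2, (w - out) // 2
    fovea = frames[:, :, top:top + out, left:left + out]  # sharp center
    return context, fovea

frames = torch.rand(4, 3, 178, 178)        # batch of 178x178 RGB frames
ctx, fov = context_and_fovea(frames)       # each stream is (4, 3, 89, 89)
```

Since both streams are a quarter of the original pixel count, each stream's tower is cheaper to run, which is the source of the reported speedup.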
Tables
  • Table 1: Results on the 200,000 videos of the Sports-1M test set. Hit@k values indicate the fraction of test samples that contained at least one of the ground-truth labels in the top k predictions (see the sketch after this list)
  • Table 2: Classes for which a (motion-aware) Slow Fusion network performs better than the single-frame network, and vice versa
  • Table 3: Results on UCF-101 for various transfer learning approaches using the Slow Fusion network
  • Table 4: Mean average precision of the Slow Fusion network on UCF-101 classes, broken down by category group
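The Hit@k metric from Table 1 is straightforward to compute. A small NumPy sketch; the function name and example data are ours, not the paper's.

```python
import numpy as np

def hit_at_k(scores: np.ndarray, labels: list, k: int) -> float:
    """Fraction of samples whose top-k predictions contain at least one
    ground-truth label. scores: (N, C) class scores; labels: N sets."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [bool(set(row) & gt) for row, gt in zip(topk, labels)]
    return float(np.mean(hits))

# Example: 2 samples, 4 classes, k = 2
scores = np.array([[0.1, 0.7, 0.2, 0.0],
                   [0.9, 0.0, 0.05, 0.05]])
print(hit_at_k(scores, [{2}, {3}], k=2))  # 0.5: only the first sample hits
```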
Related Work
  • The standard approach to video classification [26, 16, 21, 17] involves three major stages: First, local visual features that describe a region of the video are extracted either densely [25] or at a sparse set of interest points [12, 8]. Next, the features are combined into a fixed-size video-level description. One popular approach is to quantize all features using a learned k-means dictionary and accumulate the visual words over the duration of the video into histograms of varying spatio-temporal positions and extents [13]. Lastly, a classifier (such as an SVM) is trained on the resulting "bag of words" representation to distinguish among the visual classes of interest.
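A compact scikit-learn sketch of this three-stage pipeline. Local descriptor extraction is stubbed out with random vectors; a real system would extract HOG/HOF or dense trajectories from each video.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

# Stage 1 (stubbed): each video yields a variable number of 64-d local
# descriptors; here they are random stand-ins for HOG/HOF features.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(rng.integers(50, 200), 64)) for _ in range(20)]
y = rng.integers(0, 2, size=20)          # toy binary labels

# Stage 2a: learn a k-means "visual word" dictionary on all descriptors.
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0)
kmeans.fit(np.vstack(videos))

# Stage 2b: quantize each video's descriptors into a word histogram.
def bow_histogram(desc: np.ndarray) -> np.ndarray:
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=32).astype(float)
    return hist / hist.sum()             # normalize for varying lengths

X = np.stack([bow_histogram(v) for v in videos])

# Stage 3: train an SVM on the bag-of-words representation.
clf = LinearSVC().fit(X, y)
print(clf.score(X, y))
```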

    Convolutional Neural Networks [15] are a biologically-inspired class of deep learning models that replace all three stages with a single neural network that is trained end to end from raw pixel values to classifier outputs. The spatial structure of images is explicitly taken advantage of for regularization through restricted connectivity between layers (local filters), parameter sharing (convolutions) and special local invariance-building neurons (max pooling). Thus, these architectures effectively shift the required engineering from feature design and accumulation strategies to design of the network connectivity structure and hyperparameter choices. Due to computational constraints, CNNs have until recently been applied to relatively small-scale image recognition problems (on datasets such as MNIST, CIFAR-10/100, NORB, and Caltech-101/256), but improvements in GPU hardware have enabled CNNs to scale to networks of millions of parameters, which has in turn led to significant improvements in image classification [11], object detection [20, 9], scene labeling [3], indoor segmentation [4] and house number digit classification [19]. Additionally, features learned by large networks trained on ImageNet [7] have been shown to yield state-of-the-art performance across many standard image recognition datasets when classified with an SVM, even with no fine-tuning [18].
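As a minimal illustration of the three regularizing ingredients named above (local filters, shared parameters, max pooling), here is a toy PyTorch network with placeholder sizes; it is not any architecture from the paper.

```python
import torch.nn as nn

# Each Conv2d applies small local filters whose weights are shared across
# spatial positions; MaxPool2d builds local translation invariance.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),   # local filters, weights shared
    nn.ReLU(),
    nn.MaxPool2d(2),                   # local invariance-building units
    nn.Conv2d(16, 32, kernel_size=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),                 # classifier outputs
)
```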
Contributions
  • Provides an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes
  • Studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up the training
  • Studies the generalization performance of the best model by retraining the top layers on the UCF-101 Action Recognition dataset, observing significant performance improvements compared to the UCF-101 baseline model
  • Studies the performance of CNNs in large-scale video classification, where the networks have access not only to the appearance information present in single, static images, but also to their complex temporal evolution
  • Asks which temporal connectivity pattern in a CNN architecture is best at taking advantage of local motion information present in the video, how the additional motion information influences the predictions of a CNN, and how much it improves performance overall; examines these questions empirically by evaluating multiple CNN architectures that each take a different approach to combining information across the time domain
References
  • [1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, pages 29-39. Springer, 2012.
  • [2] D. Ciresan, A. Giusti, J. Schmidhuber, et al. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, 2012.
  • [3] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 35(8), 2013.
  • [4] C. Couprie, C. Farabet, L. Najman, and Y. LeCun. Indoor semantic segmentation using depth information. In International Conference on Learning Representations, 2013.
  • [5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, 2005.
  • [6] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
  • [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [8] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
  • [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [10] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. PAMI, 35(1):221-231, 2013.
  • [11] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [12] I. Laptev. On space-time interest points. IJCV, 64(2-3):107-123, 2005.
  • [13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
  • [14] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
  • [15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
  • [16] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". In CVPR, 2009.
  • [17] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, pages 392-405. Springer, 2010.
  • [18] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.
  • [19] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In ICPR, 2012.
  • [20] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
  • [21] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
  • [22] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [23] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV. Springer, 2010.
  • [24] M. Varma and A. Zisserman. A statistical approach to texture classification from single images. IJCV, 62(1-2):61-81, 2005.
  • [25] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
  • [26] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
  • [27] W. Yang and G. Toderici. Discriminative tag learning on YouTube videos with latent sub-tags. In CVPR, 2011.
  • [28] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.