Beyond Short Snippets: Deep Networks for Video Classification

IEEE Conference on Computer Vision and Pattern Recognition, 2015.

Cited by: 1745
Other Links: dblp.uni-trier.de|academic.microsoft.com|arxiv.org

Abstract:

Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted.

Introduction
  • Convolutional Neural Networks have proven highly successful at static image recognition problems such as the MNIST, CIFAR, and ImageNet Large-Scale Visual Recognition Challenge [15, 21, 28].
  • By using a hierarchy of trainable filters and feature pooling operations, CNNs automatically learn the complex features required for visual object recognition, achieving performance superior to hand-crafted features
  • Encouraged by these positive results, several approaches have recently been proposed to apply CNNs to video and action classification tasks [2, 13, 14, 19].
  • Since each individual video frame forms only a small part of the video’s story, classifying frames independently uses incomplete information and can confuse classes, especially when there are fine-grained distinctions or when portions of the video are irrelevant to the action of interest
Highlights
  • Convolutional Neural Networks have proven highly successful at static image recognition problems such as the MNIST, CIFAR, and ImageNet Large-Scale Visual Recognition Challenge [15, 21, 28]
  • We investigate various feature-pooling architectures that are agnostic to temporal order, as well as Long Short-Term Memory (LSTM) networks, which can learn from temporally ordered sequences
  • LSTM architecture: in contrast to max-pooling, which produces order-invariant representations, we propose using a recurrent neural network to explicitly model sequences of CNN activations (see the sketch after this list)
  • We empirically evaluate the proposed architectures on the Sports-1M and UCF-101 datasets with the goals of investigating the performance of the proposed architectures, quantifying the effect of the number of frames and frame rates on classification performance, and understanding the importance of motion information through optical flow models
  • Our 120-frame model improves upon previous work [19] (82.6% vs. 73.0%) among models that learn directly from raw frames without optical flow information
  • We presented two video-classification methods capable of aggregating frame-level CNN outputs into video-level predictions: feature-pooling methods, which max-pool local information through time, and LSTMs, whose hidden state evolves with each subsequent frame
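Below is a minimal PyTorch sketch of the two aggregation strategies described above: order-invariant max-pooling of frame-level CNN features over time (Conv-Pooling style) versus an LSTM whose hidden state evolves frame by frame. This is not the authors' implementation; the module names, layer sizes, and the 487-class output (matching Sports-1M) are illustrative assumptions.

```python
# Hedged sketch, not the paper's code: two ways to turn per-frame CNN features
# (e.g. from GoogLeNet) into a single video-level prediction.
import torch
import torch.nn as nn

class FramePooling(nn.Module):
    """Conv-Pooling-style aggregation: max-pool frame features over time (order-invariant)."""
    def __init__(self, feature_dim=1024, num_classes=487):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, frame_features):          # (batch, time, feature_dim)
        pooled, _ = frame_features.max(dim=1)   # temporal max-pooling
        return self.classifier(pooled)

class FrameLSTM(nn.Module):
    """LSTM aggregation: hidden state evolves with each subsequent frame."""
    def __init__(self, feature_dim=1024, hidden_dim=512, num_classes=487):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_features):          # (batch, time, feature_dim)
        outputs, _ = self.lstm(frame_features)
        return self.classifier(outputs[:, -1])  # classify from the final time step

# Example: a batch of 4 videos, 30 frames each, 1024-d features per frame.
features = torch.randn(4, 30, 1024)
print(FramePooling()(features).shape, FrameLSTM()(features).shape)  # (4, 487) each
```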
Methods
  • The same network with fine-tuning achieves 69.5% Hit@1 (Hit@k is sketched after this list)
  • Note that these results do not use data augmentation and classify the entire 300 seconds of a video.
  • Our 120-frame model improves upon previous work [19] (82.6% vs. 73.0%) among models that learn directly from raw frames without optical flow information
  • This is a direct result of considering larger context within a video, even when the frames within a short clip are highly similar to each other.
  • This results from UCF-101 videos being better centered, less shaky, and better trimmed to the action in question than the average YouTube video
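The Hit@1 and Hit@5 figures quoted in this section are video-level top-k accuracies. A minimal NumPy sketch of such a metric, with illustrative names not taken from the paper:

```python
import numpy as np

def hit_at_k(scores, labels, k=1):
    """Fraction of videos whose true label is among the top-k predicted classes.

    scores: (num_videos, num_classes) video-level class scores
    labels: (num_videos,) ground-truth class indices
    """
    topk = np.argsort(-scores, axis=1)[:, :k]        # indices of the k highest scores
    hits = (topk == labels[:, None]).any(axis=1)     # is the true label in the top-k?
    return hits.mean()

# Toy example: random scores over 487 classes (the Sports-1M label set size).
scores = np.random.rand(1000, 487)
labels = np.random.randint(0, 487, size=1000)
print(hit_at_k(scores, labels, k=1), hit_at_k(scores, labels, k=5))
```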
Results
  • The authors empirically evaluate the proposed architectures on the Sports-1M and UCF-101 datasets with the goals of investigating the performance of the proposed architectures, quantifying the effect of the number of frames and frame rates on classification performance, and understanding the importance of motion information through optical flow models.

  • The videos in this dataset are unconstrained: camera motion is not guaranteed to be well-behaved, so unlike UCF-101, where camera motion is more constrained, the quality of the optical flow varies widely between videos (a sketch of computing flow images follows this list)
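To make the optical-flow input concrete: the paper computes TV-L1 optical flow [26] between adjacent frames and feeds the flow to a frame-level CNN. The sketch below is an assumption-laden stand-in that uses OpenCV's Farnebäck method (available in core OpenCV) and a simple clip-and-rescale encoding of the flow as an image; the file name and the exact encoding are illustrative.

```python
# Hedged sketch, not the paper's pipeline: convert consecutive frames into an
# optical-flow image that a frame-level CNN could consume.
import cv2
import numpy as np

def flow_image(prev_bgr, next_bgr, bound=20.0):
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Clip displacements and rescale to [0, 255] so the flow can be treated
    # like an ordinary image (third channel left at zero in this sketch).
    flow = np.clip(flow, -bound, bound)
    flow = ((flow + bound) / (2 * bound) * 255).astype(np.uint8)
    return np.dstack([flow, np.zeros_like(flow[..., 0])])

cap = cv2.VideoCapture("video.mp4")   # hypothetical input path
ok, prev_frame = cap.read()
flows = []
while ok:
    ok, frame = cap.read()
    if not ok:
        break
    flows.append(flow_image(prev_frame, frame))
    prev_frame = frame
```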
Conclusion
  • The authors presented two video-classification methods capable of aggregating frame-level CNN outputs into video-level predictions: feature-pooling methods, which max-pool local information through time, and LSTMs, whose hidden state evolves with each subsequent frame.
  • Both methods are motivated by the idea that incorporating information across longer video sequences will enable better video classification.
  • The resulting networks achieve state-of-the-art performance on both the Sports-1M and UCF-101 benchmarks, supporting the idea that learning should take place over the entire video rather than short clips
Tables
  • Table 1: Conv-Pooling outperforms all other feature-pooling architectures (Figure 2) on Sports-1M using a 120-frame model.
  • Table 2: GoogLeNet outperforms AlexNet, both alone and when paired with Conv-Pooling or LSTM. Experiments performed on Sports-1M using 30-frame Conv-Pooling and LSTM models. Note that the (fc) models updated only the final layers during training and did not use data augmentation.
  • Table 3: Effect of the number of frames in the model. Both LSTM and Conv-Pooling models use the GoogLeNet CNN.
  • Table 4: Optical flow is noisy on Sports-1M and, if used alone, results in lower performance than equivalent image models. However, if used in conjunction with raw image features, optical flow benefits LSTM. Experiments performed on 30-frame models using GoogLeNet CNNs.
  • Table 5: Leveraging global video-level descriptors, LSTM and Conv-Pooling achieve a 20% increase in Hit@1 compared to prior work on the Sports-1M dataset. Hit@1 and Hit@5 are computed at the video level.
  • Table 6: Lower frame rates produce higher UCF-101 accuracy for 30-frame Conv-Pooling models.
  • Table 7: UCF-101 results. Bold-face numbers represent results higher than previously reported.
Related work
  • Traditional video recognition research has been extremely successful at obtaining global video descriptors that encode both appearance and motion information, providing state-of-the-art results on a large number of video datasets. These approaches aggregate local appearance and motion information using hand-crafted features such as Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Motion Boundary Histogram (MBH) around spatio-temporal interest points [17], in a dense grid [24], or around dense point trajectories [12, 16, 22, 23] obtained through optical-flow-based tracking. These features are then encoded into a global video-level descriptor through bag-of-words (BoW) [17] or Fisher vector based encodings [23] (a brief bag-of-words sketch follows below).

    However, no previous attempts at CNN-based video recognition use both motion information and a global description of the video. Several approaches [2, 13, 14] employ 3D convolution over short video clips, typically just a few seconds, to learn motion features from raw frames implicitly, and then aggregate predictions at the video level. Karpathy et al. [14] demonstrate that their network is only marginally better than a single-frame baseline, which indicates that learning motion features is difficult. In view of this, Simonyan and Zisserman [19] directly incorporate motion information from optical flow, but only sample up to 10 consecutive frames at inference time. The disadvantage of such local approaches is that each frame/clip may contain only a small part of the full video’s information, resulting in a network that performs no better than the naïve approach of classifying individual frames.
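For context, here is a minimal sketch of the classical bag-of-words pipeline mentioned above, in which local descriptors are quantized against a learned codebook and pooled into one global video descriptor. This is background to the related work, not the paper's method; the vocabulary size and function names are illustrative.

```python
# Hedged sketch of the classical pipeline: local descriptors (e.g. HOG/HOF/MBH
# around trajectories) are quantized with a learned codebook and pooled into a
# single bag-of-words histogram per video.
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(all_descriptors, vocab_size=4000):
    # all_descriptors: (N, D) local descriptors sampled from many training videos
    return KMeans(n_clusters=vocab_size, n_init=1).fit(all_descriptors)

def bow_descriptor(video_descriptors, codebook):
    # video_descriptors: (M, D) local descriptors from one video
    words = codebook.predict(video_descriptors)               # nearest codeword per descriptor
    hist = np.bincount(words, minlength=codebook.n_clusters)  # pooled word counts
    return hist / max(hist.sum(), 1)                          # L1-normalized global descriptor
```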
Reference
  • M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Action classification in soccer videos with long short-term memory recurrent neural networks. In Proc. ICANN, pages 154–159, Thessaloniki, Greece, 2010.
  • M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In 2nd International Workshop on Human Behavior Understanding (HBU), pages 29–39, Nov. 2011.
  • Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. on Neural Networks, 5(2):157–166, 1994.
  • Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In Proc. ICML, pages 111–118, Haifa, Israel, 2010.
  • S. Fernández, A. Graves, and J. Schmidhuber. Phoneme recognition in TIMIT with BLSTM-CTC. CoRR, abs/0804.3269, 2008.
  • F. A. Gers, N. N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. JMLR, 3:115–143, 2002.
  • A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proc. ICML, pages 1764–1772, Beijing, China, 2014.
  • A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. PAMI, 31(5):855–868, 2009.
  • A. Graves, A.-R. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. CoRR, abs/1303.5778, 2013.
  • A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Proc. NIPS, pages 545–552, Vancouver, B.C., Canada, 2008.
  • S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, Nov. 1997.
  • M. Jain, H. Jégou, and P. Bouthemy. Better exploiting motion for better action recognition. In Proc. CVPR, pages 2555–2562, Portland, Oregon, USA, 2013.
  • S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Trans. PAMI, 35(1):221–231, Jan. 2013.
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proc. CVPR, pages 1725–1732, Columbus, Ohio, USA, 2014.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, pages 1097–1105, Lake Tahoe, Nevada, USA, 2012.
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In Proc. ICCV, pages 2556–2563, Barcelona, Spain, 2011.
  • I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proc. CVPR, pages 1–8, Anchorage, Alaska, USA, 2008.
  • S. Reiter, B. Schuller, and G. Rigoll. A combined LSTM-RNN-HMM approach for meeting event segmentation and recognition. In Proc. ICASSP, pages 393–396, Toulouse, France, 2006.
  • K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Proc. NIPS, pages 568–576, Montreal, Canada, 2014.
  • K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01, 2012.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
  • H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In Proc. CVPR, pages 3169–3176, Washington, DC, USA, 2011.
  • H. Wang and C. Schmid. Action recognition with improved trajectories. In Proc. ICCV, pages 3551–3558, Sydney, Australia, 2013.
  • H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In Proc. BMVC, pages 1–11, 2009.
  • M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll. LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing, 31(2):153–163, 2013.
  • C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Proc. of the 29th DAGM Conference on Pattern Recognition, pages 214–223, Berlin, Heidelberg, 2007. Springer-Verlag.
  • W. Zaremba and I. Sutskever. Learning to execute. CoRR, abs/1410.4615, 2014.
  • M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. ECCV, pages 818–833, Zurich, Switzerland, 2014.