Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Recognition

CVPR, pp. 6165-6175, 2021

Abstract

In recent years, a number of approaches based on 2D CNNs and 3D CNNs have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out an in-depth comparative analysis to better understand the differences between these approaches and the progress made by ...

Introduction
  • With the recent advances in convolutional neural networks (CNNs) [45, 19] and the availability of large-scale video datasets [25, 32], deep learning approaches have dominated the field of video action recognition by using 2D CNNs [52, 29, 5] or 3D CNNs [2, 18, 7] or both [30, 44].
  • 2D CNNs perform temporal modeling independently of the 2D spatial convolutions, while their 3D counterparts learn space and time information jointly through 3D convolutions.
  • These methods have achieved state-of-the-art performance on multiple large-scale benchmarks such as Kinetics [25] and Something-Something [16].
  • Variations in training and evaluation protocols, model inputs, and pretrained models from approach to approach further confound the comparison.
Highlights
  • With the recent advances in convolutional neural networks (CNNs) [45, 19] and the availability of large-scale video datasets [25, 32], deep learning approaches have dominated the field of video action recognition by using 2D CNNs [52, 29, 5] or 3D CNNs [2, 18, 7] or both [30, 44]
  • We study the effects of several factors on 2D and 3D models, including i) Input sampling, ii) Backbone network, iii) Input length, iv) Temporal pooling, and v) Temporal aggregation.
  • Our results show that I3D remains one of the most competitive approaches for action recognition, and that the progress in accuracy on action recognition is largely due to the use of more powerful backbone networks.
  • The only exception is 3D models (I3D) on Mini-Kinetics, where dense sampling is 1∼2% better than uniform sampling (a minimal sketch contrasting the two sampling strategies follows this list).
  • In this paper, we conducted a comprehensive comparative analysis of several representative CNN-based video action recognition approaches with different backbones and temporal aggregations
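As a rough illustration of the two input sampling strategies compared above, the following sketch (our own, not the authors' code; segment counts and strides are illustrative) shows uniform sampling, which picks one frame from each of several equal-length segments spanning the whole video, versus dense sampling, which reads a short strided window starting at a random position.

```python
# Illustrative sketch (not the authors' code) of the two input-sampling
# strategies compared in the paper: uniform sampling spreads the selected
# frames evenly over the whole clip, while dense sampling reads a
# contiguous (optionally strided) window from a random start position.
import random
from typing import List


def uniform_sample_indices(num_video_frames: int, num_frames: int) -> List[int]:
    """Pick one frame from each of `num_frames` equal-length segments."""
    segment_len = num_video_frames / num_frames
    return [
        min(int(segment_len * i + random.random() * segment_len), num_video_frames - 1)
        for i in range(num_frames)
    ]


def dense_sample_indices(num_video_frames: int, num_frames: int, stride: int = 2) -> List[int]:
    """Read `num_frames` frames with a fixed stride from a random start."""
    span = num_frames * stride
    start = random.randint(0, max(num_video_frames - span, 0))
    return [min(start + i * stride, num_video_frames - 1) for i in range(num_frames)]


if __name__ == "__main__":
    print(uniform_sample_indices(300, 8))  # frames spread over the whole clip
    print(dense_sample_indices(300, 8))    # a short window of nearby frames
```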
Results
  • Experimental Results and Analysis: the authors provide a detailed analysis of the performance of 2D and 3D models (Sec. 5.1), their SOTA results and transferability (Sec. 5.2), and their spatio-temporal effects (Sec. 5.3), as well as the temporal dynamics of the datasets (Sec. 5.4).
  • The authors experiment with three backbones (InceptionV1, ResNet18, and ResNet50) and two scenarios (w/ and w/o temporal pooling) on three datasets.
  • Based on these models, the authors study the effects of several factors on 2D and 3D models, including i) Input sampling, ii) Backbone network, iii) Input length, iv) Temporal pooling, and v) Temporal aggregation (a minimal sketch of simple temporal aggregation strategies follows this list).
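The temporal aggregation factor refers to how per-frame predictions or features are combined into a video-level prediction. The snippet below is a minimal sketch (ours, not the paper's implementation) of two simple strategies: TSN-style average consensus and max pooling over time.

```python
# Minimal sketch (ours, not the authors' code) of two simple temporal
# aggregation strategies for frame-level outputs: average consensus
# (TSN-style) and max pooling over the temporal dimension.
import torch


def aggregate(frame_logits: torch.Tensor, mode: str = "avg") -> torch.Tensor:
    """frame_logits: (batch, time, num_classes) per-frame class scores."""
    if mode == "avg":
        return frame_logits.mean(dim=1)        # average consensus (TSN-style)
    if mode == "max":
        return frame_logits.max(dim=1).values  # max pooling over time
    raise ValueError(f"unknown aggregation mode: {mode}")


if __name__ == "__main__":
    logits = torch.randn(4, 8, 200)        # 4 clips, 8 frames, 200 classes
    print(aggregate(logits, "avg").shape)  # torch.Size([4, 200])
```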
Conclusion
  • The authors conducted a comprehensive comparative analysis of several representative CNN-based video action recognition approaches with different backbones and temporal aggregations.
  • The authors' extensive analysis enables better understanding of the differences and spatio-temporal effects of 2D-CNN and 3D-CNN approaches.
  • It provides significant insights regarding the efficacy of spatio-temporal representations for action recognition.
Tables
  • Table1: Table 1
  • Table2: Overview of datasets
  • Table3: Training protocol
  • Table4: Video-level model accuracies on Mini-Kinetics and Mini-
  • Table5: Performance of different temporal aggregation strategies w/o temporal pooling
  • Table6: Performance of SOTA models
  • Table7: Top-1 Acc. of Transferability study from Kinetics
  • Table8: Effects of spatiotemporal modeling
  • Table9: The class overlap ratio, recognition accuracies and average temporal gains (in parenthesis) of the temporal and static datasets identified by human and machine
  • Table10: Results of temporality analysis on Kinetics by removing temporal classes
Related work
  • Video understanding is a challenging problem with great application potential. Over the last few years, video understanding has made rapid progress with the introduction of a number of large-scale video datasets such as Kinetics [25], Sports1M [24], Moments-In-Time [32], and YouTube-8M [1]. A number of recently introduced models have emphasized the need to efficiently model spatio-temporal information for action recognition. The most successful deep architectures for action recognition are usually based on the two-stream model [41], processing RGB frames and optical flow in two separate Convolutional Neural Networks (CNNs) with a late fusion in the upper layers [24]. Over the last few years, two-stream approaches have been used in a variety of action recognition methods [3, 4, 15, 58, 43, 49, 54, 50, 8, 9]. Another straightforward but popular approach is to use a 2D-CNN to extract frame-level features and then model the temporal causality. For example, TSN [52] proposed a consensus module to aggregate the features, while TRN [59] used a bag-of-features idea to model the relationships between frames. TSM [29] shifts part of the channels along the temporal dimension, thereby allowing information to be exchanged among neighboring frames, whereas TAM [5] is based on depthwise 1 × 1 convolutions to capture temporal dependencies across frames effectively. Different methods for the temporal aggregation of feature descriptors have also been proposed [10, 28, 57, 50, 36, 13, 12]. More complex approaches have also been investigated for capturing long-range dependencies, e.g., non-local neural networks [53].
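To make the channel-shift idea concrete, here is a minimal PyTorch sketch of a TSM-style temporal shift (our own illustration, not the official TSM code; the fraction of shifted channels is an assumed, typical value): part of the channels is copied from the next frame and part from the previous frame, so neighboring frames exchange information without any extra convolution.

```python
# Minimal sketch of a TSM-style temporal channel shift (not the authors'
# implementation): a fraction of the channels is shifted one step backward
# in time, another fraction one step forward, and the rest is untouched.
import torch


def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """x: (batch, time, channels, height, width) feature tensor."""
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # these channels see the next frame
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # these channels see the previous frame
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels stay in place
    return out


if __name__ == "__main__":
    clip = torch.randn(2, 8, 64, 56, 56)  # 2 clips, 8 frames, 64 channels
    print(temporal_shift(clip).shape)     # torch.Size([2, 8, 64, 56, 56])
```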
Funding
  • This work is supported by IARPA via DOI/IBC contract number D17PC00341
Study subjects and analysis
large-scale benchmark datasets: 3
In light of the need for a deep analysis of action recognition works, in this paper we provide a common ground for comparative analysis of 2D-CNN and 3D-CNN models without any bells and whistles. We conduct consistent and comprehensive experiments to compare several representative 2D-CNN and 3D-CNN methods on three large-scale benchmark datasets. Our main goal is to deliver a clear understanding of a) how differently 2D-CNN and 3D-CNN methods behave with regard to spatio-temporal modeling of video data; b) whether the state-of-the-art approaches enable more effective learning of spatio-temporal representations of video, as claimed in the papers; and c) the significance of temporal modeling for action recognition.

popular benchmark datasets with different backbone networks: 3
We then re-implemented six representative action recognition approaches, including I3D [2], ResNet3D [18], S3D [56], R(2+1)D [48], TSN [52], and TAM [5], in a unified framework. We trained about 300 action recognition models on three popular benchmark datasets with different backbone networks (InceptionV1, ResNet18, and ResNet50) and input frame counts, using the same initialization and training protocol. We also developed methods to perform a detailed analysis of the spatio-temporal effects of different models across backbones and network architectures.

standard benchmark datasets: 3
• A unified framework for Action Recognition. We present a unified framework for 2D-CNN and 3D-CNN approaches and implement several representative methods for comparative analysis on three standard benchmark datasets. • Spatio-Temporal Analysis

samples: 8
The shorter side of a video is randomly resized to the range of [256, 320] while keeping the aspect ratio, and then we randomly crop a 224×224 spatial region as the training input. We trained all models for 196 epochs, using a total batch size of 1024 on 128 GPUs, i.e., 8 samples per GPU.
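A hedged sketch of the spatial augmentation described above (our own code, not the authors' pipeline): the shorter side is resized to a random value in [256, 320] with the aspect ratio preserved, followed by a random 224×224 crop.

```python
# Hedged sketch (not the authors' pipeline) of the training-time spatial
# augmentation described above: resize the shorter side to a random value
# in [256, 320] while keeping the aspect ratio, then take a random
# 224x224 crop.
import random
import torch
import torch.nn.functional as F


def random_resize_crop(clip: torch.Tensor, crop_size: int = 224) -> torch.Tensor:
    """clip: (time, channels, height, width) float tensor of video frames."""
    t, c, h, w = clip.shape
    short_side = random.randint(256, 320)
    scale = short_side / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    clip = F.interpolate(clip, size=(new_h, new_w), mode="bilinear", align_corners=False)
    top = random.randint(0, new_h - crop_size)
    left = random.randint(0, new_w - crop_size)
    return clip[:, :, top:top + crop_size, left:left + crop_size]


if __name__ == "__main__":
    frames = torch.rand(8, 3, 240, 320)      # 8 RGB frames
    print(random_resize_crop(frames).shape)  # torch.Size([8, 3, 224, 224])
```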

samples: 8
Batch normalization is computed on those 8 samples. We warm up the learning rate from 0.01 to 1.6 linearly over the first 34 epochs and then apply a half-period cosine annealing schedule for the remaining epochs.
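The learning-rate schedule described above can be written as a simple function of the epoch index; the sketch below is our illustration of that schedule (linear warmup from 0.01 to 1.6 over 34 epochs, then half-period cosine annealing over the rest of the 196-epoch run), not the authors' training code.

```python
# Sketch of the learning-rate schedule described above: linear warmup from
# 0.01 to 1.6 over the first 34 epochs, then half-period cosine annealing
# toward 0 for the remaining epochs (196 total). The constants follow the
# text; the function itself is our illustration.
import math


def learning_rate(epoch: int, total_epochs: int = 196, warmup_epochs: int = 34,
                  base_lr: float = 1.6, warmup_start_lr: float = 0.01) -> float:
    if epoch < warmup_epochs:
        # Linear warmup from warmup_start_lr to base_lr.
        return warmup_start_lr + (base_lr - warmup_start_lr) * epoch / warmup_epochs
    # Half-period cosine annealing over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for e in (0, 17, 34, 115, 195):
        print(e, round(learning_rate(e), 4))
```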

Reference
  • Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675, 2016. 2
  • Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 6299–6308, 2017. 1, 2, 3, 6, 13
  • Guilhem Cheron, Ivan Laptev, and Cordelia Schmid. P-cnn: Pose-based cnn features for action recognition. In ICCV, pages 3218–3226, 2015. 2
  • Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, June 2015. 2
  • Quanfu Fan, Chun-Fu (Richard) Chen, Hilde Kuehne, Marco Pistoia, and David Cox. More Is Less: Learning Efficient Video Representations by Temporal Aggregation Modules. In NeurIPS, 2019. 1, 2, 3, 4, 7, 13
  • Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In CVPR, June 2020. 3
  • Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. arXiv:1812.03982, 2018. 1, 3
  • Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In NeurIPS, pages 3468–3476, 2016. 2
  • Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, pages 4768–4777, 2017. 2
  • Basura Fernando, Efstratios Gavves, Jose M Oramas, Amir Ghodrati, and Tinne Tuytelaars. Modeling video evolution for action recognition. In CVPR, pages 5378–5387, 2015. 3
  • Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. Largescale weakly-supervised pre-training for video action recognition. In CVPR, pages 12046–12055, 2019. 3
  • Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In CVPR, pages 244–253, 2019. 3
  • Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. Actionvlad: Learning spatio-temporal aggregation for action classification. In CVPR, pages 971– 980, 2017. 3
  • Rohit Girdhar, Du Tran, Lorenzo Torresani, and Deva Ramanan. Distinit: Learning video representations without a single labeled video. arXiv:1901.09244, 2019. 3
  • Georgia Gkioxari and Jitendra Malik. Finding action tubes. In CVPR, pages 759–768, 2015. 2
  • Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017. 1, 4
  • Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Learning spatio-temporal features with 3d residual networks for action recognition. In ICCV, pages 3154–3160, 2017. 3
  • Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? In CVPR, June 2018. 1, 2, 3
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, June 2016. 1
  • De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, and Juan Carlos Niebles. What makes a video a video: Analyzing temporal information in video understanding models and datasets. In CVPR, pages 7366–7375, 2018. 3
  • Noureldien Hussein, Efstratios Gavves, and Arnold W.M. Smeulders. Timeception for complex action recognition. In CVPR, June 2019. 3
  • Matthew Hutchinson, Siddharth Samsi, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Micheal Houle, Matthew Hubbell, Micheal Jones, Jeremy Kepner, et al. Accuracy and performance comparison of video action recognition approaches. arXiv:2008.09037, 2020. 2
  • S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE TPAMI, 35(1):221–231, Jan 2013. 3
  • Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014. 2
  • Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv:1705.06950, 2017. 1, 2, 4
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011. 7
  • Hilde Kuehne, Alexander Richard, and Juergen Gall. Weakly supervised learning of actions from transcripts. Computer Vision and Image Understanding, 163:78–89, 2017. 3
  • Guy Lev, Gil Sadeh, Benjamin Klein, and Lior Wolf. Rnn fisher vectors for action recognition and image annotation. In ECCV, pages 833–850. Springer, 2016. 3
  • Ji Lin, Chuang Gan, and Song Han. Temporal Shift Module for Efficient Video Understanding. In ICCV, 2019. 1, 3, 7
  • Chenxu Luo and Alan L Yuille. Grouped spatial-temporal aggregation for efficient action recognition. In ICCV, pages 5512–5521, 2019. 1
  • Joanna Materzynska, Guillaume Berger, Ingo Bax, and Roland Memisevic. The jester dataset: A large-scale video dataset of human gestures. In ICCV Workshops, Oct 2019. 7
  • Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Yan Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. IEEE TPAMI, 2019. 1, 2, 4
  • Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, pages 4594–4602, 2016. 3
  • Rameswar Panda and Amit K Roy-Chowdhury. Collaborative summarization of topic-related videos. In CVPR, pages 7083–7092, 2017. 3
  • Sujoy Paul, Sourya Roy, and Amit K Roy-Chowdhury. Wtalc: Weakly-supervised temporal activity localization and classification. In ECCV, pages 563–579, 2018. 3
  • Xiaojiang Peng, Changqing Zou, Yu Qiao, and Qiang Peng. Action recognition with stacked fisher vectors. In ECCV, pages 581–595. Springer, 2014. 3
  • Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatiotemporal representation with pseudo-3d residual networks. In ICCV, Oct 2017. 3
  • Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, and Lorenzo Torresani. Only time can tell: Discovering temporal data for temporal modeling. CoRR, abs/1907.08340, 2019. 9
  • Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR, pages 1049–1058, 2016. 3
  • Gunnar A Sigurdsson, Olga Russakovsky, and Abhinav Gupta. What actions are needed for understanding human actions in videos? In ICCV, pages 2137–2146, 2017. 3
  • Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014. 2
  • Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv, 2012. 7
  • Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In ICML, pages 843–852, 2015. 2
  • Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. Gate-shift networks for video action recognition. In CVPR, pages 1102–1111, 2020. 1
  • C Szegedy, Wei Liu, Yangqing Jia, P Sermanet, S Reed, D Anguelov, D Erhan, V Vanhoucke, and A Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015. 1
  • Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning Spatiotemporal Features With 3D Convolutional Networks. In ICCV, 2015. 3
  • Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In ICCV, October 2019. 3
  • Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In CVPR, June 2018. 2, 3, 13
  • Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to sequence-video to text. In ICCV, pages 4534– 4542, 2015. 2
  • Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, pages 4305–4314, 2015. 2, 3
  • Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool. Untrimmednets for weakly supervised action recognition and detection. In CVPR, pages 4325–4334, 2017. 3
  • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV. Springer, 2016. 1, 2, 3, 4, 7
  • Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, June 2018. 3, 7
  • Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Learning to track for spatio-temporal action localization. In ICCV, pages 3164–3172, 2015. 2
  • Yuxin Wu and Kaiming He. Group normalization. In ECCV, September 2018. 4
  • Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification. In ECCV, Sept. 2018. 2, 3, 7, 13
  • Zhongwen Xu, Yi Yang, and Alex G Hauptmann. A discriminative cnn video representation for event detection. In CVPR, pages 1798–1807, 2015. 3
  • Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, pages 4694–4702, 2015. 2
  • Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In ECCV, pages 803–818, 2018. 3, 4
  • Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. Eco: Efficient convolutional network for online video understanding. In ECCV, pages 695–712, 2018. 3