Multi-shot Temporal Event Localization: a Benchmark

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Abstract

Current developments in temporal event or action localization usually target actions captured by a single camera. However, extensive events or actions in the wild may be captured as a sequence of shots by multiple cameras at different positions. In this paper, we propose a new and challenging task called multi-shot temporal event localization...

Introduction
  • Driven by the increasing number of videos generated, shared, and consumed every day, video understanding has attracted growing attention in computer vision in recent years.
  • As one of the pillars of video understanding, temporal event localization [10, 44, 58, 75, 82, 83, 85, 86] is a challenging task that aims to predict the semantic label of an action and, at the same time, locate its start time and end time in a long video (see the code sketch after this list).
  • Automating this process is of great importance for many applications, e.g., security surveillance, home care, and human-computer interaction.
  • A basic but time-consuming step is to find the materials, i.e., the video segments of interest; progress in multi-shot temporal event localization will enable efficient material extraction and greatly improve the productivity of video content generation.
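
To make the task definition above concrete, here is a minimal Python sketch (not taken from the paper) that represents a localized event as a (label, start, end) triple and computes the temporal IoU between a predicted and an annotated segment. All names and numbers are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class EventInstance:
    """A localized event: semantic label plus start/end times in seconds."""
    label: str
    start: float
    end: float

def temporal_iou(a: EventInstance, b: EventInstance) -> float:
    """Temporal intersection-over-union between two segments."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted "fight" segment vs. the annotated one.
pred = EventInstance("fight", 12.0, 20.0)
gt = EventInstance("fight", 10.0, 18.0)
print(f"tIoU = {temporal_iou(pred, gt):.2f}")  # 0.60
```

A prediction is typically counted as correct when its label matches and its tIoU with a ground-truth segment exceeds a chosen threshold, which is how the mAP metrics reported below are defined.
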
Highlights
  • Driven by the increasing number of videos generated, shared, and consumed every day, video understanding has attracted growing attention in computer vision in recent years.
  • We report the average mean Average Precision (mAP), computed over five IoU thresholds.
  • Existing methods cannot adequately handle the intra-instance variations caused by frequent shot cuts, as TV shows and movies are usually post-processed with professional editing techniques.
  • To enable automatic video content generation with efficient and scalable material extraction in TV shows and movies, we define a new task called multi-shot temporal event localization that aims to localize events or actions captured in multiple shots.
  • A comprehensive evaluation shows that state-of-the-art methods in this field fail to cope with frequent shot cuts, revealing the difficulty of the MUlti-Shot EventS (MUSES) dataset.
  • MUSES could serve as a benchmark dataset and facilitate research in temporal event localization.
Methods
  • Removing temporal aggregation leads to a drop of 2.8% in mAP at IoU=0.5 on MUSES.
  • Replacing temporal aggregation with dilated 1D convolution or deformable 1D convolution also degrades performance.
  • Integrating multiple temporal aggregation modules at different scales outperforms the single-scale counterparts.
  • The performance gain stems from temporal aggregation enhancing, to some extent, the feature coherence within each instance.
  • Following the procedure described in Sec. 3.3, the authors compute the standard deviation of self-similarities and find that it decreases from 0.16 to 0.09 after temporal aggregation is applied (an illustrative sketch of such a module follows this list).
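
The authors' temporal aggregation module is only summarized here, so the following PyTorch sketch merely illustrates the idea probed in the ablations above: 1D temporal convolutions at several scales whose fused output is added back to the input snippet features, plus a rough self-similarity statistic in the spirit of the Sec. 3.3 measurement. Kernel sizes, channel counts, the fusion scheme, and the similarity measure are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTemporalAggregation(nn.Module):
    """Illustrative multi-scale temporal aggregation over snippet features.

    Input/output shape: (batch, channels, time). Each branch applies a 1D
    convolution with a different kernel size; the branch outputs are fused
    and added back to the input as a residual. A sketch, not the authors'
    architecture.
    """

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernel_sizes
        )
        self.fuse = nn.Conv1d(channels * len(kernel_sizes), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([F.relu(branch(x)) for branch in self.branches], dim=1)
        return x + self.fuse(multi_scale)

def self_similarity_std(features: torch.Tensor) -> float:
    """Std of pairwise cosine similarities between the snippets of one
    instance -- a rough proxy for intra-instance feature coherence.
    `features` has shape (channels, time)."""
    f = F.normalize(features, dim=0)                  # unit-norm per time step
    sims = f.t() @ f                                  # (time, time) cosine similarities
    off_diag = ~torch.eye(sims.size(0), dtype=torch.bool)
    return sims[off_diag].std().item()

# Toy usage: one instance with 30 snippets of 1024-d features.
feats = torch.randn(1, 1024, 30)
module = MultiScaleTemporalAggregation(1024)
with torch.no_grad():
    print(self_similarity_std(feats[0]), self_similarity_std(module(feats)[0]))
```

If the aggregated features become more coherent within an instance, the standard deviation of their pairwise similarities should shrink, which is the effect the authors quantify (0.16 to 0.09).
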
Results
  • Evaluation Metrics

    To evaluate the performance of multi-shot temporal event localization, the authors employ mean Average Precision (mAP).
  • The authors report the average mAP, computed over five IoU thresholds (a toy sketch of this metric follows this list).
  • To quantify how well algorithms handle shot cuts, the authors further report mAP for instances with different numbers of shots.
  • The authors divide all instances into 3 groups according to the number of shots, and report mAP_small (fewer than 10 shots), mAP_medium (10 to 20 shots), and mAP_large (more than 20 shots).
  • In MUSES, the 3 groups make up 39.8%, 27.5%, and 32.7% of all instances, respectively.
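
To make the evaluation protocol concrete, here is a toy single-class Python sketch (not the official evaluation code): it computes average precision at one temporal-IoU threshold via greedy matching and then averages over several thresholds. The threshold set and the simplified AP integration are assumptions for illustration.

```python
import numpy as np

def toy_ap(preds, gts, iou_thr):
    """Toy average precision for one class at a single tIoU threshold.

    preds: list of (video_id, start, end, score); gts: list of (video_id, start, end).
    Greedy matching by descending score; simplified w.r.t. official protocols.
    """
    preds = sorted(preds, key=lambda p: -p[3])
    matched, tps = set(), []
    for vid, s, e, _ in preds:
        best_iou, best_j = 0.0, None
        for j, (gvid, gs, ge) in enumerate(gts):
            if gvid != vid or j in matched:
                continue
            inter = max(0.0, min(e, ge) - max(s, gs))
            union = (e - s) + (ge - gs) - inter
            iou = inter / union if union > 0 else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thr:
            matched.add(best_j)
            tps.append(1.0)
        else:
            tps.append(0.0)
    tp_cum = np.cumsum(tps)
    recall = tp_cum / max(len(gts), 1)
    precision = tp_cum / np.arange(1, len(tps) + 1)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):   # rectangle-rule area under the P-R curve
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def average_map(preds, gts, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Average over several tIoU thresholds (this threshold set is an assumption)."""
    return float(np.mean([toy_ap(preds, gts, t) for t in thresholds]))

# Toy example with one video, two ground-truth events and three detections.
gts = [("v1", 10.0, 18.0), ("v1", 40.0, 50.0)]
preds = [("v1", 12.0, 20.0, 0.9), ("v1", 41.0, 49.0, 0.8), ("v1", 70.0, 75.0, 0.3)]
print(f"average mAP: {average_map(preds, gts):.2f}")  # 0.85 for these toy segments
```

The per-group scores (mAP_small, mAP_medium, mAP_large) could then be obtained by restricting the ground-truth instances, and the predictions matched to them, to each shot-count group.
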
Conclusion
  • Cutting TV shows and movies into concise and attractive short videos has become a popular way of increasing click-through rates on video-sharing platforms, where localizing the temporal segments of interest is the first step.
  • Yet temporal event or action localization in such video sources has been largely overlooked by the research community.
  • To enable automatic video content generation with efficient and scalable material extraction in TV shows and movies, the authors define a new task called multi-shot temporal event localization that aims to localize events or actions captured in multiple shots.
  • MUSES provides rich multi-shot instances with frequent shot cuts, which induce large intra-instance variations and bring new challenges to current approaches.
  • MUSES could serve as a benchmark dataset and facilitate research in temporal event localization.
Objectives
  • The goal of this work is to build a large-scale dataset for temporal event localization, especially in the multi-shot scenario.
Tables
  • Table 1: Comparing MUSES with existing datasets for temporal event localization
  • Table 2: Performance analysis of temporal aggregation in terms of mAP
  • Table 3: Performance evaluation of state-of-the-art methods on the newly collected MUSES dataset
  • Table 4: Performance comparison on THUMOS14 in terms of mAP
  • Table 5: Performance comparison on the validation set of
Related work
  • Our work targets temporal event or action localization by contributing a new benchmark dataset. Existing datasets are mainly built upon user-generated videos, where less professional editing is involved. For example, THUMOS14 [28] focuses on sports events. ActivityNet-1.3 [8] extends the taxonomy from sports to human daily activities and significantly increases the number of categories and samples. HACS Segments [84] shares the same lexicon as ActivityNet and further increases the size. In comparison, our dataset is based on drama videos processed by professional editing with frequent shot cuts, so the intra-instance variances are much greater.
Findings
  • Our comprehensive evaluations show that the state-of-the-art method in temporal action localization only achieves an mAP of 13.1% at IoU=0.5
  • We outperform the second-best entry, G-TAD [75], by an absolute improvement of 5.3% in mAP at IoU=0.5
References
  • Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A largescale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016. 3
  • Humam Alwassel, Fabian Caba Heilbron, Victor Escorcia, and Bernard Ghanem. Diagnosing error in temporal action detectors. In ECCV, pages 256–272, 2018. 6, 7
  • Humam Alwassel, Fabian Caba Heilbron, and Bernard Ghanem. Action search: Spotting actions in videos and its application to temporal action localization. In ECCV, pages 251–266, 2018. 3
  • Piotr Bojanowski, Remi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, and Josef Sivic. Weakly supervised action labeling in videos under ordering constraints. In ECCV, pages 628–643, 2014
  • S Buch, V Escorcia, B Ghanem, L Fei-Fei, and JC Niebles. End-to-end, single-stream temporal action detection in untrimmed videos. In BMVC, 2017. 3, 8
  • Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. Sst: Single-stream temporal action proposals. In CVPR, pages 6373–6382, 2017. 3
  • Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In CVPR, pages 1914–1923, 2016. 3
  • Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, pages 961–970, 2015. 2, 3, 4, 5
  • Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 4724–4733, 2017. 3, 5, 12
  • Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A Ross, Jia Deng, and Rahul Sukthankar. Rethinking the faster r-cnn architecture for temporal action localization. In CVPR, pages 1130–1139, 2018. 1, 3, 5, 8, 12
  • Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017. 5, 7
  • Xiyang Dai, Bharat Singh, Guyue Zhang, Larry S Davis, and Yan Qiu Chen. Temporal context network for activity localization in videos. In ICCV, pages 5727–5736, 2017. 6, 8, 12
  • Li Ding and Chenliang Xu. Weakly-supervised action segmentation with iterative soft boundary assignment. In CVPR, pages 6508–6516, 2018. 3
  • Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. Daps: Deep action proposals for action understanding. In ECCV, pages 768–784, 2016. 3
  • Alireza Fathi, Xiaofeng Ren, and James M Rehg. Learning to recognize objects in egocentric activities. In CVPR, pages 3281–3288, 2011. 3
  • Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. TPAMI, 28(4):594–611, 2006. 2
  • Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, pages 6202–6211, 2019. 3
  • Christoph Feichtenhofer, Axel Pinz, and Richard P. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, pages 3468–3476, 2016. 3
  • Jiyang Gao, Kan Chen, and Ram Nevatia. Ctap: Complementary temporal action proposal generation. In ECCV, September 2018. 3
  • Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, and Ram Nevatia. Turn tap: Temporal unit regression network for temporal action proposals. In ICCV, pages 3648–3656, 2017. 3
  • Ross Girshick. Fast r-cnn. In ICCV, pages 1440–1448, 2015. 5, 12
  • Lena Gorelick, Moshe Blank, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. TPAMI, 29(12):2247–2253, 2007. 3
  • Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, number 4, page 5, 2017. 3
  • Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, pages 6047–6056, 2018. 3
  • Fabian Caba Heilbron, Wayner Barrios, Victor Escorcia, and Bernard Ghanem. Scc: Semantic context cascade for efficient action detection. In CVPR, pages 3175–3184, 2017. 3
  • Rui Hou, Chen Chen, and Mubarak Shah. Tube convolutional neural network (t-cnn) for action detection in videos. In ICCV, pages 5822–5831, 2017. 3
  • Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recognition. In ICCV, pages 3192–3199, 2013. 3
  • YG Jiang, Jingen Liu, A Roshan Zamir, G Toderici, I Laptev, Mubarak Shah, and Rahul Sukthankar. Thumos challenge: Action recognition with a large number of classes, 2014. 2, 3, 4, 5
  • Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatiotemporal action localization. In ICCV, pages 4405–4413, 2017. 3
  • Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014. 3
  • Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goaldirected human activities. In CVPR, pages 780–787, 2014. 3
  • Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. HMDB: A large video database for human motion recognition. In Dimitris N. Metaxas, Long Quan, Alberto Sanfeliu, and Luc Van Gool, editors, ICCV, pages 2556–2563, 2011. 3
  • Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. Temporal convolutional networks for action segmentation and detection. In CVPR, pages 156–165, 2017. 3
  • Colin Lea, Austin Reiter, Rene Vidal, and Gregory D Hager. Segmental spatiotemporal cnns for fine-grained action segmentation. In ECCV, pages 36–52, 2016. 3
  • Peng Lei and Sinisa Todorovic. Temporal deformable residual networks for action segmentation in videos. In CVPR, pages 6742–6751, 2018. 3
  • Shi-Jie Li, Yazan AbuFarha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. TPAMI, 2020. 3
  • Yixuan Li, Zixu Wang, Limin Wang, and Gangshan Wu. Actions as moving points. ECCV, 2020. 3
  • Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In ICCV, pages 7083–7093, 2019. 3
  • Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. Bmn: Boundary-matching network for temporal action proposal generation. In ICCV, pages 3889–3898, 2019. 5, 8, 12
  • Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot temporal action detection. In ACM MM, pages 988–996, 2017. 3
  • Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. Bsn: Boundary sensitive network for temporal action proposal generation. In ECCV, September 2018. 3, 5, 12
  • Daochang Liu, Tingting Jiang, and Yizhou Wang. Completeness modeling and context separation for weakly supervised temporal action localization. In CVPR, pages 1298–1307, 2019. 3
  • Yuan Liu, Lin Ma, Yifeng Zhang, Wei Liu, and Shih-Fu Chang. Multi-granularity generator for temporal action proposal. In CVPR, pages 3604–3613, 2019. 3
  • Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei. Gaussian temporal awareness networks for action localization. In CVPR, pages 344–353, 2019. 1, 3, 8
  • Shugao Ma, Leonid Sigal, and Stan Sclaroff. Learning activity progression in lstms for activity detection and early detection. In CVPR, June 2016. 3
  • Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. TPAMI, 42(2):502–508, 2019. 3
  • Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. Weakly supervised action localization by sparse temporal pooling network. In CVPR, pages 6752–6761, 2018. 3
  • Bingbing Ni, Xiaokang Yang, and Shenghua Gao. Progressively parsing interactional objects for fine grained action detection. In CVPR, pages 1020–1028, 2016. 3
  • Sujoy Paul, Sourya Roy, and Amit K. Roy-Chowdhury. Wtalc: Weakly-supervised temporal activity localization and classification. In ECCV, September 2018. 3
  • Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatiotemporal representation with pseudo-3d residual networks. In ICCV, pages 5533–5541, 2017. 3, 12
  • Alexander Richard and Juergen Gall. Temporal action detection using a statistical language model. In CVPR, pages 3131–3140, 2016. 3
  • Alexander Richard, Hilde Kuehne, and Juergen Gall. Weakly supervised action learning with rnn based fine-to-coarse modeling. In CVPR, pages 754–763, 2017. 3
  • Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, and Bernt Schiele. Recognizing fine-grained and composite activities using hand-centric features and script data. IJCV, 119(3):346–373, 2016. 3
  • Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. In ICPR, volume 3, pages 32–36, 2004. 3
  • Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In CVPR, pages 2616–2625, 2020. 3
  • Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. Cdc: Convolutional-deconvolutional networks for precise temporal action localization in untrimmed videos. In ICCV, pages 1417–1426, 2017. 3, 8
  • Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, and Shih-Fu Chang. Autoloc: Weakly-supervised temporal action localization in untrimmed videos. In ECCV, pages 154–171, 2018. 3
  • Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR, pages 1049–1058, 2016. 1, 3, 5, 8, 12
  • Gunnar A Sigurdsson, Gul Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, pages 510–526, 2016. 3
  • Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014. 3, 12
  • Gurkirt Singh, Suman Saha, Michael Sapienza, Philip HS Torr, and Fabio Cuzzolin. Online real-time multiple spatiotemporal action localisation and prediction. In ICCV, pages 3637–3646, 2017. 3
  • Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012. 3
  • Sebastian Stein and Stephen J McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 729–738, 2013. 3
  • Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015. 3
  • Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, pages 6450–6459, 2018. 3
  • Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, pages 3551–3558, 2013. 3
  • Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool. Untrimmednets for weakly supervised action recognition and detection. In CVPR, pages 4325–4334, 2017. 3
  • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36, 2016. 3
  • Zhenzhi Wang, Ziteng Gao, Limin Wang, Zhifeng Li, and Gangshan Wu. Boundary-aware cascade networks for temporal action segmentation. In ECCV, 2020. 3
  • Jianchao Wu, Zhanghui Kuang, Limin Wang, Wayne Zhang, and Gangshan Wu. Context-aware rcnn: A baseline for action detection in videos. ECCV, 2020. 3
  • Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 1492–1500, 2017. 5
  • Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, pages 305–321, 2018. 3
  • Yuanjun Xiong, Limin Wang, Zhe Wang, Bowen Zhang, Hang Song, Wei Li, Dahua Lin, Yu Qiao, Luc Van Gool, and Xiaoou Tang. CUHK & ETHZ & SIAT submission to ActivityNet challenge 2016. arXiv preprint arXiv:1608.00797, 2016. 12
  • Huijuan Xu, Abir Das, and Kate Saenko. R-c3d: region convolutional 3d network for temporal activity detection. In ICCV, pages 5794–5803, 2017. 2, 3, 8
  • Mengmeng Xu, Chen Zhao, David S Rojas, Ali Thabet, and Bernard Ghanem. G-TAD: Sub-graph localization for temporal action detection. In CVPR, pages 10156–10165, 2020. 1, 7, 8
  • Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, and Bolei Zhou. Temporal pyramid network for action recognition. In CVPR, pages 591–600, 2020. 3
  • Xitong Yang, Xiaodong Yang, Ming-Yu Liu, Fanyi Xiao, Larry S Davis, and Jan Kautz. Step: Spatio-temporal progressive learning for video action detection. In CVPR, pages 264–272, 2019. 3
  • Guangnan Ye, Yitong Li, Hongliang Xu, Dong Liu, and Shih-Fu Chang. Eventnet: A large scale structured concept library for complex event detection in video. In ACM MM, pages 471–480, 2015. 3
  • Serena Yeung, Olga Russakovsky, Greg Mori, and Li FeiFei. End-to-end learning of action detection from frame glimpses in videos. In CVPR, pages 2678–2687, 2016. 3
  • Tan Yu, Zhou Ren, Yuncheng Li, Enxu Yan, Ning Xu, and Junsong Yuan. Temporal structure mining for weakly supervised action detection. In ICCV, pages 5522–5531, 2019. 3
  • Jun Yuan, Bingbing Ni, Xiaokang Yang, and Ashraf A Kassim. Temporal action localization with pyramid of score distribution features. In CVPR, pages 3093–3102, 2016. 3
  • Ze-Huan Yuan, Jonathan C Stroud, Tong Lu, and Jia Deng. Temporal action localization by structured maximal sums. In CVPR, volume 2, page 7, 2017. 1, 3, 8
  • Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, and Chuang Gan. Graph convolutional networks for temporal action localization. In ICCV, pages 7094–7103, 2019. 1, 2, 3, 5, 7, 8, 12
  • Hang Zhao, Antonio Torralba, Lorenzo Torresani, and Zhicheng Yan. HACS: human action clips and segments dataset for recognition and temporal localization. In ICCV, pages 8667–8677. IEEE, 2019. 2, 3, 4
  • Peisen Zhao, Lingxi Xie, Chen Ju, Ya Zhang, Yanfeng Wang, and Qi Tian. Bottom-up temporal action localization with mutual regularization. In ECCV, 2020. 1, 2, 3, 5, 7, 8
  • Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In ICCV, 2017. 1, 3, 5, 8
  • Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016. 5