Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection

CVPR, pp. 13072-13082, 2020.

In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera.

Abstract:

In static monitoring cameras, useful contextual information can stretch far beyond the few seconds typical video understanding models might see: subjects may exhibit similar behavior over multiple days, and background objects remain static. Due to power and storage constraints, sampling frequencies are low, often no faster than one frame ...

Introduction
  • The authors seek to improve recognition within passive monitoring cameras, which are static and collect sparse data over long time horizons. Passive monitoring deployments are ubiquitous; they present unique challenges for computer vision, but also offer unique opportunities that can be leveraged for improved accuracy.

    For example, depending on the triggering mechanism and the camera placement, large numbers of photos at any given camera location can be empty of any objects of interest [31].
  • Because the images in static passive-monitoring cameras are taken automatically, there is no guarantee that the objects of interest will be centered, focused, well-lit, or at an appropriate scale.
  • The authors break these challenges into three categories, each of which can cause failures in single-frame detection networks:
  • Objects can be very close to the camera and occluded by the edges of the frame, partially hidden in the environment due to camouflage, or very far from the camera.
Highlights
  • We seek to improve recognition within passive monitoring cameras, which are static and collect sparse data over long time horizons. Passive monitoring deployments are ubiquitous; they present unique challenges for computer vision, but also offer unique opportunities that can be leveraged for improved accuracy.

    For example, depending on the triggering mechanism and the camera placement, large numbers of photos at any given camera location can be empty of any objects of interest [31].
  • We propose Context R-CNN, which leverages temporal context to improve object detection regardless of frame rate or sampling irregularity.
  • We evaluate all models on held-out camera locations, using established object detection metrics: mean average precision (mAP) at 0.5 IoU and Average Recall (AR); a minimal IoU sketch follows after this list.
  • We find that while a month of context from a feature extractor tuned for Snapshot Serengeti achieves 5.3% higher mAP than one trained only on COCO, we are able to outperform the single-frame model by 12.4% using memory features from an extractor that has never seen a camera trap image.
  • It is apparent from our results that what and how much information is stored in memory is both important and domain specific.
  • We plan to explore this in detail in the future, and hope to develop methods for curating diverse memory banks that are optimized for accuracy and size, to reduce the computational and storage overheads at training and inference time while maintaining performance gains.
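As a quick reference for the metric above, here is a minimal sketch of the intersection-over-union (IoU) computation underlying the mAP@0.5 criterion; the (ymin, xmin, ymax, xmax) box format and function name are illustrative, not tied to the authors' code.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (ymin, xmin, ymax, xmax)."""
    ymin, xmin = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ymax, xmax = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# At the 0.5 IoU threshold used for mAP here, this prediction counts as a match.
print(iou((0, 0, 10, 10), (2, 2, 10, 10)) >= 0.5)  # True (IoU = 0.64)
```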
Methods
  • Context R-CNN builds a “memory bank” from contextual frames captured by the same camera and modifies a detection model to make predictions conditioned on this memory bank (a minimal attention sketch follows below).
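A minimal sketch of how such conditioning could work: box features from the current frame attend over the memory bank of per-box features from contextual frames, and the attention-weighted context is added back to the box features before the detection heads. This is illustrative only; the single-head formulation and the projection names (W_q, W_k, W_v, W_out) are assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_to_memory(box_features, memory, W_q, W_k, W_v, W_out):
    """Condition current-frame box features on a per-camera memory bank.

    box_features: (N, d) proposal features from the current frame (queries).
    memory:       (M, d) box features gathered from contextual frames.
    W_q, W_k, W_v, W_out: (d, d) learned projection matrices (illustrative).
    """
    q = box_features @ W_q                              # (N, d) queries
    k = memory @ W_k                                    # (M, d) keys
    v = memory @ W_v                                    # (M, d) values
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, M) attention weights
    context = weights @ v                               # (N, d) per-box context
    # Add the projected context residually; the result feeds the usual
    # classification and box-refinement heads of the detector.
    return box_features + context @ W_out
```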
Results
  • Context R-CNN strongly outperforms the single-frame Faster R-CNN with Resnet-101 baseline on both the Snapshot Serengeti (SS) and Caltech Camera Traps (CCT) datasets.

    Table 1 (excerpted values, Snapshot Serengeti; each cell is mAP / AR):

    (a) Results across datasets: Single Frame vs. Context R-CNN.

    (b) Time horizon:
        One minute    50.3 / 51.4
        One hour      52.1 / 52.5
        One day       52.5 / 52.9
        One week      54.1 / 53.2
        One month     55.6 / 57.5

    (c) Selecting memory:
        One box per frame     55.6 / 57.5
        COCO features         50.3 / 55.8
        Only positive boxes   53.9 / 56.2
        Subsample half        52.5 / 56.1
        Subsample quarter     50.8 / 55.0

    Baseline comparison:
        Single Frame   37.9 / 46.5
        Maj. Vote      37.8 / 46.4
        ST Spatial     39.6 / 36.0

    A minimal sketch of the memory-selection strategies ablated in (c) follows below.
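To make the memory-selection ablations above concrete, here is a minimal sketch of how a per-camera memory bank might be curated from single-frame detections, covering the "one box per frame", "only positive boxes", and frame-subsampling strategies; all names and the dict layout are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def curate_memory_bank(frames, one_box_per_frame=True, positive_only=False,
                       score_thresh=0.5, subsample_rate=1.0, seed=0):
    """Collect box features from contextual frames into a memory bank.

    `frames` is a list of dicts, one per contextual frame, with:
      'features': (num_boxes, d) array of per-box feature vectors,
      'scores':   (num_boxes,) array of detector confidence scores.
    """
    rng = np.random.default_rng(seed)
    kept = []
    for frame in frames:
        if rng.random() > subsample_rate:   # e.g. 0.5 or 0.25 to subsample frames
            continue
        feats, scores = frame['features'], frame['scores']
        if positive_only:                   # keep only confident ("positive") boxes
            mask = scores >= score_thresh
            feats, scores = feats[mask], scores[mask]
        if scores.size == 0:
            continue
        if one_box_per_frame:               # keep only the highest-scoring box
            feats = feats[np.argmax(scores)][None, :]
        kept.append(feats)
    if not kept:
        dim = frames[0]['features'].shape[-1] if frames else 0
        return np.zeros((0, dim))
    return np.concatenate(kept, axis=0)
```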
Conclusion
  • Conclusions and Future Work

    In this work, the authors contribute a model that leverages per-camera temporal context up to a month, far beyond the time horizon of previous approaches, and show that in the static-camera setting, attention-based temporal context is beneficial.
  • Context R-CNN is general across static camera domains, improving detection performance over single-frame baselines on both camera trap and traffic camera data.
  • Context R-CNN is adaptive and robust to passive-monitoring sampling strategies that provide data streams with low, irregular frame rates.
  • The authors plan to explore memory selection in more detail in future work, and hope to develop methods for curating diverse memory banks that are optimized for accuracy and size, reducing the computational and storage overheads at training and inference time while maintaining performance gains.
Tables
  • Table 1: Results. All results are based on Faster R-CNN with a Resnet-101 backbone.
Related work
  • Single frame object detection. Driven by popular benchmarks such as COCO [25] and Open Images [22], single-frame object detection has advanced considerably in recent years. These detection architectures include anchor-based models, both single-stage (e.g., SSD [27], RetinaNet [24], YOLO [32, 33]) and two-stage (e.g., Fast/Faster R-CNN [14, 18, 34], R-FCN [10]), as well as more recent anchor-free models (e.g., CornerNet [23], CenterNet [56], FCOS [41]). Object detection methods have shown great improvements on COCO- or ImageNet-style images, but these gains do not always generalize to challenging real-world data (see Figure 2).

    Video object detection. These single-frame architectures form the basis for video detection and spatio-temporal action localization models, which incorporate contextual cues from other frames to handle challenges specific to video data, including motion blur, occlusion, and rare poses. Leading methods have used pixel-level flow (or flow-like concepts) to aggregate features [7, 57, 58, 59] or used correlation [13] to densely relate features at the current timestep to an adjacent timestep. Other papers have explored the use of 3D convolutions (e.g., I3D, S3D) [8, 29, 48] or recurrent networks [20, 26] to extract better temporal features. Finally, many works apply video-specific post-processing to smooth predictions over time, including tubelet smoothing [15] or Seq-NMS [16].
Funding
  • This work was supported by NSF GRFP Grant No. 1745301; the views are those of the authors and do not necessarily reflect the views of the NSF.
References
  • [1] Lila.science. http://lila.science/. Accessed: 2019-10-22.
  • [2] Carlos Arteta, Victor Lempitsky, and Andrew Zisserman. Counting in the wild. pages 483–498, 2016.
  • [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [4] Sara Beery, Yang Liu, Dan Morris, Jim Piavis, Ashish Kapoor, Markus Meister, and Pietro Perona. Synthetic examples improve generalization for rare classes. arXiv preprint arXiv:1904.05916, 2019.
  • [5] Sara Beery and Dan Morris. Efficient pipeline for automating species ID in new camera trap projects. Biodiversity Information Science and Standards, 3:e37222, 2019.
  • [6] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pages 456–473, 2018.
  • [7] Gedas Bertasius, Lorenzo Torresani, and Jianbo Shi. Object detection in video with spatiotemporal sampling networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 331–346, 2018.
  • [8] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [9] Antoni B. Chan, Zhang-Sheng John Liang, and Nuno Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. pages 1–7, 2008.
  • [10] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
  • [11] Hanming Deng, Yang Hua, Tao Song, Zongpu Zhang, Zhengui Xue, Ruhui Ma, Neil Robertson, and Haibing Guan. Object guided external memory network for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6678–6687, 2019.
  • [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [13] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, pages 3038–3046, 2017.
  • [14] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
  • [15] Georgia Gkioxari and Jitendra Malik. Finding action tubes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 759–768, 2015.
  • [16] Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S. Huang. Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465, 2016.
  • [17] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [19] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7310–7311, 2017.
  • [20] Kai Kang, Hongsheng Li, Tong Xiao, Wanli Ouyang, Junjie Yan, Xihui Liu, and Xiaogang Wang. Object detection in videos with tubelet proposal networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 727–735, 2017.
  • [21] Sameer Kumar, Victor Bitorff, Dehao Chen, Chiachen Chou, Blake Hechtman, HyoukJoong Lee, Naveen Kumar, Peter Mattson, Shibo Wang, Tao Wang, et al. Scale MLPerf-0.6 models on Google TPU-v3 pods. arXiv preprint arXiv:1909.09756, 2019.
  • [22] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
  • [23] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
  • [24] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
  • [25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  • [26] Mason Liu and Menglong Zhu. Mobile video object detection with temporally-aware feature maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5686–5695, 2018.
  • [27] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
  • [28] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
  • [29] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.
  • [30] Agnieszka Miguel, Sara Beery, Erica Flores, Loren Klemesrud, and Rana Bayrakcismith. Finding areas of motion in camera trap images. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 1334–1338. IEEE, 2016.
  • [31] Mohammad Sadegh Norouzzadeh, Anh Nguyen, Margaret Kosmala, Alexandra Swanson, Meredith S. Palmer, Craig Packer, and Jeff Clune. Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proceedings of the National Academy of Sciences, 115(25):E5716–E5725, 2018.
  • [32] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
  • [33] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
  • [34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • [35] Stefan Schneider, Graham W. Taylor, and Stefan Kremer. Deep learning object detection methods for ecological camera trap data. In 2018 15th Conference on Computer and Robot Vision (CRV), pages 321–328. IEEE, 2018.
  • [36] Ankit Parag Shah, Jean-Bapstite Lamare, Tuan Nguyen-Anh, and Alexander Hauptmann. CADP: A novel dataset for CCTV traffic camera based accident analysis. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–9. IEEE, 2018.
  • [37] Mykhailo Shvets, Wei Liu, and Alexander C. Berg. Leveraging long-range temporal relationships between proposals for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 9756–9764, 2019.
  • [38] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
  • [39] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766, 2019.
  • [40] Alexandra Swanson, Margaret Kosmala, Chris Lintott, Robert Simpson, Arfon Smith, and Craig Packer. Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna. Scientific Data, 2:150026, 2015.
  • [41] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355, 2019.
  • [42] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8769–8778, 2018.
  • [43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [44] Alexander Gomez Villa, Augusto Salazar, and Francisco Vargas. Towards automatic wild animal monitoring: Identification of animal species in camera-trap images using very deep convolutional neural networks. Ecological Informatics, 41:24–32, 2017.
  • [45] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [46] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 284–293, 2019.
  • [47] Haiping Wu, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Sequence level semantics aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 9217–9225, 2019.
  • [48] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321, 2018.
  • [49] Feng Xiong, Xingjian Shi, and Dit-Yan Yeung. Spatiotemporal modeling for crowd counting in videos. pages 5151–5159, 2017.
  • [50] Hayder Yousif, Jianhe Yuan, Roland Kays, and Zhihai He. Fast human-animal detection from highly cluttered camera-trap images using joint background modeling and deep learning classification. In Circuits and Systems (ISCAS), 2017 IEEE International Symposium on, pages 1–4. IEEE, 2017.
  • [51] Xiaoyuan Yu, Jiangping Wang, Roland Kays, Patrick A. Jansen, Tianjiang Wang, and Thomas Huang. Automated identification of animal species in camera trap images. EURASIP Journal on Image and Video Processing, 2013(1):52, 2013.
  • [52] Shanghang Zhang, Guanhang Wu, Joao P. Costeira, and Jose M. F. Moura. FCN-rLSTM: Deep spatio-temporal neural networks for vehicle counting in city cameras. In Proceedings of the IEEE International Conference on Computer Vision, pages 3667–3676, 2017.
  • [53] Shanghang Zhang, Guanhang Wu, Joao P. Costeira, and Jose M. F. Moura. Understanding traffic density from large-scale web camera data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5898–5907, 2017.
  • [54] Zhi Zhang, Zhihai He, Guitao Cao, and Wenming Cao. Animal detection from highly cluttered natural scenes using spatiotemporal object region proposals and patch verification. IEEE Transactions on Multimedia, 18(10):2079–2092, 2016.
  • [55] Han Zhao, Shanghang Zhang, Guanhang Wu, Jose M. F. Moura, Joao P. Costeira, and Geoffrey J. Gordon. Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems, pages 8559–8570, 2018.
  • [56] Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [57] Xizhou Zhu, Jifeng Dai, Lu Yuan, and Yichen Wei. Towards high performance video object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7210–7218, 2018.
  • [58] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 408–417, 2017.
  • [59] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2349–2358, 2017.