Flow-Guided Feature Aggregation for Video Object Detection

ICCV, pp. 408-417, 2017.

DOI: https://doi.org/10.1109/iccv.2017.52

Abstract:

Extending state-of-the-art object detectors from image to video is challenging. The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc. Existing work attempts to exploit temporal information on the box level, but such methods are not trained end-to-end. We present flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection that instead leverages temporal coherence on the feature level, improving per-frame features by aggregating nearby features along motion paths.

Introduction
  • Recent years have witnessed significant progress in object detection [11]. State-of-the-art methods share a similar two-stage structure.
  • A deep fully-convolutional backbone network first computes feature maps over the whole image; a shallow detection-specific network [13, 10, 30, 26, 5] then generates the detection results from the feature maps.
  • These methods achieve excellent results in still images.
  • A state-of-the-art still-image object detector (R-FCN [5] + ResNet-101 [14]) deteriorates remarkably on fast moving objects (Table 1 (a)).
Highlights
  • Recent years have witnessed significant progress in object detection [11]
  • We propose flow-guided feature aggregation (FGFA); a minimal sketch of the aggregation step follows this list.
  • R-FCN [5] replaces the ROI pooling operation on intermediate feature maps with a position-sensitive ROI pooling operation on the final score maps, pushing feature sharing to an extreme. In contrast to these still-image object detection methods, our method focuses on object detection in videos.
  • Besides the standard mean average precision (mAP) scores, we report the mAP scores over the slow, medium, and fast motion groups, denoted as mAP(slow), mAP(medium), and mAP(fast).
  • This work aims at a principled learning framework for video object detection instead of the best system
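The aggregation step highlighted above can be summarized as follows: features of nearby frames are warped to the reference frame along optical flow, then combined with per-location adaptive weights derived from the similarity between the warped and the reference features. Below is a minimal, hypothetical PyTorch sketch of this idea; the flow network (e.g., FlowNet), the embedding sub-network (embed_net), cosine-similarity weights, and softmax normalization across frames follow the paper's description, but module names and implementation details are assumptions, not the released code.

# Sketch of flow-guided feature aggregation (FGFA), assuming PyTorch.
# The backbone, flow network, and embedding network are placeholders.
import torch
import torch.nn.functional as F

def warp_by_flow(feat, flow):
    """Bilinearly warp a feature map (N, C, H, W) with a flow field
    (N, 2, H, W) giving per-pixel (dx, dy) offsets at feature resolution."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)       # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                  # (N, 2, H, W)
    # normalize sampling coordinates to [-1, 1] for grid_sample
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)              # (N, H, W, 2)
    return F.grid_sample(feat, grid_norm, align_corners=True)

def aggregate(ref_feat, nearby_feats, flows, embed_net):
    """Aggregate features of nearby frames into the reference frame i.

    ref_feat:     (1, C, H, W) feature of the reference frame
    nearby_feats: list of (1, C, H, W) features of frames j in [i-K, i+K]
    flows:        list of (1, 2, H, W) flows from frame i to each frame j
    embed_net:    small conv net mapping features to an embedding space
    """
    ref_embed = embed_net(ref_feat)
    warped, logits = [], []
    for feat_j, flow_ij in zip(nearby_feats, flows):
        fj_warp = warp_by_flow(feat_j, flow_ij)
        warped.append(fj_warp)
        # adaptive weight: per-location cosine similarity between embeddings
        sim = F.cosine_similarity(embed_net(fj_warp), ref_embed, dim=1)  # (1, H, W)
        logits.append(sim)
    weights = torch.softmax(torch.stack(logits, dim=0), dim=0)           # normalize over frames
    stacked = torch.stack(warped, dim=0)                                 # (T, 1, C, H, W)
    return (weights.unsqueeze(2) * stacked).sum(dim=0)                   # (1, C, H, W)

The aggregated feature then replaces the single-frame feature as input to the detection network, so the whole pipeline remains trainable end-to-end.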
Methods
  • ImageNet VID dataset [33].
  • It is a prevalent large-scale benchmark for video object detection.
  • Following the protocols in [18, 23], model training and evaluation are performed on the 3,862 video snippets from the training set and the 555 snippets from the validation set, respectively.
  • The snippets are fully annotated, and are at frame rates of 25 or 30 fps in general.
  • The 30 object categories are a subset of the categories in the ImageNet DET dataset.
Results
  • The authors' method significantly improves upon strong single-frame baselines on ImageNet VID [33], especially for more challenging fast moving objects.
  • Evaluation on motion groups shows that detecting fast moving objects is very challenging: mAP is 82.4% for slow motion, and it drops to 51.4% for fast motion (the motion-speed grouping is sketched after this list).
  • Method (c) adds the adaptive weighting module into (b).
  • It obtains a mAP of 74.3%, 2.3% higher than that of (b).
  • The proposed FGFA method improves the overall mAP score by 2.9%, and mAP (fast) by 6.2%, compared to the single-frame baseline (a).
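The slow/medium/fast groups in these results are defined by object motion speed, measured as the averaged intersection-over-union (motion IoU) between an object's box and the same object's boxes in nearby frames: the lower the motion IoU, the faster the motion. The sketch below illustrates that grouping, assuming the ±10-frame window and the 0.9/0.7 thresholds reported in the paper; the box format and helper names are hypothetical and should be checked against the released evaluation code.

# Hedged sketch of the motion-speed grouping used for per-group mAP.
# Boxes are (x1, y1, x2, y2); `track` maps frame index -> box of one
# ground-truth object instance.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def motion_group(track, frame, window=10):
    """Classify the instance at `frame` as slow / medium / fast."""
    ref = track[frame]
    ious = [iou(ref, track[f])
            for f in range(frame - window, frame + window + 1)
            if f != frame and f in track]
    if not ious:
        return "slow"  # isolated instance; default choice is an assumption
    m = sum(ious) / len(ious)
    if m > 0.9:
        return "slow"
    if m >= 0.7:
        return "medium"
    return "fast"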
Conclusion
  • This work presents an accurate, end-to-end and principled learning framework for video object detection.
  • Because the approach focuses on improving feature quality, it would be complementary to existing box-level frameworks for better accuracy on video frames.
  • The authors' method could further leverage a better adaptive memory scheme in the aggregation, instead of the attention model currently used.
  • The authors believe these open questions will inspire more future work.
Summary
  • Introduction:

    Recent years have witnessed significant progress in object detection [11]. State-of-the-art methods share a similar two-stage structure.
  • A deep fully-convolutional backbone network first computes feature maps over the whole image; a shallow detection-specific network [13, 10, 30, 26, 5] then generates the detection results from the feature maps.
  • These methods achieve excellent results in still images.
  • A state-of-the-art still-image object detector (R-FCN [5] + ResNet-101 [14]) deteriorates remarkably on fast moving objects (Table 1 (a))
  • Objectives:

    Given the input video frames {I_i}, i = 1, ..., ∞, the authors aim to output object bounding boxes on all the frames, {y_i}, i = 1, ..., ∞.
  • This work aims at a principled learning framework for video object detection instead of the best system
  • Methods:

    ImageNet VID dataset [33].
  • It is a prevalent large-scale benchmark for video object detection.
  • Following the protocols in [18, 23], model training and evaluation are performed on the 3,862 video snippets from the training set and the 555 snippets from the validation set, respectively.
  • The snippets are fully annotated, and are at frame rates of 25 or 30 fps in general.
  • The 30 object categories are a subset of the categories in the ImageNet DET dataset.
  • Results:

    The authors' method significantly improves upon strong single-frame baselines on ImageNet VID [33], especially for more challenging fast moving objects.
  • Evaluation on motion groups shows that detecting fast moving objects is very challenging: mAP is 82.4% for slow motion, and it drops to 51.4% for fast motion.
  • Method (c) adds the adaptive weighting module into (b).
  • It obtains a mAP of 74.3%, 2.3% higher than that of (b).
  • The proposed FGFA method improves the overall mAP score by 2.9%, and mAP (fast) by 6.2%, compared to the single-frame baseline (a)
  • Conclusion:

    This work presents an accurate, end-to-end and principled learning framework for video object detection.
  • Because the approach focuses on improving feature quality, it would be complementary to existing box-level frameworks for better accuracy on video frames.
  • The authors' method could further leverage a better adaptive memory scheme in the aggregation, instead of the attention model currently used.
  • The authors believe these open questions will inspire more future work.
Tables
  • Table1: Accuracy and runtime of different methods on ImageNet VID validation, using ResNet-50 and ResNet-101 feature extraction networks. The relative gains compared to the single-frame baseline (a) are listed in the subscript
  • Table2: Detection accuracy of small (area < 50² pixels), medium (50² ≤ area ≤ 150² pixels), and large (area > 150² pixels) object instances for the single-frame baseline (entry (a)) in Table 1
  • Table3: Results of using different numbers of frames in training and inference, using ResNet-50. Default parameters are indicated by *
  • Table4: Results of the baseline method and FGFA, before and after combination with box-level techniques. For runtime, the entry marked with * uses a CPU implementation of the box-level techniques
Related work
  • Object detection from images. State-of-the-art methods for general object detection [10, 30, 26, 5] are mainly based on deep CNNs [22, 36, 40, 14]. In [11], a multi-stage pipeline called Regions with Convolutional Neural Networks (R-CNN) is proposed for training deep CNNs to classify region proposals for object detection. To speed up, ROI pooling on feature maps shared over the whole image is introduced in SPP-Net [13] and Fast R-CNN [10]. In Faster R-CNN [30], the region proposals are generated by the Region Proposal Network (RPN), and features are shared between RPN and Fast R-CNN. Most recently, R-FCN [5] replaces the ROI pooling operation on intermediate feature maps with a position-sensitive ROI pooling operation on the final score maps, pushing feature sharing to an extreme.
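Because the still-image detector used throughout this work is R-FCN, a hedged NumPy sketch of its position-sensitive ROI pooling is given below. Each of the k × k bins of a region proposal averages only its own group of score-map channels, and the pooled bins are then averaged ("voted") into per-class scores. The channel layout, k = 7, and 31 classes (30 ImageNet VID categories plus background) are illustrative assumptions, not the released implementation.

# Minimal sketch of position-sensitive ROI pooling as used in R-FCN.
import numpy as np

def ps_roi_pool(score_maps, roi, k=7, num_classes=31):
    """score_maps: (k*k*num_classes, H, W) position-sensitive score maps.
    roi: (x1, y1, x2, y2) in feature-map coordinates.
    Returns per-class scores of shape (num_classes,)."""
    ch, H, W = score_maps.shape
    assert ch == k * k * num_classes
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / k
    bin_h = (y2 - y1) / k
    pooled = np.zeros((num_classes, k, k), dtype=np.float32)
    for i in range(k):          # bin row
        for j in range(k):      # bin column
            ys = min(max(int(np.floor(y1 + i * bin_h)), 0), H - 1)
            ye = min(max(int(np.ceil(y1 + (i + 1) * bin_h)), ys + 1), H)
            xs = min(max(int(np.floor(x1 + j * bin_w)), 0), W - 1)
            xe = min(max(int(np.ceil(x1 + (j + 1) * bin_w)), xs + 1), W)
            # bin (i, j) reads only its own group of channels (assumed layout)
            group = score_maps[(i * k + j) * num_classes:(i * k + j + 1) * num_classes]
            pooled[:, i, j] = group[:, ys:ye, xs:xe].mean(axis=(1, 2))
    return pooled.mean(axis=(1, 2))   # vote over bins -> (num_classes,)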
Funding
  • Our method significantly improves upon strong single-frame baselines on ImageNet VID [33], especially for more challenging fast moving objects
  • Evaluation on motion groups shows that detecting fast moving objects is very challenging: mAP is 82.4% for slow motion, and it drops to 51.4% for fast motion
  • Method (c) adds the adaptive weighting module into (b). It obtains a mAP of 74.3%, 2.3% higher than that of (b)
  • The improvement for fast motion is more significant (52.3% → 57.6%)
  • The proposed FGFA method improves the overall mAP score by 2.9%, and mAP (fast) by 6.2% compared to the single-frame baseline (a)
Reference
  • N. Ballas, L. Yao, C. Pal, and A. Courville. Delving deeper into convolutional networks for learning video representations. In ICLR, 2016. 3
  • T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, 2004. 3
  • T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. TPAMI, 2011. 3
  • L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015. 5
  • J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016. 1, 2, 3, 5
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. arXiv preprint arXiv:1703.06211, 2017. 5
  • J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015. 3
  • A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015. 1, 3, 5
  • M. Fayyaz, M. Hajizadeh Saffar, M. Sabokrou, M. Fathy, R. Klette, and F. Huang. STFCN: Spatio-temporal FCN for semantic video segmentation. In ACCV Workshop, 2016. 3
  • R. Girshick. Fast R-CNN. In ICCV, 2015. 1, 2, 3
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1, 2, 3
  • W. Han, P. Khorrami, T. Le Paine, P. Ramachandran, M. Babaeizadeh, H. Shi, J. Li, S. Yan, and T. S. Huang. Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465, 2016. 1, 2, 7
  • K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 1, 2
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 1, 2, 5
  • B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 1981. 3
  • H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016. 3
  • E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017. 3
  • K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, and W. Ouyang. T-CNN: Tubelets with convolutional neural networks for object detection from videos. arXiv preprint arXiv:1604.02532, 2016. 1, 2, 5, 7
  • K. Kang, W. Ouyang, H. Li, and X. Wang. Object detection from video tubelets with convolutional neural networks. In CVPR, 2016. 1, 2
  • A. Kar, N. Rai, K. Sikka, and G. Sharma. AdaScan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In CVPR, 2017. 3
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 3
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 1, 2
  • B. Lee, E. Erdenee, S. Jin, M. Y. Nam, Y. G. Jung, and P. K. Rhee. Multi-class multi-object tracking using changing point detection. In ECCV, 2016. 1, 2, 5
  • Z. Li, E. Gavves, M. Jain, and C. G. Snoek. VideoLSTM convolves, attends and flows for action recognition. arXiv preprint arXiv:1607.01794, 2016. 3
  • L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In ICCV, 2015. 3
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016. 1, 2, 3
  • C. Luo, J. Zhan, L. Wang, and Q. Yang. Cosine normalization: Using cosine similarity instead of dot product in neural networks. arXiv preprint arXiv:1702.05870, 2017. 4
  • A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. arXiv preprint arXiv:1611.00850, 2016. 3
  • E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke. YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video. In CVPR, 2017. 8
  • S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. 1, 2, 3
  • J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015. 3
  • A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. In EMNLP, 2015. 4
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and F.-F. Li. ImageNet large scale visual recognition challenge. IJCV, 2015. 1, 2, 5
  • S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. In ICLR Workshop, 2016. 3
  • M. Siam, S. Valipour, M. Jagersand, and N. Ray. Convolutional gated recurrent networks for video segmentation. arXiv preprint arXiv:1611.05435, 2016. 3
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 1, 2
  • N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014. 5
  • L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015. 3
  • C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016. 5
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015. 1, 2
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015. 3
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Deep End2End Voxel2Voxel prediction. In CVPR Workshop, 2016. 3
  • J. Weickert, A. Bruhn, T. Brox, and N. Papenberg. A survey on variational optic flow methods for small displacements. In Mathematical Models for Registration and Applications to Medical Imaging, 2006. 3
  • P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In ICCV, 2013. 3
  • J. Yang, H. Shuai, Z. Yu, R. Fan, Q. Ma, Q. Liu, and J. Deng. Efficient object detection from videos. http://image-net.org/challenges/talks/2016/Imagenet2016VID.pptx, 2016. 8
  • L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In ICCV, 2015. 3
  • J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015. 3
  • H. Zhang and N. Wang. On the stability of video detection and tracking. arXiv preprint arXiv:1611.06467, 2016. 8
  • X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei. Deep feature flow for video recognition. In CVPR, 2017. 3, 5, 6