Flow-Guided Feature Aggregation for Video Object Detection
ICCV, pp. 408-417, 2017.
Abstract:
Extending state-of-the-art object detectors from image to video is challenging. The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc. Existing work attempts to exploit temporal information on the box level, but such methods are not trained end-to-end. We present flow-guided feature aggregation (FGFA), an accurate and end-to-end learning framework for video object detection that instead works on the feature level: per-frame features are improved by aggregating nearby frames' features along motion paths, which significantly improves detection accuracy, especially for fast moving objects.
Introduction
- Recent years have witnessed significant progress in object detection [11]. State-of-the-art methods share a similar two-stage structure.
- A deep convolutional network [22, 36, 40, 14] first computes feature maps over the whole input image; a shallow detection-specific network [13, 10, 30, 26, 5] then generates the detection results from these feature maps.
- These methods achieve excellent results in still images.
- A state-of-the-art still-image object detector (R-FCN [5] + ResNet-101 [14]) deteriorates remarkably for fast moving objects (Table 1 (a)).
Highlights
- Recent years have witnessed significant progress in object detection [11]
- We propose flow-guided feature aggregation (FGFA); a sketch of the aggregation step is given after this list
- R-FCN [5] replaces the ROI pooling operation on intermediate feature maps with a position-sensitive ROI pooling operation on the final score maps, pushing feature sharing to an extreme. In contrast to these still-image object detection methods, our method focuses on object detection in videos
- Besides the standard mean average precision (mAP) score, we report the mAP scores over the slow, medium, and fast motion groups, respectively denoted as mAP(slow), mAP(medium), and mAP(fast)
- This work aims at a principled learning framework for video object detection instead of the best system
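Below is a minimal PyTorch sketch of the two operations at the heart of FGFA as summarized above: warping a nearby frame's feature map to the reference frame with an estimated optical flow field, and aggregating the warped features with per-position adaptive weights derived from the cosine similarity of embedding features. Function names, tensor layouts, and the softmax normalization are illustrative assumptions, not the authors' released code; the flow network (e.g., FlowNet [8]) and the R-FCN detection head that consumes the aggregated feature are assumed to exist elsewhere.

```python
# Hedged sketch of flow-guided feature aggregation (illustrative names,
# not the authors' implementation).
import torch
import torch.nn.functional as F

def warp_features(feat_j, flow_ji):
    """Warp frame j's feature map toward frame i using the flow i -> j.

    feat_j:  (N, C, H, W) feature map of a nearby frame j
    flow_ji: (N, 2, H, W) flow in pixels; channel 0 is x, channel 1 is y
    """
    _, _, h, w = feat_j.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat_j.dtype, device=feat_j.device),
        torch.arange(w, dtype=feat_j.dtype, device=feat_j.device),
        indexing="ij",
    )
    # Absolute sampling locations in frame j for every position of frame i.
    grid_x = xs.unsqueeze(0) + flow_ji[:, 0]
    grid_y = ys.unsqueeze(0) + flow_ji[:, 1]
    # grid_sample expects (N, H, W, 2) coordinates normalized to [-1, 1].
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(feat_j, grid, mode="bilinear", align_corners=True)

def aggregate(warped_feats, warped_embeds, embed_i):
    """Adaptively weight flow-warped features against the reference frame.

    warped_feats:  list of (N, C, H, W) warped feature maps, one per frame
    warped_embeds: list of (N, E, H, W) warped embedding maps
    embed_i:       (N, E, H, W) embedding of the reference frame i
    """
    # Per-position cosine similarity of each warped embedding with the
    # reference embedding; softmax over frames makes the weights sum to 1.
    sims = torch.stack(
        [F.cosine_similarity(e, embed_i, dim=1) for e in warped_embeds], dim=0
    )                                                # (T, N, H, W)
    weights = F.softmax(sims, dim=0).unsqueeze(2)    # (T, N, 1, H, W)
    feats = torch.stack(warped_feats, dim=0)         # (T, N, C, H, W)
    return (weights * feats).sum(dim=0)              # aggregated (N, C, H, W)
```

The aggregated feature then replaces the single-frame feature as input to the detection network, which is why the approach targets feature quality rather than box-level post-processing.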
Methods
- Experiments use the ImageNet VID dataset [33], a prevalent large-scale benchmark for video object detection.
- Following the protocols in [18, 23], model training and evaluation are performed on the 3,862 video snippets from the training set and the 555 snippets from the validation set, respectively.
- The snippets are fully annotated, and are at frame rates of 25 or 30 fps in general.
- The 30 object categories are a subset of the categories in the ImageNet DET dataset
Results
- The authors' method significantly improves upon strong single-frame baselines in ImageNet VID [33], especially for more challenging fast moving objects.
- Evaluation on motion groups shows that detecting fast moving objects is very challenging: mAP is 82.4% for slow motion, and it drops to 51.4% for fast motion (the motion-speed grouping is sketched after this list).
- Method (c) adds the adaptive weighting module to (b).
- It obtains an mAP of 74.3%, 2.3% higher than that of (b); the improvement for fast motion is more significant (52.3% → 57.6%).
- The proposed FGFA method improves the overall mAP score by 2.9%, and mAP(fast) by 6.2%, compared to the single-frame baseline (a).
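The slow/medium/fast grouping above is based on an object's motion speed, measured as the averaged IoU between its box in the reference frame and the same object's boxes in nearby frames (lower averaged IoU means faster motion). The ±10-frame window and the 0.9/0.7 thresholds in the sketch below are quoted from memory of the paper and should be treated as assumptions; the function names are ours.

```python
# Hedged sketch: group an object instance by motion speed ("motion IoU").
# Window size and thresholds are assumptions, not verified against the paper.
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def motion_group(track, frame_idx, window=10):
    """Classify one instance of a per-frame box track as slow/medium/fast.

    track: dict mapping frame index -> (x1, y1, x2, y2) for a single object.
    """
    ref = track[frame_idx]
    ious = [box_iou(ref, track[j])
            for j in range(frame_idx - window, frame_idx + window + 1)
            if j != frame_idx and j in track]
    score = float(np.mean(ious)) if ious else 1.0  # isolated box: treat as slow
    if score > 0.9:
        return "slow"
    if score < 0.7:
        return "fast"
    return "medium"
```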
Conclusion
- This work presents an accurate, end-to-end and principled learning framework for video object detection.
- Because the approach focuses on improving feature quality, it would be complementary to existing box-level frameworks for better accuracy on video frames.
- The authors' method could further leverage a better adaptive memory scheme in the aggregation, instead of the attention model used.
- The authors believe these open questions will inspire more future work.
Summary
Introduction:
Recent years have witnessed significant progress in object detection [11]. State-of-the-art methods share a similar two-stage structure.
- A deep convolutional network [22, 36, 40, 14] first computes feature maps over the whole input image; a shallow detection-specific network [13, 10, 30, 26, 5] then generates the detection results from these feature maps.
- These methods achieve excellent results in still images.
- A state-of-the-art still-image object detector (R-FCN [5] + ResNet-101 [14]) deteriorates remarkably for fast moving objects (Table 1 (a))
Objectives:
Given the input video frames {I_i}, i = 1, …, ∞, the authors aim to output object bounding boxes on all the frames, {y_i}, i = 1, …, ∞ (the per-frame computation is sketched after this list).
- This work aims at a principled learning framework for video object detection instead of the best system
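For reference, the per-frame computation behind FGFA can be written compactly; the notation below is a paraphrase (symbol names such as W, Flow, N_det, and the temporal window radius K are ours, not verbatim from the paper), with f_j the feature map of frame j:

```latex
% Sketch of the per-frame FGFA computation (notation paraphrased).
f_{j \to i} = \mathcal{W}\!\left(f_j,\ \mathrm{Flow}(I_i, I_j)\right)
  \quad \text{warp frame } j \text{'s features to frame } i \text{ along the estimated flow}

\bar{f}_i = \sum_{j=i-K}^{i+K} w_{j \to i} \odot f_{j \to i},
  \qquad \sum_{j=i-K}^{i+K} w_{j \to i}(p) = 1 \ \text{at every location } p

y_i = \mathcal{N}_{\mathrm{det}}\!\left(\bar{f}_i\right)
  \quad \text{detection network applied to the aggregated feature}
```

The adaptive weights w_{j→i} are derived from the cosine similarity between an embedding of each warped feature and an embedding of the reference frame's feature, so nearby frames whose warped appearance matches the reference frame contribute more.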
Methods:
Experiments use the ImageNet VID dataset [33], a prevalent large-scale benchmark for video object detection.
- Following the protocols in [18, 23], model training and evaluation are performed on the 3,862 video snippets from the training set and the 555 snippets from the validation set, respectively.
- The snippets are fully annotated, and are at frame rates of 25 or 30 fps in general.
- The 30 object categories are a subset of the categories in the ImageNet DET dataset
Results:
The authors' method significantly improves upon strong single-frame baselines in ImageNet VID [33], especially for more challenging fast moving objects.
- Evaluation on motion groups shows that detecting fast moving objects is very challenging: mAP is 82.4% for slow motion, and it drops to 51.4% for fast motion.
- Method (c) adds the adaptive weighting module to (b).
- It obtains an mAP of 74.3%, 2.3% higher than that of (b).
- The proposed FGFA method improves the overall mAP score by 2.9%, and mAP(fast) by 6.2%, compared to the single-frame baseline (a)
Conclusion:
This work presents an accurate, end-to-end and principled learning framework for video object detection.
- Because the approach focuses on improving feature quality, it would be complementary to existing box-level frameworks for better accuracy on video frames.
- The authors' method could further leverage a better adaptive memory scheme in the aggregation, instead of the attention model used.
- The authors believe these open questions will inspire more future work
Tables
- Table1: Accuracy and runtime of different methods on ImageNet VID validation, using ResNet-50 and ResNet-101 feature extraction networks. The relative gains compared to the single-frame baseline (a) are listed in the subscript
- Table2: Detection accuracy of small (area < 50² pixels), medium (50² ≤ area ≤ 150² pixels), and large (area > 150² pixels) object instances of the single-frame baseline (entry (a)) in Table 1
- Table3: Results of using different numbers of frames in training and inference, using ResNet-50. Default parameters are indicated by *
- Table4: Results of the baseline method and FGFA before and after combination with box-level techniques. For runtime, the entry marked with * uses a CPU implementation of the box-level techniques
Related work
- Object detection from image. State-of-the-art methods for general object detection [10, 30, 26, 5] are mainly based on deep CNNs [22, 36, 40, 14]. In [11], a multi-stage pipeline called Regions with Convolutional Neural Networks (R-CNN) is proposed for training a deep CNN to classify region proposals for object detection. To speed up, ROI pooling is introduced to the feature maps shared on the whole image in SPP-Net [13] and Fast R-CNN [10]. In Faster R-CNN [30], the region proposals are generated by the Region Proposal Network (RPN), and features are shared between RPN and Fast R-CNN. Most recently, R-FCN [5] replaces the ROI pooling operation on intermediate feature maps with a position-sensitive ROI pooling operation on the final score maps, pushing feature sharing to an extreme.
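To make the contrast with ROI pooling concrete, here is a minimal NumPy sketch of position-sensitive ROI pooling as described for R-FCN: the score maps carry one channel group per bin of a k×k RoI grid, each bin average-pools only from its own group, and the bins are then averaged ("voted") into per-class scores. Shapes, names, and the rounding of bin boundaries are illustrative assumptions, not the R-FCN implementation.

```python
# Simplified, single-RoI sketch of position-sensitive RoI pooling (NumPy).
import numpy as np

def ps_roi_pool(score_maps, roi, k, num_classes):
    """
    score_maps: (k*k*num_classes, H, W) position-sensitive score maps
    roi:        (x1, y1, x2, y2) in score-map coordinates
    returns:    (num_classes,) class scores for the RoI
    """
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / k
    bin_h = (y2 - y1) / k
    scores = np.zeros(num_classes, dtype=np.float64)
    for i in range(k):        # bin row
        for j in range(k):    # bin column
            xs, xe = int(np.floor(x1 + j * bin_w)), int(np.ceil(x1 + (j + 1) * bin_w))
            ys, ye = int(np.floor(y1 + i * bin_h)), int(np.ceil(y1 + (i + 1) * bin_h))
            # Channel group dedicated to bin (i, j): one map per class.
            start = (i * k + j) * num_classes
            region = score_maps[start:start + num_classes, ys:ye, xs:xe]
            scores += region.mean(axis=(1, 2))  # average pool inside the bin
    return scores / (k * k)                     # average voting over the bins

# Toy usage: a 7x7 grid, 31 classes (30 + background), 38x50 score maps.
maps = np.random.rand(7 * 7 * 31, 38, 50).astype(np.float32)
print(ps_roi_pool(maps, roi=(5.0, 4.0, 30.0, 25.0), k=7, num_classes=31))
```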
Reference
- N. Ballas, L. Yao, C. Pal, and A. Courville. Delving deeper into convolutional networks for learning video representations. In ICLR, 2016. 3
- T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, 2004. 3
- T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. TPAMI, 2011. 3
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015. 5
- J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, 2016. 1, 2, 3, 5
- J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. arXiv preprint arXiv:1703.06211, 2017. 5
- J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015. 3
- A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In ICCV, 2015. 1, 3, 5
- M. Fayyaz, M. Hajizadeh Saffar, M. Sabokrou, M. Fathy, R. Klette, and F. Huang. Stfcn: Spatio-temporal fcn for semantic video segmentation. In ACCV Workshop, 2016. 3
- R. Girshick. Fast r-cnn. In ICCV, 2015. 1, 2, 3
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1, 2, 3
- W. Han, P. Khorrami, T. Le Paine, P. Ramachandran, M. Babaeizadeh, H. Shi, J. Li, S. Yan, and T. S. Huang. Seq-nms for video object detection. arXiv preprint arXiv:1602.08465, 2016. 1, 2, 7
- K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 1, 2
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 1, 2, 5
- B. K. Horn and B. G. Schunck. Determining optical flow. In Artificial intelligence, 1981. 3
- N. Hyeonseob and H. Bohyung. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
- E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
- K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, and W. Ouyang. T-cnn: Tubelets with convolutional neural networks for object detection from videos. arXiv preprint arXiv:1604.02532, 2016. 1, 2, 5, 7
- K. Kang, W. Ouyang, H. Li, and X. Wang. Object detection from video tubelets with convolutional neural networks. In CVPR, 2016. 1, 2
- A. Kar, N. Rai, K. Sikka, and G. Sharma. Adascan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In CVPR, 2017. 3
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 3
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 1, 2
- B. Lee, E. Erdenee, S. Jin, M. Y. Nam, Y. G. Jung, and P. K. Rhee. Multi-class multi-object tracking using changing point detection. In ECCV, 2016. 1, 2, 5
- Z. Li, E. Gavves, M. Jain, and C. G. Snoek. Videolstm convolves, attends and flows for action recognition. arXiv preprint arXiv:1607.01794, 2016. 3
- W. Lijun, O. Wanli, W. Xiaogang, and L. Huchuan. Visual tracking with fully convolutional networks. In ICCV, 2015. 3
- W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016. 1, 2, 3
- C. Luo, J. Zhan, L. Wang, and Q. Yang. Cosine normalization: Using cosine similarity instead of dot product in neural networks. arXiv preprint arXiv:1702.05870, 2017. 4
- A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. arXiv preprint arXiv:1611.00850, 2016. 3
- E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In CVPR, 2017. 8
- S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015. 1, 2, 3
- J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Epicflow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015. 3
- A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. In EMNLP, 2015. 4
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and F.-F. Li. Imagenet large scale visual recognition challenge. In IJCV, 2015. 1, 2, 5
- S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. In ICLR Workshop, 2016. 3
- M. Siam, S. Valipour, M. Jagersand, and N. Ray. Convolutional gated recurrent networks for video segmentation. arXiv preprint arXiv:1611.05435, 2016. 3
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 1, 2
- N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. In JMLR, 2014. 5
- L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015. 3
- C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inceptionv4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016. 5
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015. 1, 2
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015. 3
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Deep end2end voxel2voxel prediction. In CVPR Workshop, 2016. 3
- J. Weickert, A. Bruhn, T. Brox, and N. Papenberg. A survey on variational optic flow methods for small displacements. In Mathematical models for registration and applications to medical imaging. 2006. 3
- P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. Deepflow: Large displacement optical flow with deep matching. In ICCV, 2013. 3
- J. Yang, H. Shuai, Z. Yu, R. Fan, Q. Ma, Q. Liu, and J. Deng. Efficient object detection from videos. http://image-net.org/challenges/talks/2016/Imagenet2016VID.pptx, 2016. 8
- L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In ICCV, 2015. 3
- J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015. 3
- H. Zhang and N. Wang. On the stability of video detection and tracking. arXiv preprint arXiv:1611.06467, 2016. 8
- X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei. Deep feature flow for video recognition. In CVPR, 2017. 3, 5, 6