Feature Pyramid Networks for Object Detection
CVPR, 2017.
Abstract:
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But pyramid representations have been avoided in recent object detectors that are based on deep convolutional networks, partially because they are slow to compute and memory intensive. In this paper, we exploit the inherent multi-scale…
Introduction
- Recognizing objects at vastly different scales is a fundamental challenge in computer vision.
- Feature pyramids built upon image pyramids form the basis of a standard solution [1] (Fig. 1(a))
- These pyramids are scale-invariant in the sense that an object’s scale change is offset by shifting its level in the pyramid.
- Aside from being capable of representing higher-level semantics, ConvNets are more robust to variance in scale and facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b))
- Even with this robustness, pyramids are still needed to get the most accurate results.
- The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.
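The featurized image pyramid of Fig. 1(a) can be sketched in a few lines: the same feature extractor is run independently on each scale of a resized-image pyramid. This is a minimal, hypothetical illustration (a 3×3 box filter stands in for a real ConvNet, and 2×2 average pooling stands in for image resizing), not the paper's implementation.

```python
import numpy as np

def downsample_2x(img):
    """Halve spatial resolution via 2x2 average pooling (stand-in for image resizing)."""
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def toy_features(img):
    """Stand-in feature extractor: a 3x3 box filter (a real system uses a ConvNet)."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def featurized_image_pyramid(img, levels=3):
    """Fig. 1(a): run the SAME extractor independently at every scale of the pyramid."""
    pyramid = []
    for _ in range(levels):
        pyramid.append(toy_features(img))
        img = downsample_2x(img)
    return pyramid

img = np.arange(64, dtype=float).reshape(8, 8)
feats = featurized_image_pyramid(img, levels=3)
print([f.shape for f in feats])  # [(8, 8), (4, 4), (2, 2)]
```

Every level is "semantically strong" because the full extractor runs at every scale; the cost is that the extractor runs `levels` times per image, which is exactly the compute/memory burden the paper sets out to avoid.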
Highlights
- Recognizing objects at vastly different scales is a fundamental challenge in computer vision
- We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27]
- We find that for bounding box proposals, Feature Pyramid Network significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style Average Precision by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]
- Predictions are made at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architecture (e.g., [19, 36, 16]); in this paper we present results using ResNets [16]
- We note that the parameters of the heads are shared across all feature pyramid levels; we have evaluated the alternative without sharing parameters and observed similar accuracy
- We report overall segment Average Recall (AR) as well as segment AR on small, medium, and large objects, always for 1000 proposals
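The head-sharing point above can be made concrete: one set of head weights is applied at every pyramid level, which is possible because all levels have the same channel count. The sketch below is hypothetical (random weights, 256 channels as in FPN, 4 outputs standing in for box regression), shown as a 1×1-conv-style predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of head weights, shared across ALL pyramid levels.
# Shapes are assumptions: 256 input channels (as in FPN), 4 outputs
# standing in for a box-regression head.
W = rng.standard_normal((256, 4))
b = np.zeros(4)

def head(feature_map):
    """Apply the shared head at every spatial position (1x1-conv-style predictor).

    feature_map: (H, W, 256) -> predictions: (H, W, 4)
    """
    return feature_map @ W + b

# Pyramid levels P2..P5 at decreasing resolution, all with 256 channels.
pyramid = [rng.standard_normal((s, s, 256)) for s in (32, 16, 8, 4)]
preds = [head(p) for p in pyramid]   # the SAME W, b used at every level
print([p.shape for p in preds])      # [(32, 32, 4), (16, 16, 4), (8, 8, 4), (4, 4, 4)]
```

Sharing the weights treats every pyramid level like a different scale of the same input, mirroring how a single head slides over a featurized image pyramid; the paper reports that an unshared variant gives similar accuracy.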
Methods
- A number of recent approaches improve detection and segmentation by using different layers in a ConvNet. FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations.
- Ghiasi et al. [8] present a Laplacian pyramid representation for FCNs to progressively refine segmentation
- Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34], where predictions are made independently at all levels (see Fig. 2).
- For the pyramidal architecture in Fig. 2, image pyramids are still needed to recognize objects across multiple scales [28]
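FPN's alternative to the architectures above is a top-down pathway with lateral connections: start from the coarsest backbone map, repeatedly upsample, and add a 1×1-conv "lateral" of the next finer backbone map. The sketch below is a simplified illustration with random weights and assumed ResNet-like shapes (C2..C5), and it omits the 3×3 smoothing conv the paper applies after each merge.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, W):
    """1x1 convolution = per-pixel channel mixing: (H, W, Cin) @ (Cin, Cout)."""
    return x @ W

def upsample_2x(x):
    """Nearest-neighbor 2x upsampling of the coarser (top) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def build_fpn(backbone_maps, out_channels=256):
    """Top-down pathway with lateral connections, simplified:
    lateral 1x1 convs bring every backbone map to `out_channels`, then the
    coarsest map is upsampled and summed with each finer lateral in turn."""
    laterals = [conv1x1(c, rng.standard_normal((c.shape[-1], out_channels)))
                for c in backbone_maps]
    outputs = [laterals[-1]]              # P5 from C5
    for lat in reversed(laterals[:-1]):   # merge with C4, C3, C2 laterals
        outputs.append(upsample_2x(outputs[-1]) + lat)
    return outputs[::-1]                  # finest first: [P2, P3, P4, P5]

# Hypothetical backbone maps C2..C5 with ResNet-like channel counts.
C = [rng.standard_normal((s, s, ch))
     for s, ch in [(32, 256), (16, 512), (8, 1024), (4, 2048)]]
P = build_fpn(C)
print([p.shape for p in P])  # [(32, 32, 256), (16, 16, 256), (8, 8, 256), (4, 4, 256)]
```

The key property: every output level mixes high-resolution detail (from the lateral) with strong semantics propagated down from the coarsest map, and all levels share one channel count so a single head can run on each.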
Results
- Results are shown in Table 6.
- The authors report overall segment AR as well as segment AR on small, medium, and large objects, always for 1000 proposals.
- The authors' baseline FPN model with a single 5×5 MLP achieves an AR of 43.4.
- Switching to a slightly larger 7×7 MLP leaves accuracy largely unchanged.
- Using both MLPs together increases accuracy to 45.7 AR.
- Doubling the training iterations increases AR to 48.1
Conclusion
- The authors have presented a clean and simple framework for building feature pyramids inside ConvNets.
- The authors' method shows significant improvements over several strong baselines and competition winners.
- It provides a practical solution for research and applications of feature pyramids, without the need of computing image pyramids.
- The authors' study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multiscale problems using pyramid representations
Objectives
- The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales.
Tables
- Table 1: Bounding box proposal results using RPN [29], evaluated on the COCO minival set. All models are trained on trainval35k. The columns “lateral” and “top-down” denote the presence of lateral and top-down connections, respectively. The column “feature” denotes the feature maps on which the heads are attached. All results are based on ResNet-50 and share the same hyper-parameters
- Table 2: Object detection results using Fast R-CNN [11] on a fixed set of proposals (RPN, {Pk}, Table 1(c)), evaluated on the COCO minival set. Models are trained on the trainval35k set. All results are based on ResNet-50 and share the same hyper-parameters
- Table 3: Object detection results using Faster R-CNN [29] evaluated on the COCO minival set. The backbone network for RPN is consistent with Fast R-CNN. Models are trained on the trainval35k set and use ResNet-50. †Provided by authors of [16]
- Table 4: Comparisons of single-model results on the COCO detection benchmark. Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival). †: http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf. ‡: http://mscoco.org/dataset/#detections-leaderboard. §: This entry of AttractioNet [10] adopts VGG-16 for proposals and Wide ResNet [39] for object detection, so is not strictly a single-model result
- Table 5: More object detection results using Faster R-CNN and our FPNs, evaluated on minival. Sharing features increases train time by 1.5× (using 4-step training [29]), but reduces test time
- Table 6: Instance segmentation proposals evaluated on the first 5k COCO val images. All models are trained on the train set. DeepMask, SharpMask, and FPN use ResNet-50 while InstanceFCN uses VGG-16. DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the ‘zoom’ variants). †Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40
Related work
- Hand-engineered features and early neural networks. SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids. These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more. There has also been significant interest in computing featurized image pyramids quickly. Dollar et al [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels. Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.
Contributions
- Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners
- Similar to [29], we find that sharing features improves accuracy by a small margin
References
- E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden. Pyramid methods in image processing. RCA engineer, 1984.
- S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Insideoutside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
- Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, 2016.
- J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
- N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
- P. Dollar, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. TPAMI, 2014.
- P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained partbased models. TPAMI, 2010.
- G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV, 2016.
- S. Gidaris and N. Komodakis. Object detection via a multiregion & semantic segmentation-aware CNN model. In ICCV, 2015.
- S. Gidaris and N. Komodakis. Attend refine repeat: Active box proposal generation via in-out localization. In BMVC, 2016.
- R. Girshick. Fast R-CNN. In ICCV, 2015.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
- K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. arXiv:1703.06870, 2017.
- K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV. 2014.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
- S. Honari, J. Yosinski, P. Vincent, and C. Pal. Recombinator networks: Learning coarse-to-fine feature aggregation. In CVPR, 2016.
- T. Kong, A. Yao, Y. Chen, and F. Sun. Hypernet: Towards accurate region proposal generation and joint object detection. In CVPR, 2016.
- A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. In ECCV, 2016.
- W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. In ICLR workshop, 2016.
- J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
- A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
- P. O. Pinheiro, R. Collobert, and P. Dollar. Learning to segment object candidates. In NIPS, 2015.
- P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollar. Learning to refine object segments. In ECCV, 2016.
- S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. PAMI, 2016.
- O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
- H. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scenes. Technical Report CMU-CS-95-158R, Carnegie Mellon University, 1995.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
- A. Shrivastava, A. Gupta, and R. Girshick. Training regionbased object detectors with online hard example mining. In CVPR, 2016.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
- R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc. on Vision, Image, and Signal Processing, 1994.
- S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
- S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollar. A multipath network for object detection. In BMVC, 2016.