Feature Pyramid Networks for Object Detection

CVPR, 2017.


Abstract:

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But pyramid representations have been avoided in recent object detectors that are based on deep convolutional networks, partially because they are slow to compute and memory intensive. In this paper, we exploit the inherent multi-scale...

Introduction
  • Recognizing objects at vastly different scales is a fundamental challenge in computer vision.
  • Feature pyramids built upon image pyramids form the basis of a standard solution [1] (Fig. 1(a))
  • These pyramids are scale-invariant in the sense that an object’s scale change is offset by shifting its level in the pyramid.
  • Aside from being capable of representing higher-level semantics, ConvNets are more robust to variance in scale and facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b))
  • Even with this robustness, pyramids are still needed to get the most accurate results.
  • The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels
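The idea of featurizing every level of an image pyramid can be sketched in a few lines. The following is an illustrative NumPy toy, not the paper's implementation: `featurize` is a hypothetical stand-in for a real feature extractor, and the pyramid uses crude 2× subsampling rather than proper anti-aliased rescaling.

```python
import numpy as np

def image_pyramid(img, num_levels=4):
    """Build an image pyramid by repeated 2x subsampling (a crude stand-in
    for proper anti-aliased rescaling)."""
    levels = [img]
    for _ in range(num_levels - 1):
        levels.append(levels[-1][::2, ::2])
    return levels

def featurize(img):
    """Hypothetical feature extractor: mean gradient magnitude over 8x8 cells."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    h8, w8 = mag.shape[0] // 8, mag.shape[1] // 8
    return mag[:h8 * 8, :w8 * 8].reshape(h8, 8, w8, 8).mean(axis=(1, 3))

img = np.random.rand(256, 256)
# The SAME feature computation runs at every level, so an object's scale
# change is offset by reading features from a different pyramid level.
feats = [featurize(level) for level in image_pyramid(img)]
print([f.shape for f in feats])  # (32, 32), (16, 16), (8, 8), (4, 4)
```

Every level is "semantically strong" in this scheme because each is produced by the full extractor; the cost is that the extractor runs once per level.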
Highlights
  • Recognizing objects at vastly different scales is a fundamental challenge in computer vision
  • We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27]
  • We find that for bounding box proposals, Feature Pyramid Network significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style Average Precision by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]
  • Predictions are made at multiple levels in a fully convolutional fashion. This process is independent of the backbone convolutional architecture (e.g., [19, 36, 16]); in this paper we present results using ResNets [16]
  • We note that the parameters of the heads are shared across all feature pyramid levels; we have evaluated the alternative without sharing parameters and observed similar accuracy
  • We report segment Average Recall (AR) overall and on small, medium, and large objects, always for 1000 proposals
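The top-down pathway with lateral connections that FPN builds can be sketched at the shape level in plain NumPy. This is an illustration under assumed toy dimensions, not the paper's code: the weights are random, and the 3×3 convolution FPN appends to each merged map (to reduce upsampling aliasing) is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """1x1 convolution as a channel projection: (C_in,H,W), (C_out,C_in) -> (C_out,H,W)."""
    return np.einsum('oc,chw->ohw', w, x)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C,H,W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Hypothetical backbone outputs C2..C5 with 256/512/1024/2048 channels and
# strides 4/8/16/32 on a 64x64 input (names follow the paper's notation).
shapes = [(256, 16, 16), (512, 8, 8), (1024, 4, 4), (2048, 2, 2)]
C = [rng.standard_normal(s) for s in shapes]
d = 256  # every pyramid level gets the same output dimension
laterals = [rng.standard_normal((d, s[0])) * 0.01 for s in shapes]

# Top-down pathway: start from the coarsest level, then repeatedly
# upsample and add the lateral projection of the next finer backbone map.
P = [conv1x1(C[-1], laterals[-1])]
for Ci, Wi in zip(reversed(C[:-1]), reversed(laterals[:-1])):
    P.insert(0, conv1x1(Ci, Wi) + upsample2x(P[0]))
print([p.shape for p in P])  # all levels now have d=256 channels
```

Because every output level has the same channel count, a single head with shared parameters can be attached to all of them, which is why sharing heads across levels works in the first place.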
Methods
  • A number of recent approaches improve detection and segmentation by using different layers in a ConvNet. FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations.
  • Ghiasi et al. [8] present a Laplacian pyramid representation for FCNs to progressively refine segmentation
  • Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34], where predictions are made independently at all levels; see Fig. 2.
  • For the pyramidal architecture in Fig. 2, image pyramids are still needed to recognize objects across multiple scales [28]
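Making predictions independently at every level with a shared head, as featurized pyramids do, can be illustrated as follows. The level shapes and the 2-channel head are hypothetical stand-ins; a real detector head would predict class scores and box regressions per anchor.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 256  # shared channel dimension of all pyramid levels

def head(x, w):
    """A shared fully convolutional head: a 1x1 conv producing per-location scores."""
    return np.einsum('oc,chw->ohw', w, x)

# Hypothetical pyramid levels P2..P5 (all d channels, decreasing resolution).
P = [rng.standard_normal((d, n, n)) for n in (16, 8, 4, 2)]
w = rng.standard_normal((2, d)) * 0.01  # toy head: 2 output channels

# The same parameters are applied at every level, so each scale gets its
# own dense prediction map without needing per-level heads.
scores = [head(p, w) for p in P]
print([s.shape for s in scores])  # (2, 16, 16), (2, 8, 8), (2, 4, 4), (2, 2, 2)
```

This mirrors the observation in the Highlights above: because head parameters are shared across levels, all pyramid levels must behave like levels of a single consistent feature space.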
Results
  • Results are shown in Table 6.
  • The authors report segment AR overall and on small, medium, and large objects, always for 1000 proposals.
  • The authors' baseline FPN model with a single 5×5 MLP achieves an AR of 43.4.
  • Switching to a slightly larger 7×7 MLP leaves accuracy largely unchanged.
  • Using both MLPs together increases accuracy to 45.7 AR.
  • Doubling the training iterations increases AR to 48.1
Conclusion
  • The authors have presented a clean and simple framework for building feature pyramids inside ConvNets.
  • The authors' method shows significant improvements over several strong baselines and competition winners.
  • It provides a practical solution for research and applications of feature pyramids, without the need of computing image pyramids.
  • The authors' study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations
Summary
  • Objectives:

    The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales.
Tables
  • Table1: Bounding box proposal results using RPN [29], evaluated on the COCO minival set. All models are trained on trainval35k. The columns “lateral” and “top-down” denote the presence of lateral and top-down connections, respectively. The column “feature” denotes the feature maps on which the heads are attached. All results are based on ResNet-50 and share the same hyper-parameters
  • Table2: Object detection results using Fast R-CNN [11] on a fixed set of proposals (RPN, {Pk}, Table 1(c)), evaluated on the COCO minival set. Models are trained on the trainval35k set. All results are based on ResNet-50 and share the same hyper-parameters
  • Table3: Object detection results using Faster R-CNN [29] evaluated on the COCO minival set. The backbone networks for RPN are consistent with Fast R-CNN. Models are trained on the trainval35k set and use ResNet-50. †Provided by authors of [16]
  • Table4: Comparisons of single-model results on the COCO detection benchmark. Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival). †: http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf. ‡: http://mscoco.org/dataset/#detections-leaderboard. §: This entry of AttractioNet [10] adopts VGG-16 for proposals and Wide ResNet [39] for object detection, so is not strictly a single-model result
  • Table5: More object detection results using Faster R-CNN and our FPNs, evaluated on minival. Sharing features increases train time by 1.5× (using 4-step training [29]), but reduces test time
  • Table6: Instance segmentation proposals evaluated on the first 5k COCO val images. All models are trained on the train set. DeepMask, SharpMask, and FPN use ResNet-50 while InstanceFCN uses VGG-16. DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the ‘zoom’ variants). †Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40
Related work
  • Hand-engineered features and early neural networks. SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids. These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more. There has also been significant interest in computing featurized image pyramids quickly. Dollár et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels. Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.
Funding
  • Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners
  • Similar to [29], we find that sharing features improves accuracy by a small margin
References
  • [1] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden. Pyramid methods in image processing. RCA Engineer, 1984.
  • [2] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
  • [3] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, 2016.
  • [4] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
  • [5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
  • [6] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. TPAMI, 2014.
  • [7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
  • [8] G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV, 2016.
  • [9] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. In ICCV, 2015.
  • [10] S. Gidaris and N. Komodakis. Attend refine repeat: Active box proposal generation via in-out localization. In BMVC, 2016.
  • [11] R. Girshick. Fast R-CNN. In ICCV, 2015.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [13] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv:1703.06870, 2017.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [17] S. Honari, J. Yosinski, P. Vincent, and C. Pal. Recombinator networks: Learning coarse-to-fine feature aggregation. In CVPR, 2016.
  • [18] T. Kong, A. Yao, Y. Chen, and F. Sun. HyperNet: Towards accurate region proposal generation and joint object detection. In CVPR, 2016.
  • [19] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [20] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  • [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. In ECCV, 2016.
  • [23] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. In ICLR Workshop, 2016.
  • [24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [25] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  • [26] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
  • [27] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
  • [28] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
  • [29] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [30] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. TPAMI, 2016.
  • [31] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [32] H. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scenes. Technical Report CMU-CS-95-158R, Carnegie Mellon University, 1995.
  • [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  • [34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  • [35] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
  • [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [37] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
  • [38] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc. on Vision, Image, and Signal Processing, 1994.
  • [39] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
  • [40] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. In BMVC, 2016.