Learning Region Features for Object Detection

ECCV, pp. 392-406, 2018.

DOI: https://doi.org/10.1007/978-3-030-01258-8_24

Abstract:

While most steps in the modern object detection methods are learnable, the region feature extraction step remains largely hand-crafted, featured by RoI pooling methods. This work proposes a general viewpoint that unifies existing region feature extraction methods and a novel method that is end-to-end learnable. The proposed method removes most heuristic choices and outperforms its RoI pooling counterparts. It moves further towards fully learnable object detection.

Introduction
  • A noteworthy trait in the deep learning era is that many hand-crafted features, algorithm components, and design choices are replaced by their data-driven and learnable counterparts.
  • The second contribution is a learnable module that represents the weights in terms of the RoI and image features.
  • A general formulation is to treat the part feature as the weighted summation of the image features x over all positions p within a support region Ωb: yk(b) = Σp∈Ωb wk(b, p, x) ⊙ x(p); a minimal sketch of this formulation follows this list.
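As a minimal illustration of this formulation (a sketch, not the authors' implementation), the following NumPy snippet computes yk(b) for a single part k; the uniform weights stand in for the learnable wk, and scalar per-position weights are a simplification of the element-wise ⊙:

    import numpy as np

    def region_feature(x, support, weights):
        # x: image feature map, shape (H, W, C)
        # support: positions p forming the support region Omega_b
        # weights: w_k(b, p, x) for each p, here one scalar per position
        # Computes y_k(b) = sum over p in Omega_b of w_k(b, p, x) * x(p).
        y = np.zeros(x.shape[-1])
        for w, (i, j) in zip(weights, support):
            y += w * x[i, j]
        return y

    # Toy usage: uniform weights over a 2x2 support region reproduce average pooling.
    x = np.random.rand(16, 16, 256)
    support = [(4, 4), (4, 5), (5, 4), (5, 5)]
    weights = np.full(len(support), 1.0 / len(support))
    y = region_feature(x, support, weights)  # shape (256,)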
Highlights
  • A noteworthy trait in the deep learning era is that many hand-crafted features, algorithm components, and design choices are replaced by their data-driven and learnable counterparts.
  • Most steps have become learnable in recent years, including image feature generation [6], region proposal [21,5,20], and duplicate removal [11,12].
  • Note that the region recognition step is learning-based in nature.
  • For Faster R-CNN, following the practice in [3,4], the conv4 and conv5 image features are utilized for region proposal generation and object detection, respectively
  • The newly added layers are initialized with Gaussian weights (σ = 0.01), and their learning rates are kept the same as those of the existing layers (see the initialization sketch after this list). In both Faster R-CNN and FPN, to facilitate experiments, separate networks are trained for region proposal generation and object detection, without sharing their features.
  • Region-based object detection should not stop at hand-crafted, binning-based feature extraction; even the deformable RoI pooling variant retains this limitation.
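As a concrete illustration of this initialization, a minimal PyTorch sketch follows; the fully connected layer is a hypothetical stand-in for one of the module's newly added layers:

    import torch.nn as nn

    # Hypothetical newly added layer (sizes are illustrative only).
    fc = nn.Linear(256, 49)

    # Zero-mean Gaussian initialization with sigma = 0.01, as described above;
    # biases start at zero. No per-layer learning-rate multiplier is set, so the
    # layer trains at the same rate as the existing layers.
    nn.init.normal_(fc.weight, mean=0.0, std=0.01)
    nn.init.zeros_(fc.bias)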
Results
  • Because the weight in Eq (4) depends on the bin center, the region features are sensitive to even subtle changes in the position of the RoI.
  • The submodule starts with a regular RoI pooling to extract an initial region feature from the image feature, which is used to regress offsets through an additional learnable fully connected layer (see the sketch after this list).
  • Note that the supporting region Ωb is no longer the RoI as in regular and aligned pooling, but potentially spans the whole image, because the learnt offsets could be arbitrarily large, in principle.
  • For Faster R-CNN, following the practice in [3,4], the conv4 and conv5 image features are utilized for region proposal generation and object detection, respectively.
  • The authors follow the network design in [14] and simply replace RoI pooling with the proposed learnable region feature extraction module.
  • In both Faster R-CNN and FPN, to facilitate experiments, separate networks are trained for region proposal generation and object detection, without sharing their features.
  • For all the following experiments, the method utilizes the sparse sampling implementation with a maximum of 196 sampling positions for both Ωb^In and Ωb^Out.
  • The accuracy is on par with deformable RoI pooling, which exploits appearance features to guide the region feature extraction process.
  • The authors further compare the proposed module with regular, aligned and deformable versions of RoI pooling on stronger detection backbones, where FPN and ResNet-101 are utilized.
  • Previous regular RoI binning methods are clearly limited: they are largely hand-crafted and do not exploit image context well.
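The offset submodule described above can be sketched in a few lines of PyTorch. This is a hedged illustration rather than the authors' code: it assumes torchvision's roi_align for the initial pooling step, and the class name, channel count, and bin count are invented.

    import torch
    import torch.nn as nn
    from torchvision.ops import roi_align

    class OffsetRegressor(nn.Module):
        # Regresses per-bin (dx, dy) offsets from an initial pooled region feature.
        def __init__(self, channels=256, bins=7):
            super().__init__()
            self.bins = bins
            self.fc = nn.Linear(channels * bins * bins, bins * bins * 2)
            # Zero-init is a common choice (an assumption here): training then
            # starts from plain aligned pooling with no deformation.
            nn.init.zeros_(self.fc.weight)
            nn.init.zeros_(self.fc.bias)

        def forward(self, feat, rois):
            # feat: (N, C, H, W) image feature; rois: (R, 5) rows of
            # (batch_index, x1, y1, x2, y2) in feature-map coordinates.
            pooled = roi_align(feat, rois, output_size=self.bins)  # (R, C, bins, bins)
            offsets = self.fc(pooled.flatten(1))                   # (R, bins*bins*2)
            # The offsets would then shift each bin's sampling positions,
            # potentially moving the support region beyond the RoI itself.
            return offsets.view(-1, self.bins, self.bins, 2)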
Conclusion
  • Region-based object detection should not stop at hand-crafted, binning-based feature extraction; even the deformable RoI pooling variant retains this limitation.
  • Quantitative analysis: for each part k, the weights wk(·) are treated as a probability distribution over all positions in the supporting region Ω.
  • For each ground-truth object RoI, the weights from all parts are aggregated by taking the maximum value at each position, resulting in a “max pooled weight map” (sketched below).
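A short NumPy sketch of this aggregation, assuming the K per-part weight maps over the supporting region have already been computed; names and sizes are illustrative:

    import numpy as np

    # w: per-part weight maps over the supporting region, shape (K, H, W).
    K, H, W = 49, 32, 32
    w = np.random.rand(K, H, W)
    # Normalize each part's weights into a probability distribution over positions.
    w /= w.sum(axis=(1, 2), keepdims=True)

    # "Max pooled weight map": at each position, keep the maximum over all parts.
    max_pooled = w.max(axis=0)  # shape (H, W)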
Tables
  • Table 1: Top: description and typical values of the main variables. Bottom: computational complexity of the proposed method. †Using the default maximum sample numbers as in Eq (10) and (11), the average actual sample number is about 200; see also Table 3. *We decompose W_k^box as W_k^box = Ŵ_k^box · V^box, so the total computational cost is the sum of two matrix multiplications: V^box · E^box (whose result is denoted Ê^box) and Ŵ_k^box · Ê^box; the numerical check after this list illustrates the saving. See Section 3 for details.
  • Table 2: Comparison of three region feature extraction methods using different support regions. Accuracies are reported on the COCO detection minival set. *It is not clear how to exploit the whole image for the regular and aligned RoI pooling methods; hence the corresponding accuracy numbers are omitted.
  • Table 3: Detection accuracy and computation time of the efficient method using different numbers of sample points. The average sample counts |Ωb^Out|avg and |Ωb^In|avg are measured on the COCO minival set using 300 ResNet-50 RPN proposals. The bold row (|Ωb^Out|max = 196, |Ωb^In|max = 142) is used as our default maximum sample numbers. *full indicates that all image positions are used without any sampling.
  • Table 4: Effect of the geometric and appearance terms in Eq (7) for the proposed region feature extraction module. Detection accuracies are reported on the COCO minival set.
  • Table 5: Comparison of different algorithms using different backbones. Accuracies on COCO test-dev are reported.
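To make the footnoted decomposition in the Table 1 caption concrete, the following NumPy check (with invented dimensions) verifies that factoring W_k^box = Ŵ_k^box · V^box and multiplying right-to-left yields the same result while changing the arithmetic cost:

    import numpy as np

    d_out, d_mid, d_in, n = 49, 64, 256, 200   # illustrative sizes only
    W_hat = np.random.rand(d_out, d_mid)       # W-hat_k^box
    V = np.random.rand(d_mid, d_in)            # V^box
    E = np.random.rand(d_in, n)                # E^box (embedded features)

    direct = (W_hat @ V) @ E    # materializes the full W_k^box first
    factored = W_hat @ (V @ E)  # computes E-hat^box = V^box · E^box first
    assert np.allclose(direct, factored)

    # Multiply-accumulate counts for each order; the factored order wins
    # whenever d_mid is small relative to the other dimensions.
    cost_direct = d_out * d_mid * d_in + d_out * d_in * n
    cost_factored = d_mid * d_in * n + d_out * d_mid * n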
Related work
  • Besides the RoI pooling methods reviewed above, there are more region feature extraction methods that can be thought of as specializations of Eq (2) or its more general extension.

    Region Feature Extraction in One-stage Object Detection [17,19,15] As opposed to the two-stage or region based object detection paradigm, another paradigm is one-stage or dense sliding window based. Because the number of windows (regions) is huge, each region feature is simply set as the image feature on the region’s center point, which can be specialized from Eq (2) as K = 1, Ωb = {center(b)}. This is much faster but less accurate than RoI pooling methods.
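Stated as code, this specialization reduces the generic weighted sum to a single lookup. A NumPy sketch with hypothetical names:

    import numpy as np

    def center_feature(x, box):
        # Specialization of Eq (2) with K = 1 and Omega_b = {center(b)}:
        # the region feature is the image feature at the box center.
        x1, y1, x2, y2 = box
        ci = int(round((y1 + y2) / 2))  # center row
        cj = int(round((x1 + x2) / 2))  # center column
        return x[ci, cj]

    x = np.random.rand(16, 16, 256)
    y = center_feature(x, (3.0, 3.0, 9.0, 7.0))  # shape (256,)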

    Pooling using Non-grid Bins [1,23] These methods are similar to regular pooling but change the definition of Rbk in Eq (3) to be non-grid. For example, MaskLab [1] uses triangle-shaped bins rather than rectangular ones, which gives a better balance between encoding center-close and center-distant subregions. In Interpretable R-CNN [23], the non-grid bins are generated from the grammar defined by an AND-OR graph model.
Funding
  • Liwei Wang was partially supported by National Basic Research Program of China (973 Program) (grant no. 2015CB352502), NSFC (61573026), BJNSF (L172037), and a grant from Microsoft Research Asia
References
  • [1] Chen, L.C., Hermans, A., Papandreou, G., Schroff, F., Wang, P., Adam, H.: MaskLab: Instance segmentation by refining object detection with semantic and direction features. In: CVPR (2018)
  • [2] Dai, J., He, K., Sun, J.: Instance-aware semantic segmentation via multi-task network cascades. In: CVPR (2016)
  • [3] Dai, J., Li, Y., He, K., Sun, J.: R-FCN: Object detection via region-based fully convolutional networks. In: NIPS (2016)
  • [4] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV (2017)
  • [5] Erhan, D., Szegedy, C., Toshev, A., Anguelov, D.: Scalable object detection using deep neural networks. In: CVPR (2014)
  • [6] Girshick, R.: Fast R-CNN. In: ICCV (2015)
  • [7] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
  • [8] He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
  • [9] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: ECCV (2014)
  • [10] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • [11] Hosang, J., Benenson, R., Schiele, B.: Learning non-maximum suppression. In: ICCV (2017)
  • [12] Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: CVPR (2018)
  • [13] Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., Sun, J.: Light-head R-CNN: In defense of two-stage object detector. In: CVPR (2018)
  • [14] Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)
  • [15] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: ICCV (2017)
  • [16] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
  • [17] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.: SSD: Single shot multibox detector. In: ECCV (2016)
  • [18] Mordan, T., Thome, N., Cord, M., Henaff, G.: Deformable part-based fully convolutional network for object detection. arXiv preprint arXiv:1707.06175 (2017)
  • [19] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR (2016)
  • [20] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS (2015)
  • [21] Szegedy, C., Reed, S., Erhan, D., Anguelov, D.: Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441v2 (2014)
  • [22] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
  • [23] Wu, T., Li, X., Song, X., Sun, W., Dong, L., Li, B.: Interpretable R-CNN. arXiv preprint arXiv:1711.05226 (2017)