RepPoints V2: Verification Meets Regression for Object Detection

NIPS 2020, 2020.

We propose RepPoints v2, which enhances the original regression-based RepPoints by fusing verification tasks in various ways

Abstract:

Verification and regression are two general methodologies for prediction in neural networks. Each has its own strengths: verification can be easier to infer accurately, and regression is more efficient and applicable to continuous target variables. Hence, it is often beneficial to carefully combine them to take advantage of their benefi...
Introduction
  • Two common methodologies for neural network prediction are verification and regression.
  • To take advantage of the strengths of both, earlier object localization methods [7, 18, 16] combined verification and regression: they first performed coarse localization by verifying several anchor box hypotheses, then refined the localization by regressing box offsets.
  • This combination approach was shown to be effective and led to state-of-the-art performance at the time.
  • Recent methods based purely on regression, which directly regress the object extent from each feature map position [30, 27, 32], can perform competitively or even better, as seen when comparing a representative regression method, RepPoints, to RetinaNet [16].
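The verify-then-regress pipeline described above can be sketched as follows. This is a minimal illustration, not code from the paper: the function names, the score threshold, and the (dx, dy, dw, dh) offset parameterization (the one commonly used by anchor-based detectors) are assumptions made for the example.

```python
import math

def decode_box(anchor, deltas):
    """Apply regressed offsets (dx, dy, dw, dh) to an anchor (x1, y1, x2, y2).

    Assumed parameterization: center shifts are scaled by the anchor size,
    width/height changes are predicted in log space.
    """
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + 0.5 * aw, ay1 + 0.5 * ah
    dx, dy, dw, dh = deltas
    cx, cy = acx + dx * aw, acy + dy * ah
    w, h = aw * math.exp(dw), ah * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

def detect(anchors, scores, deltas, score_thresh=0.5):
    """Verification keeps anchors whose score passes the threshold;
    regression then refines each surviving anchor."""
    return [decode_box(a, d)
            for a, s, d in zip(anchors, scores, deltas)
            if s >= score_thresh]

# Hypothetical inputs: two anchors, only the first is verified as an object.
anchors = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 40.0, 40.0)]
scores = [0.9, 0.2]
deltas = [(0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
boxes = detect(anchors, scores, deltas)
```

Here the first anchor survives verification and is shifted right by 10% of its width; the second is discarded before any regression is applied.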
Highlights
  • Two common methodologies for neural network prediction are verification and regression
  • We propose to model verification tasks by auxiliary side-branches that are added to the major regression branch at only the feature level and result level, without affecting intermediate representations
  • Corner points, e.g. the top-left corner and bottom-right corner, can determine the spatial extent of a bounding box, providing an alternative to the usual 4-d descriptor consisting of the box’s center point and size. This representation has been used in several bottom-up object detection methods [13, 35, 28], which in general perform worse than other kinds of detectors in classification but are significantly better in object localization, as seen in Table 1. We show that this verification task can complement regression based methods to obtain more accurate object localization
  • We propose to model verification tasks by auxiliary side-branches that are fused with the major regression branch in a manner that does not affect its intermediate representations, as illustrated in Figure 1
  • We propose RepPoints v2, which enhances the original regression-based RepPoints by fusing verification tasks in various ways
  • We apply this philosophy to improve state-of-the-art object detection, specifically RepPoints
  • A new variant of RepPoints is proposed to increase the compatibility with the auxiliary verification tasks
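The equivalence between the two box descriptors mentioned in the highlights (a corner pair versus center point plus size) can be made concrete with a short sketch. The function names are illustrative and not taken from the paper.

```python
def corners_to_center_size(x1, y1, x2, y2):
    """Convert (top-left, bottom-right) corners to (cx, cy, w, h)."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1)

def center_size_to_corners(cx, cy, w, h):
    """Convert (cx, cy, w, h) back to (x1, y1, x2, y2)."""
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

box = (10.0, 20.0, 50.0, 80.0)
as_center = corners_to_center_size(*box)
round_trip = center_size_to_corners(*as_center)
```

Both descriptors carry the same four degrees of freedom. The difference that matters for verification is that each corner is a 2-d quantity, so it can be checked on a feature map position by position.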
Methods
  • Comparison table (numeric values not recovered in this extract): columns are Method, Methodology, Backbone, AP, AP50, AP60, AP70, AP80, AP90. Recoverable rows: CornerNet [13] (verification) and RepPoints v2 (verification + regression).
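The AP50 to AP90 columns follow the usual COCO-style convention in which a detection counts as correct at threshold t only if its intersection-over-union (IoU) with the ground-truth box is at least t, so the higher columns reward precise localization. The sketch below is a small illustration with hypothetical helper names; it computes IoU for one box pair and lists the thresholds it passes, and it is not a full AP evaluation.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# IoU thresholds behind the AP50 ... AP90 columns.
THRESHOLDS = (0.5, 0.6, 0.7, 0.8, 0.9)

def thresholds_passed(pred, gt):
    """Return the IoU thresholds at which `pred` matches `gt`."""
    value = iou(pred, gt)
    return [t for t in THRESHOLDS if value >= t]

gt_box = (0.0, 0.0, 10.0, 10.0)
pred_box = (0.0, 0.0, 10.0, 7.5)   # IoU with gt_box is 0.75
passed = thresholds_passed(pred_box, gt_box)
```

A prediction with IoU 0.75 contributes to AP50, AP60 and AP70 but counts as a miss for AP80 and AP90.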
Results
  • Performance is improved by 1.0 mAP; further applying joint inference adds another 0.3 mAP
  • This demonstrates the flexibility of the proposed method.
  • Figure 3 shows a comparison of object detection results on COCO 2017 [17] between RepPoints v1 [30] and RepPoints v2.
  • Both methods adopt a ResNet-50 backbone and the 1x schedule.
  • As can be seen, compared to RepPoints v1, RepPoints v2 provides more precise localization results
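This extract does not spell out how joint inference combines the two branches. One plausible reading, sketched here purely as an assumption, is that each regressed box corner is snapped to the highest-scoring location of a corner verification heatmap within a small neighborhood. The function name, the radius parameter, and the heatmap layout (rows indexed by y, columns by x) are all hypothetical.

```python
def refine_corner(corner, heatmap, radius=1):
    """Snap an (x, y) corner to the best-scoring heatmap cell within `radius`.

    `heatmap[y][x]` is assumed to hold a corner verification score.
    """
    x0, y0 = int(round(corner[0])), int(round(corner[1]))
    height, width = len(heatmap), len(heatmap[0])
    best, best_score = (x0, y0), float("-inf")
    for y in range(max(0, y0 - radius), min(height, y0 + radius + 1)):
        for x in range(max(0, x0 - radius), min(width, x0 + radius + 1)):
            if heatmap[y][x] > best_score:
                best, best_score = (x, y), heatmap[y][x]
    return best

# Hypothetical 4x4 heatmap with a clear corner response at x=2, y=1.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.9, 0.0],
    [0.0, 0.1, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
refined = refine_corner((1.2, 1.4), heatmap, radius=1)
```

The regressed corner near (1, 1) moves to (2, 1), where the verification score peaks. Keeping the search radius small means regression still decides roughly where the box is, while verification only adjusts the final position.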
Conclusion
  • The authors propose RepPoints v2, which enhances the original regression-based RepPoints by fusing verification tasks in various ways.
  • The resulting object detector shows consistent improvements over the original RepPoints under different backbones and training approaches.
  • It achieves 52.1 mAP on COCO test-dev.
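  • FCOS comparison table (numeric values not recovered in this extract): columns are Backbone, AP, AP50, AP75, AP90, APS, APM, APL.
  • Dense RepPoints with and without the verification module:

    Method           Backbone   APmask  AP50  AP75  APS   APM   APL
    Dense RepPoints  ResNet-50  37.6    60.4  40.2  20.9  40.5  48.6
    +verification    ResNet-50  38.9    61.5  41.9  21.2  42.0  51.1
  • Broader Impact: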
  • This approach could be transferred to other detectors and the instance segmentation domain, boosting the performance of the base detector/segmenter by a considerable margin.
Tables
  • Table 1: Analysis of the performance on COCO val set among different methods. “RepPoints*” indicates our improved re-implementation of RepPoints
  • Table 2: Performance of the explicit-corners variant of RepPoints
  • Table 3: Ablations on two forms of verification
  • Table 4: Ablations on three types of fusion
  • Table 5: Experiments on RepPoints baselines with stronger backbones using 2× settings (24 epochs) and multi-scale training ([480, 960]) on COCO val set
  • Table 6: Comparison of RepPoints v2 to state-of-the-art detectors on COCO test-dev. * denotes that the number is obtained by multi-scale testing
  • Table 7: Applying the verification module to FCOS, which is implemented in mmdetection
  • Table 8: Adding the verification module to the instance segmentation algorithm Dense RepPoints on COCO test-dev
Related work
  • Verification based object detection. Early deep learning based object detection approaches [26, 24] adopt a multi-scale sliding window mechanism to verify whether each window is an object or not. Corner/extreme point based verification has also been proposed [28, 13, 35, 4], where the verification of a 4-d hypothesis is factorized into sub-problems of verifying 2-d corners, such that the hypothesis space is more completely covered. A sub-pixel offset branch is typically included in these methods to predict continuous corner coordinates through regression. However, since this mainly deals with quantization error due to the lower resolution of the feature map compared to the input image, we treat these methods as purely verification based in our taxonomy.

    Regression based object detection. Achieving object detection by pure regression dates back to YOLO [20] and DenseBox [10], where four box borders are regressed at each feature map position. Though attractive for their simplicity, their accuracy is often limited by the large displacements of regression targets, the issue of multiple objects located within a feature map bin, and extremely imbalanced positive and negative samples. Recently, after these issues were alleviated by a feature pyramid network (FPN) [15] structure along with a focal loss [16], regression-based object detection has regained attention [27, 12, 34, 30], with performance on par with or even better than verification or hybrid methods. Our work advances this direction by incorporating verification methodology into regression based detectors without disrupting their flow, largely maintaining the convenience of the original detectors. We mainly base our study on the RepPoints detector, but the method can be generally applied to other regression based detectors.
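The quantization error that the sub-pixel offset branch compensates for, as described for corner-based methods above, can be illustrated with a small sketch. The stride value and function names are assumptions chosen for the example.

```python
import math

def quantize(x, stride):
    """Map an image coordinate to a feature-map cell plus a sub-cell offset."""
    scaled = x / stride
    cell = math.floor(scaled)
    return cell, scaled - cell

def recover(cell, offset, stride):
    """Map a feature-map cell (and optional offset) back to image coordinates."""
    return (cell + offset) * stride

stride = 8                      # assumed feature-map downsampling factor
x = 37.0                        # true corner coordinate in the image
cell, offset = quantize(x, stride)
without_offset = recover(cell, 0.0, stride)
with_offset = recover(cell, offset, stride)
```

Without the offset, the corner is recovered at 32.0 instead of 37.0, an error of up to one stride. Regressing the fractional offset removes that error, which is why the passage treats this branch as a correction to verification rather than as a regression method in its own right.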
References
  • [1] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, pages 6154–6162, 2018.
  • [2] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
  • [3] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017.
  • [4] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In ICCV, 2019.
  • [5] Ross Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
  • [6] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
  • [7] Ross B. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
  • [8] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
  • [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [10] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. DenseBox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
  • [11] Wei Ke, Tianliang Zhang, Zeyi Huang, Qixiang Ye, Jianzhuang Liu, and Dong Huang. Multiple anchor learning for visual object detection. In CVPR, pages 7363–7372, 2020.
  • [12] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, and Jianbo Shi. FoveaBox: Beyond anchor-based object detector. arXiv preprint arXiv:1904.03797, 2019.
  • [13] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, pages
  • [14] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
  • [15] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In ICCV, pages 2117–2125, 2017.
  • [16] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
  • [17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755.
  • [18] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In ECCV, pages 21–37.
  • [19] Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, and Junjie Yan. Grid R-CNN. In CVPR, pages 7363–7372, 2019.
  • [20] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
  • [21] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, pages 91–99, 2015.
  • [22] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian D. Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, pages 658–666, 2019.
  • [23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [24] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  • [25] Guanglu Song, Yu Liu, and Xiaogang Wang. Revisiting the sibling head in object detector. In CVPR, 2020.
  • [26] Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks for object detection. In NeurIPS, pages 2553–2561, 2013.
  • [27] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
  • [28] Lachlan Tychsen-Smith and Lars Petersson. DeNet: Scalable real-time object detection with directed sparse sampling. In ICCV, pages 428–436, 2017.
  • [29] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 5987–5995, 2017.
  • [30] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen Lin. RepPoints: Point set representation for object detection. In ICCV, pages 9656–9665, 2019.
  • [31] Ze Yang, Yinghao Xu, Han Xue, Zheng Zhang, Raquel Urtasun, Liwei Wang, Stephen Lin, and Han Hu. Dense RepPoints: Representing visual objects with dense point sets. arXiv preprint arXiv:1912.11473, 2019.
  • [32] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z. Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. arXiv preprint arXiv:1912.02424, 2019.
  • [33] Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and Qixiang Ye. FreeAnchor: Learning to match anchors for visual object detection. In NeurIPS, pages 147–155, 2019.
  • [34] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [35] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krähenbühl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.
  • [36] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In CVPR, pages 9308–9316, 2019.