SaccadeNet: A Fast and Accurate Object Detector

CVPR, pp. 10394-10403, 2020.

Keywords:
visual system, certain object, object location, stage object detection, informative object keypoint

Abstract:

Object detection is an essential step towards holistic scene understanding. Most existing object detection algorithms attend to certain object areas once and then predict the object locations. However, neuroscientists have revealed that humans do not look at the scene in fixed steadiness. Instead, human eyes move around, locating inform...

Introduction
  • As the first gate to perceive the physical world, the visual system glances at a scene and immediately understands what objects are there and where they are.
  • A fast and accurate object detector is essential: it allows machines to perceive the physical world efficiently and effectively, and it unlocks subsequent processes such as understanding the holistic scene and interacting within it.
  • The time-consuming region proposal stage is a bottleneck for inference speed
Highlights
  • The human visual system is accurate and fast
  • Anchor-based methods [24, 23, 16, 18, 7] pre-define a large number of anchor locations and either directly regress object bounding-box locations, or generate region proposals from the anchors and decide whether each region contains a certain object category
  • We propose a fast and accurate object detector, named SaccadeNet, which effectively attends to informative object keypoints and predicts object locations from coarse to fine (a decoding sketch follows this list)
  • Our model actively attends to informative object keypoints from the center to the corners, and predicts the object bounding boxes from coarse to fine
  • We extensively evaluate SaccadeNet on the PASCAL VOC and MS COCO datasets, which demonstrates both its effectiveness and its efficiency
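To make the coarse-to-fine decoding concrete, the sketch below shows one way it could look at inference time. This is a minimal illustration, not the authors' code: the tensor layouts, the corner-offset parameterization, and the final averaging step are assumptions (in the paper, an Aggregation Attentive Module learns the refinement from features sampled at the center and corner keypoints).

```python
import torch

def decode_coarse_to_fine(heat, wh, corner_off, k=100):
    """Hedged sketch of SaccadeNet-style coarse-to-fine box decoding.

    heat:       (num_classes, H, W) center heatmap -> coarse "where"
    wh:         (2, H, W) width/height regressed at each center -> coarse box
    corner_off: (4, H, W) assumed offsets from a center to its left/top/
                right/bottom boundaries -> the corner-informed, fine box
    Peak-picking NMS on `heat` is omitted here (see the Table 3 sketch).
    """
    num_cls, H, W = heat.shape
    scores, idx = heat.flatten().topk(k)        # top-k coarse center candidates
    cls = idx // (H * W)                        # class of each candidate
    ys = (idx % (H * W)) // W                   # center row on the feature grid
    xs = idx % W                                # center column
    w, h = wh[0, ys, xs], wh[1, ys, xs]
    coarse = torch.stack([xs - w / 2, ys - h / 2,
                          xs + w / 2, ys + h / 2], dim=1)  # coarse (x1,y1,x2,y2)
    l, t, r, b = corner_off[:, ys, xs]          # attend from center to corners
    fine = torch.stack([xs - l, ys - t, xs + r, ys + b], dim=1)
    # Placeholder fusion; the paper's Aggregation Attentive Module learns
    # this refinement rather than averaging the two estimates.
    return (coarse + fine) / 2, scores, cls

boxes, scores, cls = decode_coarse_to_fine(
    torch.rand(80, 128, 128), torch.rand(2, 128, 128), torch.rand(4, 128, 128))
```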
Methods
  • The experiments are conducted on two datasets, PASCAL VOC 2012 [6] and MS COCO [17]. The MS COCO dataset contains 80 categories, with 118k images for training (train2017) and 5k images for validation (val2017).
  • Pascal VOC consists of 20 categories, with a training set of 17k images and a validation set of 5k images.
  • This setting is the same as previous work [13, 5, 8, 29].
  • For a fair comparison and to illustrate the effectiveness of SaccadeNet, the authors keep all backbone settings the same as in [29]
Results
  • On the MS COCO dataset, the authors achieve 40.4% mAP at 28 FPS and 30.5% mAP at 118 FPS.
  • SaccadeNet-Res18 is the first real-time anchor-free detector to achieve more than 30% mAP on MS COCO val2017 at over 100 FPS
Conclusion
  • The authors introduce SaccadeNet, a fast and accurate object detection algorithm.
  • The authors' model actively attends to informative object keypoints from the center to the corners, and predicts the object bounding boxes from coarse to fine.
  • SaccadeNet runs extremely fast because these object keypoints are predicted jointly, so no grouping algorithm is needed to combine them (a short grouping-free decoding sketch follows this list).
  • The authors extensively evaluate SaccadeNet on the PASCAL VOC and MS COCO datasets, which demonstrates both its effectiveness and its efficiency
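To illustrate the grouping-free claim, the snippet below contrasts the extra matching step that corner-pairing detectors such as CornerNet [13] need with the single gather that suffices when every keypoint of a box is predicted jointly at one grid cell. The shapes and the embedding-distance cost are illustrative assumptions.

```python
import torch

k = 100
# Corner-pairing detectors first detect top-left and bottom-right corners
# independently, then group them, typically by pairwise embedding distance:
tl_emb, br_emb = torch.rand(k, 1), torch.rand(k, 1)
pair_cost = (tl_emb - br_emb.T).abs()        # (k, k) grouping cost to resolve

# With jointly predicted keypoints, one gather per detection yields a box:
offsets = torch.rand(4, 128, 128)            # per-pixel (l, t, r, b), assumed layout
ys, xs = torch.randint(0, 128, (k,)), torch.randint(0, 128, (k,))
l, t, r, b = offsets[:, ys, xs]              # O(k) gather, no grouping pass
boxes = torch.stack([xs - l, ys - t, xs + r, ys + b], dim=1)
```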
Tables
  • Table1: The experiments are conducted on MS COCO test-dev. SaccadeNet-DLA outperforms CenterNet-DLA by 1.2% mAP with little overhead. This is the first detector that achieves more than 40% mAP on MS COCO test-dev at more than 25 FPS. SaccadeNet-Res18 outperforms CenterNet-Res18 by 2.4% mAP with small overhead. We show naive/flip testing results of CenterNet and SaccadeNet. A dash indicates that the method does not report the result. ∗ means the experiments are conducted on MS COCO val2017
  • Table2: This table shows the results of SaccadeNet with or without Aggregation-Attn and Corner-Attn. We use 6 metrics of different IoU thresholds and object sizes. All experiments are conducted on Pascal VOC. For our approaches, we show both the mAP and the mAP gain (+) or loss (-) compared with the baseline
  • Table3: All experiments are conducted on MS COCO val2017. PP and IoU represent peak-picking NMS and IoU-based NMS, respectively (a peak-picking sketch follows this list)
  • Table4: This table shows the results of using different points for Corner-Attn on PASCAL VOC with ResNet-18. Corner represents the original SaccadeNet-Res18. Diag Pts@t (t is a float in [0, 1]) represents points located at p_ct · (1 − t) + p_cr · t, where p_ct and p_cr denote the positions of the center and the corners. Similarly, Midedge Pts@t represents a point located at p_ct · (1 − t) + p_ml · t, where p_ct and p_ml denote the center point and the midpoint of an edge of the object bounding box (a snippet computing these interpolated points follows this list)
  • Table5: The table shows the results of applying iterative refinement on SaccadeNet with different IoU thresholds. All the experiments are based on ResNet-18 on PASCAL VOC. Num of iter means the number of iterations used for boundary refinement
  • Table6: This table shows the results of using Aggregation-AttnCls for classification with different IoU thresholds. All experiments are performed on Pascal VOC with ResNet-18
  • Table7: This table shows the results of using different inputs for Aggregation-Attn with different IoU thresholds. All experiments are performed on Pascal VOC with ResNet-18
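For the "PP" column of Table 3, the sketch below shows the standard peak-picking trick used by center-heatmap detectors (popularized by CenterNet [29]): a max-pool keeps a heatmap score only where it is a local maximum. Whether SaccadeNet implements PP exactly this way is an assumption. IoU-based NMS, by contrast, sorts the decoded boxes and suppresses any box overlapping a higher-scoring one beyond a threshold.

```python
import torch
import torch.nn.functional as F

def peak_picking_nms(heat, kernel=3):
    """Keep only the scores that are local maxima in a kernel x kernel window."""
    pad = (kernel - 1) // 2
    local_max = F.max_pool2d(heat, kernel, stride=1, padding=pad)
    return heat * (local_max == heat).float()

heat = torch.rand(1, 80, 128, 128)      # (batch, classes, H, W) center heatmap
suppressed = peak_picking_nms(heat)     # non-peak scores are zeroed out
```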
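The Diag Pts@t and Midedge Pts@t notation in Table 4 is a plain linear interpolation between the center and a second keypoint; the snippet below spells it out (the concrete coordinates are illustrative).

```python
import torch

def interp_points(p_ct, p_other, t):
    """Points at p_ct * (1 - t) + p_other * t, as in Table 4.

    With p_other = a corner this is Diag Pts@t (t = 1.0 recovers the corner
    itself); with p_other = an edge midpoint it is Midedge Pts@t.
    """
    return p_ct * (1.0 - t) + p_other * t

p_ct = torch.tensor([[50.0, 50.0]])          # (N, 2) box centers
p_cr = torch.tensor([[10.0, 10.0]])          # (N, 2) corresponding corners
print(interp_points(p_ct, p_cr, 0.5))        # halfway down the diagonal: [[30., 30.]]
```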
Related work
  • Modern object detectors can be roughly divided into two categories: anchor-based object detectors and anchor-free object detectors.

    2.1. Anchor-based Detectors

    After the seminal work of Faster R-CNN [24], anchors have been widely used in modern detectors. These detectors usually contain two stages. The first-stage module is a region proposal network (RPN), which estimates the objectness probability of every anchor and regresses the offsets between object boundaries and anchors. The second stage is R-CNN, which predicts the category probability and refines the bounding-box boundaries.
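To make "anchors" concrete, here is a hedged sketch of the dense reference-box grid that an RPN scores and regresses; the scales and aspect ratios mirror common Faster R-CNN defaults but are illustrative, not necessarily the exact values of [24].

```python
import torch

def make_anchors(feat_h, feat_w, stride, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * len(scales) * len(ratios), 4) anchor boxes."""
    shifts_x = (torch.arange(feat_w) + 0.5) * stride        # cell centers in pixels
    shifts_y = (torch.arange(feat_h) + 0.5) * stride
    cy, cx = torch.meshgrid(shifts_y, shifts_x, indexing="ij")
    centers = torch.stack([cx, cy], dim=-1).reshape(-1, 2)  # (H*W, 2) as (x, y)
    sizes = torch.tensor([(s * r ** 0.5, s / r ** 0.5)      # area-preserving (w, h)
                          for s in scales for r in ratios])
    boxes = torch.cat([centers[:, None] - sizes[None] / 2,  # (x1, y1)
                       centers[:, None] + sizes[None] / 2], # (x2, y2)
                      dim=-1)
    return boxes.reshape(-1, 4)

anchors = make_anchors(50, 50, stride=16)   # 50 * 50 * 9 = 22500 anchors
```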

    Recently, anchor-based one-stage approaches [23, 16, 18, 7] have drawn much attention in object detection because their architectures are simpler and they usually run faster [23]. They remove the RPN and directly predict the categories and regress the boxes of candidate anchors. However, the performance of anchor-based one-stage detectors is usually lower than that of multi-stage detectors, due to the extreme imbalance between positive and negative anchors during training.
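The standard remedy for this imbalance is the focal loss of [16], which down-weights the loss on easy, overwhelmingly negative anchors so that the rare positives dominate the gradient. A minimal sigmoid focal loss, following the published formulation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), per [16]."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)        # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()

logits = torch.randn(8, 1000)                # scores for 1000 anchors per image
targets = torch.zeros(8, 1000)
targets[:, :5] = 1.0                         # only a handful of positive anchors
loss = focal_loss(logits, targets)
```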
Funding
  • Gang Hua was supported in part by National Key R&D Program of China Grant 2018AAA0101400 and NSFC Grant 61629301
References
  • Jiale Cao, Yanwei Pang, Jungong Han, and Xuelong Li. Hierarchical shot detector. In Proceedings of the IEEE International Conference on Computer Vision, pages 9705–9714, 2019. 1, 6
  • Yuntao Chen, Chenxia Han, Naiyan Wang, and Zhaoxiang Zhang. Revisiting feature alignment for one-stage object detection. arXiv preprint arXiv:1908.01570, 2019. 2
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • Heiner Deubel and Werner X Schneider. Saccade target selection and object recognition: Evidence for a common attentional mechanism. Vision Research, 36(12):1827–1837, 1996. 2, 3
  • Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6569–6578, 2019. 1, 2, 3, 4
  • Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303–338, 2010. 4
  • Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017. 1, 2
  • Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017. 4, 5, 6
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 2, 4, 5
  • Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. DenseBox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015. 2
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5
  • Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, and Jianbo Shi. FoveaBox: Beyond anchor-based object detector. arXiv preprint arXiv:1904.03797, 2019. 2
  • Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018. 1, 2, 3, 4, 6
  • Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. In ICCV, 2019. 6
  • Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017. 4
  • Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017. 1, 2, 3, 5, 6
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014. 4
  • Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016. 1, 2
  • Wei Liu, Shengcai Liao, Weiqiang Ren, Weidong Hu, and Yinan Yu. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5187–5196, 2019.
  • Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 2
  • Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016. 2
  • Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You Only Look Once: Unified, real-time object detection. In CVPR, 2016. 1, 2
  • Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. 1, 2, 5, 6
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015. 1, 2
  • Bharat Singh, Mahyar Najibi, and Larry S. Davis. SNIPER: Efficient multi-scale training. In NeurIPS, 2018. 6
  • Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019. 2
  • Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen Lin. RepPoints: Point set representation for object detection. In ICCV, 2019. 2
  • Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2403–2412, 2018. 2, 4, 5
  • Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as Points. arXiv preprint arXiv:1904.07850, 2019. 1, 2, 3, 4, 5, 6
  • Xingyi Zhou, Jiacheng Zhuo, and Philipp Krähenbühl. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 850–859, 2019. 1, 2, 4, 6
  • Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. arXiv preprint arXiv:1903.00621, 2019. 2
  • Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019. 2, 4