ThunderNet: Towards Real-time Generic Object Detection

arXiv: Computer Vision and Pattern Recognition, 2019.

Keywords
lightweight backbone, Region Proposal Network, mobile device, Feature Pyramid Network, computer vision

Abstract

Real-time generic object detection on mobile platforms is a crucial but challenging computer vision task. However, previous CNN-based detectors suffer from enormous computational cost, which hinders them from real-time inference in computation-constrained scenarios. In this paper, we investigate the effectiveness of two-stage detectors in...

Introduction
  • Real-time generic object detection on mobile devices is a crucial but challenging task in computer vision.
  • Modern CNN-based detectors are resource-hungry and require massive computation to achieve ideal detection accuracy, which hinders them from real-time inference in mobile scenarios.
  • From the perspective of network structure, CNN-based detectors can be divided into the backbone part, which extracts features from the image, and the detection part, which detects object instances in the image.
  • (Figure: inference-speed comparison with MobileNetV1-SSDLite, MobileNetV2-SSDLite, Pelee, and Tiny-DSOD; 13.8 fps on Snapdragon 845.)
Highlights
  • Real-time generic object detection on mobile devices is a crucial but challenging task in computer vision
  • We investigate the drawbacks in previous lightweight backbones, and present a lightweight backbone named SNet designed for object detection
  • We investigate the balance between the input resolution, the backbone, and the detection head
  • ThunderNet with SNet146 performs better than Tiny-DSOD by 6.5 mAP under similar computational cost
  • We investigate the effectiveness of two-stage detectors in real-time generic object detection and propose a lightweight two-stage detector named ThunderNet
  • We analyze the drawbacks in prior lightweight backbones and present a lightweight backbone designed for object detection
Methods
  • The authors evaluate the effectiveness of ThunderNet on PASCAL VOC [5] and COCO [18] benchmarks.
  • The authors' detectors are trained end-to-end on 4 GPUs using synchronized SGD with a weight decay of 0.0001 and a momentum of 0.9.
  • Multi-scale training with {240, 320, 480} pixels is adopted.
  • The networks are trained for 62.5K iterations on the VOC dataset and 375K iterations on the COCO dataset.
  • Cross-GPU Batch Normalization (CGBN) [22] is used to learn batch normalization statistics.
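The training recipe above can be sketched as follows. The momentum, weight decay, and scale set come from the summary; everything else (the toy quadratic objective, the learning rate) is a hypothetical placeholder, since the real objective is ThunderNet's detection loss:

```python
import random

# Hyperparameters stated in the summary; the quadratic below is only a
# stand-in objective so the update rule can be demonstrated end to end.
WEIGHT_DECAY = 1e-4
MOMENTUM = 0.9
SCALES = [240, 320, 480]   # multi-scale training: per-iteration resolutions

def sample_scale(rng=random):
    """Pick a training resolution for the current iteration."""
    return rng.choice(SCALES)

def sgd_step(w, grad, velocity, lr):
    """One SGD update with momentum and classic L2 weight decay
    folded into the gradient (the usual framework convention)."""
    g = grad + WEIGHT_DECAY * w
    velocity = MOMENTUM * velocity - lr * g
    return w + velocity, velocity

# Demo on a 1-D quadratic loss f(w) = (w - 3)^2, gradient 2(w - 3).
w, v = 0.0, 0.0
for step in range(200):
    scale = sample_scale()   # unused in this toy demo; it would set the
                             # batch's input resolution before the forward pass
    w, v = sgd_step(w, grad=2 * (w - 3), velocity=v, lr=0.01)
print(round(w, 2))  # → 3.0 (weight decay pulls the fixed point slightly below 3)
```

Note that in synchronized SGD the gradient would be averaged across the 4 GPUs before this update; the update rule itself is unchanged.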
Results
  • Results on PASCAL VOC

    The PASCAL VOC dataset consists of natural images drawn from 20 classes.
  • As shown in Table 3, ThunderNet with SNet49 achieves MobileNet-SSD level accuracy with 22% of the FLOPs. ThunderNet with SNet146 surpasses MobileNet-SSD [11], MobileNet-SSDLite [28], and Pelee [31] with less than 40% of the computational cost.
  • It is noteworthy that the approach achieves a considerably higher AP75, which suggests the model localizes objects more accurately.
  • This is consistent with the initial motivation to design two-stage real-time detectors.
Conclusion
  • The authors investigate the effectiveness of two-stage detectors in real-time generic object detection and propose a lightweight two-stage detector named ThunderNet. In the backbone part, the authors analyze the drawbacks in prior lightweight backbones and present a lightweight backbone designed for object detection.
  • ThunderNet achieves superior detection accuracy to prior one-stage detectors with significantly less computational cost.
  • To the best of the authors' knowledge, ThunderNet is the first real-time detector and achieves the fastest single-thread speed reported on ARM platforms.
Tables
  • Table1: Architecture of the SNet backbone networks. SNet uses ShuffleNetV2 basic blocks but replaces all 3×3 depthwise convolutions with 5×5 depthwise convolutions
  • Table2: Evaluation results on VOC 2007 test. ThunderNet surpasses competing models with significantly less computational cost
  • Table3: Evaluation results on COCO test-dev. ThunderNet with SNet49 achieves MobileNet-SSD level accuracy with 22% of the FLOPs
  • Table4: Evaluation of different input resolutions on COCO test-dev. Neither large backbones with small images nor small backbones with large images is optimal
  • Table5: Evaluation of different backbones on ImageNet classification and COCO test-dev. DWConv: depthwise convolution
  • Table6: Evaluation of lightweight backbones on COCO test-dev
  • Table7: Ablation studies on the detection part on COCO test-dev. We use a compressed Light-Head R-CNN with SNet146 as the baseline (BL), and gradually add small RPN (SRPN), small R-
  • Table8: MFLOPs and AP of different detection head designs on COCO test-dev. The large-backbone-small-head model outperforms the small-backbone-large-head model with less FLOPs
  • Table9: Inference speed in fps on Snapdragon 845 (ARM), Xeon
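Table 1 notes that SNet swaps ShuffleNetV2's 3×3 depthwise convolutions for 5×5 ones. This is affordable because a depthwise convolution's multiply count scales with k²·C·H·W rather than the k²·C_in·C_out·H·W of a standard convolution, so enlarging the kernel enlarges the receptive field at modest cost. A rough arithmetic sketch with hypothetical layer shapes (the 20×20×128 stage below is illustrative, not taken from the paper):

```python
def conv_mults(h, w, c_in, c_out, k, depthwise=False):
    """Multiply count of a k x k convolution on an h x w feature map
    (stride 1, 'same' padding). A depthwise conv applies one k x k
    filter per input channel, so c_out is ignored in that case."""
    if depthwise:
        return h * w * c_in * k * k
    return h * w * c_in * c_out * k * k

# Hypothetical stage shape: 20x20 feature map, 128 channels.
dw3 = conv_mults(20, 20, 128, 128, 3, depthwise=True)   # 3x3 depthwise
dw5 = conv_mults(20, 20, 128, 128, 5, depthwise=True)   # 5x5 depthwise
std3 = conv_mults(20, 20, 128, 128, 3)                  # standard 3x3

print(dw5 / dw3)    # → 2.777...: the 5x5 depthwise costs 25/9 of the 3x3 one
print(std3 / dw5)   # → 46.08: yet it is still far cheaper than a standard 3x3
```

This is why SNet can afford larger kernels throughout the backbone while staying within a mobile FLOPs budget.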
Related Work
  • CNN-based object detectors. CNN-based object detectors are commonly classified into two-stage and one-stage detectors. Among two-stage detectors, R-CNN [8] is one of the earliest CNN-based detection systems, and progressive improvements [9, 7] have since been proposed for better accuracy and efficiency. Faster R-CNN [27] introduces the Region Proposal Network (RPN) to generate region proposals instead of relying on pre-computed proposals. R-FCN [4] designs a fully convolutional architecture that shares computation over the entire image. On the other hand, one-stage detectors such as SSD [19] and YOLO [24, 25, 26] achieve real-time inference on GPUs with very competitive accuracy. RetinaNet [17] proposes the focal loss to address the foreground-background class imbalance and achieves significant accuracy improvements. In this work, we present a two-stage detector which focuses on efficiency.
  • Real-time generic object detection. Real-time object detection is another important problem for CNN-based detectors. Commonly, one-stage detectors are regarded as the key to real-time detection: YOLO [24, 25, 26] and SSD [19] run in real time on GPUs, and when coupled with small backbone networks, lightweight one-stage detectors such as MobileNet-SSD [11], MobileNetV2-SSDLite [28], Pelee [31] and Tiny-DSOD [13] achieve inference on mobile devices at low frame rates. Among two-stage detectors, Light-Head R-CNN [14] uses a light detection head and runs at over 100 fps on GPU. This raises a question: are two-stage detectors better than one-stage detectors in real-time detection? In this paper, we demonstrate the effectiveness of two-stage detectors in real-time detection. Compared with prior lightweight one-stage detectors, ThunderNet achieves a better balance between accuracy and efficiency.
  • Backbone networks for detection. Modern CNN-based detectors typically adopt image classification networks [30, 10, 32, 12] as backbones. FPN [16] exploits the inherent multi-scale, pyramidal hierarchy of CNNs to construct feature pyramids. Lightweight detectors also benefit from recent progress in small networks such as MobileNet [11, 28] and ShuffleNet [33, 20]. However, image classification and object detection require different network properties, so simply transferring classification networks to object detection is suboptimal. For this reason, DetNet [15] designs a backbone specifically for object detection, and recent lightweight detectors [31, 13] also design specialized backbones. However, this area is still not well studied. In this work, we investigate the drawbacks of prior lightweight backbones and present a lightweight backbone for the real-time detection task.
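The focal loss mentioned above down-weights well-classified examples so the flood of easy background anchors does not swamp training in one-stage detectors. A minimal sketch of the RetinaNet formulation, FL(p_t) = -α_t·(1 - p_t)^γ·log(p_t), with the paper's default γ = 2 and α = 0.25:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    p is the predicted foreground probability; y is 1 (foreground) or 0."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy background example (p near 0) is down-weighted almost to zero,
# while a badly misclassified one keeps a sizeable loss.
easy_bg = focal_loss(0.01, y=0)   # well-classified negative, loss ~1e-6
hard_bg = focal_loss(0.90, y=0)   # misclassified negative, loss ~1.4
print(easy_bg < hard_bg)          # → True
```

With γ = 0 and α = 1 the expression reduces to the standard cross-entropy, which makes the modulating factor (1 - p_t)^γ the only new ingredient.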
References
  • N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS: improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pages 5561–5569, 2017.
  • Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. arXiv preprint arXiv:1712.00726, 2017.
  • F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
  • J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
  • R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision, pages 346–361.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • Y. Li, J. Li, W. Lin, and J. Li. Tiny-dsod: Lightweight object detection for resource-restricted usages. arXiv preprint arXiv:1807.11013, 2018.
  • Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Light-head r-cnn: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.
  • Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Detnet: Design backbone for object detection. In The European Conference on Computer Vision (ECCV), 2018.
  • T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
  • T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37.
  • N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
  • R. Mehta and C. Ozturk. Object detection at 200 frames per second. arXiv preprint arXiv:1805.06361, 2018.
  • C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. Megdet: A large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6181–6189, 2018.
  • C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4353–4361, 2017.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
  • J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  • A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • R. J. Wang, X. Li, and C. X. Ling. Pelee: a real-time object detection system on mobile devices. In Advances in Neural Information Processing Systems, pages 1963–1972, 2018.
  • S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
  • X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.