We explore considering the inter-scale correlation through a pyramid convolution, which runs a 3-D convolution on both the scale and spatial dimension of the feature pyramid

Scale-Equalizing Pyramid Convolution for Object Detection

CVPR, pp.13356-13365, (2020)


Abstract

Feature pyramid has been an efficient method to extract features at different scales. Development over this method mainly focuses on aggregating contextual information at different levels while seldom touching the inter-level correlation in the feature pyramid. Early computer vision methods extracted scale-invariant features by locating...
Introduction
  • An object may appear in vastly different scales in natural images and yet should be recognized as the same.
  • Multi-scale inference [27] shares the same idea with traditional image pyramid methods [26, 30].
  • The intrinsic feature pyramid [24] formed by the different stages of a CNN provides an efficient alternative to the image pyramid.
  • The intrinsic properties of the feature pyramid have not been explored in a way that lets all feature maps contribute without distinction.
Highlights
  • We develop a scale-equalizing pyramid convolution (SEPC) to relax the discrepancy between the feature pyramid and the Gaussian pyramid by aligning the shared pyramid convolution kernel only at high-level feature maps.
  • We explore the inter-scale correlation through a pyramid convolution (PConv), which runs a 3-D convolution over both the scale and spatial dimensions of the feature pyramid.
  • The striding pattern of this pyramid convolution in both the spatial and scale dimensions is quite different from conventional ones.
  • Because the spatial sizes in the pyramid differ, the striding step of the spatial slices of the pyramid convolution kernels is proportional to the size of the convolved feature map at each pyramid level.
  • This pendulum-like striding pattern of the pyramid convolution kernel helps align the spatial positions of neighboring feature maps as they are involved in one pyramid convolution.
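The striding pattern described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are single-channel nested lists, each pyramid level is exactly half the size of the previous one, the scale kernel spans three levels, and "stride 0.5" is realized as a stride-1 convolution followed by nearest-neighbour upsampling. The names `conv2d`, `upsample2x`, and `pconv` are hypothetical helpers introduced only for this example.

```python
def conv2d(x, k, stride=1):
    """3x3 cross-correlation with zero padding 1 and an integer stride."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h, stride):
        row = []
        for j in range(0, w, stride):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    ii, jj = i + di - 1, j + dj - 1
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += k[di][dj] * x[ii][jj]
            row.append(acc)
        out.append(row)
    return out


def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2 (an assumption of this sketch)."""
    out = []
    for row in x:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise sum of two equally sized maps."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def pconv(pyramid, k_finer, k_same, k_coarser):
    """Convolve across scale and space; pyramid[0] is the largest map."""
    outputs = []
    for level, x in enumerate(pyramid):
        # Same level: ordinary stride-1 convolution.
        y = conv2d(x, k_same, stride=1)
        if level > 0:
            # Larger (finer) neighbour: stride 2 shrinks it to this level's size.
            y = add(y, conv2d(pyramid[level - 1], k_finer, stride=2))
        if level + 1 < len(pyramid):
            # Smaller (coarser) neighbour: "stride 0.5" as conv + 2x upsampling.
            y = add(y, upsample2x(conv2d(pyramid[level + 1], k_coarser, stride=1)))
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    pyramid = [[[1.0] * n for _ in range(n)] for n in (8, 4, 2)]
    result = pconv(pyramid, identity, identity, identity)
    print([(len(y), len(y[0])) for y in result])
```

Each output map keeps the size of its own level, because the stride applied to a neighbouring map grows with that map's size (2 for the larger map, 1 for the same level, 0.5 for the smaller one).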
Methods
  • Compared detectors (method, backbone, input size, AP where given): Cascade-RCNN− [1], ResNet-101, min800, 42.8; TridentDet [16], min800, 42.7; SNIP∗ [33], DCN+ResNet-101; SNIPER∗ [34]
Results
  • SEPC-lite further boosts the performance of FSAF to 40.9 AP, which is 0.5 AP higher than that of Cascade and Deformable Faster R-CNN, while remaining more than 20% faster.
Conclusion
  • The authors explore the inter-scale correlation through a pyramid convolution (PConv), which runs a 3-D convolution over both the scale and spatial dimensions of the feature pyramid.
  • Because the spatial sizes in the pyramid differ, the striding step of the spatial slices of PConv kernels is proportional to the size of the convolved feature map at each pyramid level.
  • This pendulum-like striding pattern of the PConv kernel helps align the spatial positions of neighboring feature maps as they are involved in one PConv.
  • Being lightweight and compatible with most object detectors, SEPC significantly improves detection performance with a minimal increase in computational cost.
Tables
  • Table 1: Comparison of detection AP results of different architectures. All models were trained using a ResNet-50 backbone and adopted the 1x training strategy. Results were evaluated on the COCO minival set.
  • Table 2: Comparison of PConv with other feature fusion modules, including FPN [19], HR-Net [36], PA-Net [23], NAS-FPN [8] and Libra [29], on FreeAnchor. Results evaluated on COCO minival are reported.
  • Table 3: Comparison of the single-model, single-scale test results of SEPC with other state-of-the-art object detectors. Results are evaluated on test-dev.
  • Table 4: Extension of only the PConv module to two-stage detectors, including Faster R-CNN [31], Mask R-CNN [11] and HTC [2].
Related Work
  • 2.1. Object detection

    Modern object detection architectures are generally divided into one-stage and two-stage ones. Two-stage detection representatives such as SPP [12], Fast R-CNN [9], and Faster R-CNN [31] first extract region proposals and then classify each of them. The scale variance problem is somewhat mitigated in two-stage detectors, where objects of different sizes are rescaled to the same size during the RoI pooling process. On the other hand, single-stage object detection [24] directly utilizes the intrinsic sliding-window trait of convolutions to build feature pyramids and directly predicts objects based on each pixel. Though they hold an advantage in real-time tasks due to their fast inference, single-stage detectors have lagged behind two-stage ones in accuracy. RetinaNet [20] is a milestone single-stage detector since it boosts detection performance by adopting focal loss and a new design of the detection head. Subsequent works further accelerate the model and improve its performance simultaneously by viewing object detection as a keypoint localization task, thus removing the dependency on multiple anchors at each feature map [44, 38]. But the design of the FPN and the head remains the same as in RetinaNet.

    [Figure labels: pyramid convolution; stride=0.5, stride=1, stride=2]
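The focal loss that the paragraph above credits for RetinaNet's gain can be illustrated with a short sketch. It follows the standard binary form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t) from the focal loss paper [20]; the defaults alpha = 0.25 and gamma = 2 are the commonly cited values, and the function name `focal_loss` and the clamping constant `eps` are assumptions made for this example.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive class (0..1)
    y: ground-truth label, 0 or 1
    """
    # p_t is the probability assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    # alpha_t balances positives against negatives.
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0); eps is an assumption of this sketch.
    p_t = min(max(p_t, eps), 1.0)
    # (1 - p_t)^gamma down-weights examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    for prob in (0.1, 0.5, 0.9):
        print(prob, focal_loss(prob, 1))
```

With gamma set to 0 the expression reduces to alpha-weighted cross-entropy, which is one way to check the sketch against a known baseline.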
Contributions
  • Shows that the naive pyramid convolution, together with the design of the RetinaNet head, is best suited to extracting features from a Gaussian pyramid, whose properties can hardly be satisfied by a feature pyramid.
  • Proposes to capture the inter-scale interactions through an explicit convolution in the scale dimension, forming a 3-D convolution in the feature pyramid, termed pyramid convolution.
  • Explores the possibility of relaxing these two discrepancies by devising a scale-equalizing module.
  • Proposes a lightweight pyramid convolution that conducts 3-D convolution inside the feature pyramid to account for inter-scale correlation.
  • Develops a scale-equalizing pyramid convolution to relax the discrepancy between the feature pyramid and the Gaussian pyramid by aligning the shared PConv kernel only at high-level feature maps.
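For readers unfamiliar with the Gaussian pyramid that the contributions above contrast with the feature pyramid, the following is a minimal sketch of how one is commonly built: blur, then subsample by two, repeated per level. The 5-tap binomial kernel, the edge-replication padding, and the helper names are assumptions chosen for this illustration, not details taken from the paper.

```python
# 5-tap binomial kernel, a common approximation of a Gaussian (sums to 1).
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]


def blur_rows(img):
    """Blur every row with KERNEL, replicating edge pixels at the borders."""
    w = len(img[0])
    out = []
    for row in img:
        out.append([
            sum(KERNEL[k] * row[min(max(j + k - 2, 0), w - 1)] for k in range(5))
            for j in range(w)
        ])
    return out


def transpose(img):
    """Swap rows and columns."""
    return [list(col) for col in zip(*img)]


def gaussian_blur(img):
    """Separable blur: along rows first, then along columns."""
    return transpose(blur_rows(transpose(blur_rows(img))))


def downsample2x(img):
    """Keep every second pixel in both directions."""
    return [row[::2] for row in img[::2]]


def gaussian_pyramid(img, levels):
    """Return [img, blurred and halved img, ...] with `levels` entries."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(gaussian_blur(pyramid[-1])))
    return pyramid


if __name__ == "__main__":
    image = [[float((i + j) % 2) for j in range(8)] for i in range(8)]
    for level in gaussian_pyramid(image, 3):
        print(len(level), "x", len(level[0]))
```

Every level here is produced by the same blur applied to the level below it, which is the regularity that a feature pyramid built from different network stages does not guarantee.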
References
  • [1] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.
  • [2] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
  • [3] Yuntao Chen, Chenxia Han, Naiyan Wang, and Zhaoxiang Zhang. Revisiting feature alignment for one-stage object detection. arXiv preprint arXiv:1908.01570, 2019.
  • [4] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
  • [5] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
  • [6] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6569–6578, 2019.
  • [7] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
  • [8] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7036–7045, 2019.
  • [9] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
  • [10] Bharath Hariharan, Pablo Arbelaez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 447–456, 2015.
  • [11] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [14] Tao Kong, Fuchun Sun, Anbang Yao, Huaping Liu, Ming Lu, and Yurong Chen. RON: Reverse connection with objectness prior networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5936–5944, 2017.
  • [15] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
  • [16] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. ICCV, 2019.
  • [17] Yi Li, Zhanghui Kuang, Yimin Chen, and Wayne Zhang. Data-driven neuron allocation for scale aggregation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11526–11534, 2019.
  • [18] Zuoxin Li and Fuqiang Zhou. FSSD: Feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960, 2017.
  • [19] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
  • [20] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
  • [21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755.
  • [22] Tony Lindeberg. Scale-space theory: A basic tool for analyzing structures at different scales. Journal of Applied Statistics, 21(1-2):225–270, 1994.
  • [23] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8759–8768, 2018.
  • [24] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37.
  • [25] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • [26] G Lowe. SIFT: The scale invariant feature transform. Int. J, 2:91–110, 2004.
  • [27] Mahyar Najibi, Bharat Singh, and Larry S Davis. Autofocus: Efficient multi-scale inference. 2019.
  • [28] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499.
  • [29] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 821–830, 2019.
  • [30] Marco Pedersoli, Jordi Gonzalez, and Juan Jose Villanueva. High-speed human detection using a multiresolution cascade of histograms of oriented gradients. In Iberian Conference on Pattern Recognition and Image Analysis, pages 48–55.
  • [31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • [32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 2015.
  • [33] Bharat Singh and Larry S Davis. An analysis of scale invariance in object detection: SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3578–3587, 2018.
  • [34] Bharat Singh, Mahyar Najibi, and Larry S Davis. SNIPER: Efficient multi-scale training. In Advances in Neural Information Processing Systems, pages 9310–9320, 2018.
  • [35] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5693–5703, 2019.
  • [36] Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019.
  • [37] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
  • [38] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [39] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
  • [40] Yuxin Wang, Hongtao Xie, Zilong Fu, and Yongdong Zhang. DSRN: A deep scale relationship network for scene text detection. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 947–953. AAAI Press, 2019.
  • [41] Sanghyun Woo, Soonmin Hwang, Ho-Deok Jang, and In So Kweon. Gated bidirectional feature pyramid network for accurate one-shot detection. Machine Vision and Applications, 30(4):543–555, 2019.
  • [42] Daniel E Worrall and Max Welling. Deep scale-spaces: Equivariance over scale. In Neural Information Processing Systems, 2019.
  • [43] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
  • [44] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen Lin. RepPoints: Point set representation for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 9657–9666, 2019.
  • [45] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2403–2412, 2018.
  • [46] Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and Qixiang Ye. FreeAnchor: Learning to match anchors for visual object detection. In Neural Information Processing Systems, 2019.
  • [47] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling. M2Det: A single-shot object detector based on multi-level feature pyramid network. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), 2019.
  • [48] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 840–849, 2019.
  • [49] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019.
  • Supplementary material sections:
    2. Implementation pseudocode of PConv and SEPC
    3. Discussion about Remark 1
    4. Experiment details
    5. Supplementary ablation experiments
    6. Details of FSAF
  • Supplementary material references:
    [1] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
    [2] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
    [3] Tony Lindeberg. Scale-space theory: A basic tool for analyzing structures at different scales. Journal of Applied Statistics, 21(1-2):225–270, 1994.
    [4] Daniel E Worrall and Max Welling. Deep scale-spaces: Equivariance over scale. In Neural Information Processing Systems, 2019.