FastMask: Segment Multi-scale Object Candidates in One Shot

CVPR, pp. 2280-2288, 2017.

Keywords:
object proposal, neck module, deep convolutional, art segment, region proposal network

Abstract:

Objects appear at different scales in natural images. This requires methods for object-centric tasks (e.g. object proposal) to perform robustly across variations in object scale. In this paper, we present a novel segment proposal framework, FastMask, which takes advantage of hierarchical features in deep convolut...

Introduction
  • Object proposal is considered the first and fundamental step in object detection [8, 25, 1, 16, 10, 29].
  • Unlike traditional object proposal methods, segment proposal algorithms are expected to generate a pixel-wise segment, rather than a bounding box, for each object.
  • From this perspective, segment proposal inherits from both object proposal and image segmentation, and takes a step further towards simultaneous detection and segmentation [11], which brings more challenges to overcome.
  • Compared to bounding-box-based object proposal, scale variance becomes a more serious problem for
Highlights
  • Object proposal is considered the first and fundamental step in object detection [8, 25, 1, 16, 10, 29].
  • We introduce the concept of a neck module, whose job is to recurrently zoom out the feature maps extracted by the body module into feature pyramids, and to feed the feature pyramids into the head module for multi-scale inference.
  • We evaluate our framework on the MS COCO benchmark [18], where it achieves state-of-the-art results while running in near real time.
  • As average recall correlates well with object proposal quality [15], we summarize Average Recall (AR) between IoU 0.5 and 0.95 for a fixed number N of proposals, denoted "AR@N", to measure the performance of algorithms (we use N = 10, 100 and 1000).
  • In this paper we present an innovative framework, FastMask, for efficient segment-based object proposal.
  • Instead of building a pyramid of input images, FastMask learns to encode a feature pyramid with a neck module, and performs one-shot training and inference.
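
The AR@N metric described in the highlights can be sketched as follows. This is a minimal illustration, not the official COCO evaluation code. It assumes that masks are given as sets of (row, col) pixel coordinates, that proposals are already sorted by descending score, and that the IoU thresholds run from 0.5 to 0.95 in steps of 0.05 (the usual COCO convention, which the text does not state). The function names `mask_iou` and `average_recall_at_n` are hypothetical.

```python
def mask_iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_recall_at_n(gt_masks, proposals, n):
    """Average recall over IoU thresholds 0.50..0.95 using the top-n proposals.

    gt_masks:  list of ground-truth masks (sets of pixels).
    proposals: list of proposal masks, assumed sorted by descending score.
    """
    if not gt_masks:
        return 0.0
    top = proposals[:n]
    # Best IoU reached for each ground-truth object by any kept proposal.
    best = [max((mask_iou(gt, p) for p in top), default=0.0) for gt in gt_masks]
    # Assumed threshold grid: 0.50, 0.55, ..., 0.95.
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    # Recall at one threshold = fraction of ground-truth objects matched.
    recalls = [sum(b >= t for b in best) / len(best) for t in thresholds]
    return sum(recalls) / len(recalls)
```

A proposal that covers three of the four pixels of a ground-truth mask has IoU 0.75, so it counts as a match at six of the ten thresholds and contributes 0.6 to the average for that object.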
Methods
  • Table 1 compares four neck designs (Avg-Pooling, Max-Pooling, Feed-Forward and Residual) by AR@100, AR^S@100, AR^M@100 and AR^L@100, i.e. average recall at 100 proposals overall and for small, medium and large objects.
  • ... a 1 × 1 one) to zoom out feature maps, in order to reduce the smoothing effect of average pooling while preserving feature semantics.
  • Note that the authors obtain a large margin in average recall for large-scale objects, which are decoded from the top feature maps.
  • This verifies the effectiveness of the residual neck in encoding a feature pyramid.
  • DeepMaskZoom∗ and SharpMaskZoom run inference on images rescaled by factors 2^e, with e in [-2.5, -2.0, -1.5, -1.0, -0.5, 0, 0.5, 1], to obtain superior performance on a diverse range of object segments.
  • This is similar to the two-stream network, where the authors input an image up-sampled by a factor of two.
  • The authors' most effective model uses a two-stream structure with a 39-layer ResNet; their fastest model uses a one-stream structure with PvaNet [14], which is lightweight and fast.
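
The idea of a neck that recurrently zooms out feature maps can be illustrated with the simplest design in the comparison, average pooling. The sketch below is an assumption-laden toy, not the authors' implementation: it works on a single-channel 2D map stored as a list of rows, uses 2 × 2 pooling with stride 2, and contains no learned layers, so it does not model the residual neck. The names `avg_pool_2x2` and `build_feature_pyramid` are hypothetical.

```python
def avg_pool_2x2(fmap):
    """Halve a 2D feature map (list of rows) with 2x2 average pooling, stride 2.

    A trailing odd row or column is dropped.
    """
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [
        [
            (fmap[2 * i][2 * j] + fmap[2 * i][2 * j + 1]
             + fmap[2 * i + 1][2 * j] + fmap[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def build_feature_pyramid(fmap, min_size=1):
    """Apply the pooling step recurrently and collect every level.

    Level 0 is the input map; each later level is half the size of the
    previous one. Pooling stops before a side would fall below min_size.
    """
    pyramid = [fmap]
    while (len(pyramid[-1]) // 2 >= min_size
           and len(pyramid[-1][0]) // 2 >= min_size):
        pyramid.append(avg_pool_2x2(pyramid[-1]))
    return pyramid
```

Because every level is derived from the previous one, the body network runs once on the input image and the head can be applied to each level, which is the one-shot property the summary attributes to FastMask.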
Results
  • According to Table 3, the authors outperform all state-of-the-art methods on bounding-box proposals by a large margin and obtain very competitive results on segmentation proposals.
Conclusion
  • In this paper the authors present an innovative framework, FastMask, for efficient segment-based object proposal.
  • A scale-tolerant head module is proposed to separate the foreground object from background noise, yielding significantly better segmentation accuracy.
  • On the MS COCO benchmark, FastMask outperforms all state-of-the-art segment proposal methods in average recall while being several times faster.
  • As an effective and efficient segment proposal method, FastMask is believed to have great potential in other tasks.
Summary
  • Objectives:

    The authors aim to address scale variance in segment proposal by leveraging the hierarchical feature pyramid [9] from convolutional neural networks (CNNs).
Tables
  • Table 1: Comparison of different neck module designs on the COCO benchmark. VGGNet [24] is used as the body network for all necks
  • Table 2: Comparison of different head modules on the COCO benchmark. VGGNet [24] is used as the body network
  • Table 3: Object segment proposal results on the COCO validation set for box and segmentation proposals. Note that we also report the body network for each corresponding method
  • Table 4: Trade-off between scale density and performance
  • Table 5: Speed study with state-of-the-art methods
Related work
  • Bbox-based object proposal. Most bbox-based object proposal methods rely on dense sliding windows over an image pyramid. EdgeBox [31] and Bing [4] use edge features to make a prediction for each sliding window, while [29] uses gradient features. More recently, DeepBox [17] trains a CNN to re-rank the proposals generated by EdgeBox, while MultiBox [7] generates proposals directly from convolutional feature maps. Ren et al. [22] proposed a region proposal network (RPN) to handle object candidates at varying scales.
  • Segment-based object proposal. Segment proposal algorithms aim to find diverse regions in an image that are likely to contain objects. Traditional segment proposal methods such as SelectiveSearch [25], MCG [1] and Geodesic [16] first over-segment the image into superpixels and then merge the superpixels in a bottom-up fashion. Inspired by the success of CNNs in image segmentation [23, 3, 28], previous works [6, 2] perform segmentation on bbox-based object proposal results to obtain object segments. Among state-of-the-art methods, DeepMask [20] proposes a body-head structure to decode object masks from CNN feature maps, and SharpMask [21] further adds a backward branch to refine the masks. However, all these methods rely on an image pyramid during inference, which limits their application in practice.
  • Visual attention. Instead of using holistic image features from a CNN, a number of recent works [26, 19, 30, 27] have explored visual attention to highlight discriminative regions inside images and reduce the effect of noisy backgrounds. In this paper we apply such an attention mechanism to improve instance-level segmentation performance.
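
To give a rough sense of why reliance on an image pyramid is costly, the sketch below totals the pixels processed across all pyramid levels relative to a single pass over the original image. It uses the scale exponents that the Methods section reports for DeepMaskZoom∗ and SharpMaskZoom. Pixel count is only a crude proxy for compute and is an assumption of this example, not a measurement from the paper. The names `ZOOM_EXPONENTS`, `pyramid_scales` and `relative_pixel_cost` are hypothetical.

```python
# Scale exponents e, so that each pyramid level is the image resized by 2**e.
ZOOM_EXPONENTS = [-2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0]


def pyramid_scales(exponents):
    """Linear resize factor for each pyramid level."""
    return [2.0 ** e for e in exponents]


def relative_pixel_cost(exponents):
    """Total pixels over all levels, relative to the original image.

    Resizing each side by s changes the pixel count by s * s.
    """
    return sum(s * s for s in pyramid_scales(exponents))
```

For the eight scales listed, the total comes to about 7.97 times the pixels of the original image, dominated by the 2× up-sampled level, which alone accounts for 4×. A one-shot method that processes the image at a single scale has a relative cost of 1 under the same proxy.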
Funding
  • Sha are partially supported by NSF IIS1065243, 1451412, 1513966, 1208500, CCF-1139148, a Google Research Award, an Alfred
Reference
  • [1] P. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
  • [2] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
  • [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
  • [4] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014.
  • [5] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
  • [6] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
  • [7] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
  • [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [9] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, pages 437–446, 2015.
  • [10] R. Gokberk Cinbis, J. Verbeek, and C. Schmid. Segmentation driven object detection with Fisher vectors. In CVPR, 2013.
  • [11] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • [14] S. Hong, B. Roh, K.-H. Kim, Y. Cheon, and M. Park. PVANet: Lightweight deep neural networks for real-time object detection. arXiv preprint arXiv:1611.08588, 2016.
  • [15] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? IEEE T-PAMI, 2016.
  • [16] P. Krähenbühl and V. Koltun. Geodesic object proposals. In ECCV, 2014.
  • [17] W. Kuo, B. Hariharan, and J. Malik. DeepBox: Learning objectness with convolutional networks. In ICCV, 2015.
  • [18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [19] C. Liu, J. Mao, F. Sha, and A. Yuille. Attention correctness in neural image captioning. arXiv preprint arXiv:1605.09553, 2016.
  • [20] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
  • [21] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
  • [22] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [23] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE T-PAMI, 2016.
  • [24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [25] K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.
  • [26] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • [27] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. 2016.
  • [28] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
  • [29] Z. Zhang, J. Warrell, and P. H. Torr. Proposal generation for object detection using cascaded ranking SVMs. In CVPR, 2011.
  • [30] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. 2016.
  • [31] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.