Fully Convolutional Instance-aware Semantic Segmentation

CVPR, 2017.


Abstract:

We present the first fully convolutional end-to-end solution for the instance-aware semantic segmentation task. It inherits all the merits of FCNs for semantic segmentation [29] and instance mask proposal [5]. It performs instance mask prediction and classification jointly. The underlying convolutional representation is fully shared between the two sub-tasks, as well as between all regions of interest.

Introduction
  • Fully convolutional networks (FCNs) [29] have recently dominated the field of semantic image segmentation.
  • An FCN takes an input image of arbitrary size, applies a series of convolutional layers, and produces per-pixel likelihood score maps for all semantic categories, as illustrated in Figure 1(a) (a minimal sketch follows at the end of this list).
  • Instance-aware semantic segmentation needs to operate on region level, and the same pixel can have different semantics in different regions.
  • This behavior cannot be modeled by a single FCN on the whole image.
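  • A minimal, hypothetical sketch of such a per-pixel scoring head (not the paper's actual architecture; TinyFCN, the layer sizes, and the category count are illustrative only):

    # Hypothetical sketch of an FCN producing per-pixel category score maps;
    # this is not the paper's model, only an illustration of the idea above.
    import torch
    import torch.nn as nn

    class TinyFCN(nn.Module):
        def __init__(self, num_categories=21):
            super().__init__()
            # A small stack of convolutions stands in for the shared backbone.
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            )
            # A 1x1 convolution emits one likelihood score map per semantic category.
            self.classifier = nn.Conv2d(64, num_categories, kernel_size=1)

        def forward(self, image):
            # Every layer is convolutional, so inputs of arbitrary H x W are accepted.
            return self.classifier(self.backbone(image))

    scores = TinyFCN()(torch.randn(1, 3, 240, 320))  # shape: (1, 21, 240, 320)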
Highlights
  • Fully convolutional networks (FCNs) [29] have recently dominated the field of semantic image segmentation
  • Conventional FCNs do not work for the instance-aware semantic segmentation task, which requires the detection and segmentation of individual object instances
  • Instance-aware semantic segmentation needs to operate on region level, and the same pixel can have different semantics in different regions
  • This behavior cannot be modeled by a single FCN on the whole image
  • Following the protocol in [15, 7, 16, 8], model training is performed on the VOC 2012 train set, and evaluation is performed on the VOC 2012 validation set, with the additional instance mask annotations from [14]
  • We present the first fully convolutional method for instance-aware semantic segmentation
Methods
  • Ablation experiments are performed to study the proposed FCIS method on the PASCAL VOC dataset [11].
  • Comparison with MNC: the authors compare the proposed FCIS method with MNC [8], the first-place entry in the COCO 2015 segmentation challenge.
  • Both methods perform mask prediction and classification in ROIs, and share similar training/inference procedures.
Results
  • The approach takes 1.4 seconds per image, where more than 80% of the time is spent on the last per-ROI step.
  • Similar to [42], the FCIS method is applied on the original and the flipped images, and the results in the corresponding ROIs are averaged.
  • This helps increase the accuracy by 0.7%.
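  • A hypothetical sketch of this flip-and-average test-time scheme (run_fcis and flip_roi are made-up names standing in for one forward pass and a box-coordinate flip; they are not part of any released code):

    # Hypothetical sketch of horizontal-flip test-time averaging for per-ROI masks.
    import numpy as np

    def flip_roi(roi, image_width):
        # Mirror an (x0, y0, x1, y1) box about the vertical image axis.
        x0, y0, x1, y1 = roi
        return (image_width - 1 - x1, y0, image_width - 1 - x0, y1)

    def predict_with_flip(image, roi, run_fcis):
        # `run_fcis` stands in for a single forward pass that returns an (H, W)
        # foreground probability map aligned to the given ROI.
        mask = run_fcis(image, roi)
        mask_flipped = run_fcis(image[:, ::-1], flip_roi(roi, image.shape[1]))
        # Flip the second prediction back to the original ROI frame, then average.
        return 0.5 * (mask + mask_flipped[:, ::-1])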
Conclusion
  • The authors present the first fully convolutional method for instance-aware semantic segmentation.
  • It extends the existing FCN-based approaches and significantly pushes forward the state-of-the-art in both accuracy and efficiency for the task.
  • The high performance benefits from the highly integrated and efficient network architecture, especially a novel joint formulation.
Tables
  • Table 1: Ablation study of (almost) fully convolutional methods on the PASCAL VOC 2012 validation set
  • Table 2: Comparison with MNC [8] on the COCO test-dev set, using the ResNet-101 model. Timing is evaluated on an Nvidia K40 GPU
  • Table 3: Results of using networks of different depths in FCIS
  • Table 4: Instance-aware semantic segmentation results of different entries in the COCO segmentation challenges (2015 and 2016), on the COCO test-dev set
Related work
  • Semantic Image Segmentation: the task is to assign every pixel in the image a semantic category label, without distinguishing object instances. Recently, this field has been dominated by a family of approaches based on FCNs [29]. FCNs have been extended with global context [28], multi-scale feature fusion [4], and deconvolution [31]. Recent works [3, 43, 37, 24] integrate FCNs with conditional random fields (CRFs), and the expensive CRFs are replaced by a more efficient domain transform in [2]. As per-pixel category labeling is expensive, the supervision signal has been relaxed to boxes [6], scribbles [23], or weakly supervised image-level class labels [19, 20].
Funding
  • The approach takes 1.4 seconds per image, where more than 80% of the time is spent on the last per-ROI step
  • The mAPr scores of the naïve MNC baseline are 59.1% and 36.0% at IoU thresholds of 0.5 and 0.7, respectively. They are 5.5% and 12.9% lower than those of the original MNC [8], which keeps 10 layers of ResNet-101 in the per-ROI sub-networks
  • Multiscale testing improves the accuracy by 2.8%
  • Similar to [42], the FCIS method is applied on the original and the flipped images, and the results in the corresponding ROIs are averaged. This helps increase the accuracy by 0.7%
  • By taking the enclosing boxes of the instance masks as detected bounding boxes, it achieves an object detection accuracy of 39.7% on COCO test-dev set, measured by the standard mAPb@[0.5:0.95] score
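  • The box-from-mask conversion and the IoU-thresholded mAPr metric quoted above can be made concrete with a small sketch; these helpers are illustrative and are not the official VOC/COCO evaluation code:

    # Illustrative helpers: the enclosing box of a binary instance mask, and the
    # mask IoU that the mAPr metric thresholds at values such as 0.5 and 0.7.
    import numpy as np

    def enclosing_box(mask):
        # mask: (H, W) boolean array; returns (x0, y0, x1, y1) in pixel coordinates.
        ys, xs = np.nonzero(mask)
        return xs.min(), ys.min(), xs.max(), ys.max()

    def mask_iou(a, b):
        # Intersection-over-union of two boolean masks of the same shape.
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return float(inter) / float(union) if union > 0 else 0.0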
Study subjects and analysis
Our joint formulation fuses the two answers into two scores: inside and outside. There are three cases: 1) high inside score and low outside score: detection+, segmentation+; 2) low inside score and high outside score: detection+, segmentation-; 3) both scores are low: detection-, segmentation-. The two scores answer the two questions jointly via softmax and max operations.
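A minimal numeric sketch of this fusion, for a single ROI and a single category, might look as follows; the score maps, shapes, and function name are illustrative, not the paper's exact implementation:

    # Illustrative sketch of the inside/outside fusion for one ROI and one category.
    import numpy as np

    def fuse_inside_outside(inside, outside):
        # inside, outside: (H, W) score maps assembled for the ROI.
        # Softmax over the two scores gives the per-pixel foreground (segmentation)
        # probability, written here in its numerically stable sigmoid form.
        seg_prob = 1.0 / (1.0 + np.exp(outside - inside))
        # Max of the two scores gives the per-pixel "this category" likelihood;
        # averaging it over the ROI yields the per-category classification score.
        det_score = np.maximum(inside, outside).mean()
        return seg_prob, det_score

    inside = np.random.randn(21, 21)
    outside = np.random.randn(21, 21)
    mask_prob, cls_score = fuse_inside_outside(inside, outside)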

Reference
  • [1] P. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
  • [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
  • [4] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
  • [5] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
  • [6] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
  • [7] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
  • [8] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
  • [9] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
  • [13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [14] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
  • [15] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
  • [16] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [19] S. Hong, H. Noh, and B. Han. Decoupled deep neural network for semi-supervised semantic segmentation. In NIPS, 2015.
  • [20] S. Hong, J. Oh, B. Han, and H. Lee. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In CVPR, 2016.
  • [21] K. Li, B. Hariharan, and J. Malik. Iterative instance segmentation. In CVPR, 2016.
  • [22] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan. Proposal-free network for instance-level object segmentation. arXiv preprint, 2015.
  • [23] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.
  • [24] G. Lin, C. Shen, A. van den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, 2016.
  • [25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [26] S. Liu, X. Qi, J. Shi, H. Zhang, and J. Jia. Multi-scale patch aggregation (MPA) for simultaneous detection and segmentation. In CVPR, 2016.
  • [27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. In ECCV, 2016.
  • [28] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. In ICLR workshop, 2016.
  • [29] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [31] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
  • [32] P. O. Pinheiro, R. Collobert, and P. Dollar. Learning to segment object candidates. In NIPS, 2015.
  • [33] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollar. Learning to refine object segments. In ECCV, 2016.
  • [34] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [35] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. PAMI, 2016.
  • [36] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. arXiv preprint, 2015.
  • [37] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv preprint, 2015.
  • [38] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
  • [39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [42] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollar. A multipath network for object detection. In ECCV, 2016.
  • [43] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.