Fully Convolutional Instance-aware Semantic Segmentation
CVPR, 2017.
Abstract:
We present the first fully convolutional end-to-end solution for the instance-aware semantic segmentation task. It inherits all the merits of FCNs for semantic segmentation [29] and instance mask proposal [5]. It performs instance mask prediction and classification jointly. The underlying convolutional representation is fully shared between the two sub-tasks, as well as between all regions of interest.
Introduction
- Fully convolutional networks (FCNs) [29] have recently dominated the field of semantic image segmentation.
- An FCN takes an input image of arbitrary size, applies a series of convolutional layers, and produces per-pixel likelihood score maps for all semantic categories, as illustrated in Figure 1(a) (see the sketch after this list).
- Instance-aware semantic segmentation needs to operate at the region level, and the same pixel can have different semantics in different regions.
- This behavior cannot be modeled by a single FCN on the whole image.
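To make the per-pixel score-map idea concrete, below is a minimal illustrative sketch (PyTorch; not the authors' network, which builds on a ResNet-101 backbone [18]): a purely convolutional model maps an image of arbitrary size to one likelihood score map per semantic category.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self, num_classes=21):  # e.g. 20 PASCAL VOC classes + background
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # a 1x1 convolution emits one score map per semantic category
        self.score = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        return self.score(self.features(x))  # (N, num_classes, H, W)

# works for any input size because there are no fully connected layers
scores = TinyFCN()(torch.randn(1, 3, 240, 320))
print(scores.shape)  # torch.Size([1, 21, 240, 320])
```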
Highlights
- Fully convolutional networks (FCNs) [29] have recently dominated the field of semantic image segmentation
- Conventional FCNs do not work for the instance-aware semantic segmentation task, which requires the detection and segmentation of individual object instances
- Instance-aware semantic segmentation needs to operate at the region level, and the same pixel can have different semantics in different regions
- This behavior cannot be modeled by a single FCN on the whole image
- Following the protocol in [15, 7, 16, 8], model training is performed on the VOC 2012 train set, and evaluation is performed on the VOC 2012 validation set, with the additional instance mask annotations from [14]
- We present the first fully convolutional method for instance-aware semantic segmentation
Methods
- Ablation experiments are performed to study the proposed FCIS method on the PASCAL VOC dataset [11].
- Comparison with MNC: The authors compare the proposed FCIS method with MNC [8], the first-place entry in the COCO 2015 segmentation challenge.
- Both methods perform mask prediction and classification in ROIs, and share similar training/inference procedures.
Results
- The MNC approach [8] takes 1.4 seconds per image, where more than 80% of the time is spent on the last per-ROI step.
- Similar to [42], the FCIS method is applied to the original and the horizontally flipped images, and the results in the corresponding ROIs are averaged (see the sketch after this list).
- This helps increase the accuracy by 0.7%.
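A minimal sketch of the horizontal-flip averaging referenced above (NumPy; `predict_masks` is a hypothetical stand-in for the trained model, assumed to return one foreground-probability map per ROI in a fixed ROI order):

```python
import numpy as np

def flip_averaged_masks(image, predict_masks):
    """image: (H, W, 3) array; predict_masks: hypothetical callable returning
    a list of (H, W) foreground-probability maps, one per ROI."""
    masks = predict_masks(image)
    flipped_masks = predict_masks(image[:, ::-1])    # run on the mirrored image
    restored = [m[:, ::-1] for m in flipped_masks]   # flip predictions back
    # average per-pixel probabilities of corresponding ROIs (assumed same order)
    return [(a + b) / 2.0 for a, b in zip(masks, restored)]

# toy usage with a dummy predictor that returns one constant mask
dummy_predictor = lambda img: [np.full(img.shape[:2], 0.5)]
averaged = flip_averaged_masks(np.zeros((4, 6, 3)), dummy_predictor)
print(averaged[0].shape)  # (4, 6)
```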
Conclusion
- The authors present the first fully convolutional method for instance-aware semantic segmentation.
- It extends the existing FCN-based approaches and significantly pushes forward the state-of-the-art in both accuracy and efficiency for the task.
- The high performance benefits from the highly integrated and efficient network architecture, especially a novel joint formulation.
Tables
- Table1: Ablation study of (almost) fully convolutional methods on PASCAL VOC 2012 validation set
- Table2: Comparison with MNC [8] on COCO test-dev set, using a ResNet-101 model. Timing is evaluated on an Nvidia K40 GPU
- Table3: Results of using networks of different depths in FCIS
- Table4: Instance-aware semantic segmentation results of different entries for the COCO segmentation challenge (2015 and 2016) on COCO test-dev set
Related work
- Semantic Image Segmentation: The task is to assign every pixel in the image a semantic category label. It does not distinguish object instances. Recently, this field has been dominated by a prevalent family of approaches based on FCNs [29]. The FCNs are extended with global context [28], multi-scale feature fusion [4], and deconvolution [31]. Recent works in [3, 43, 37, 24] integrated FCNs with conditional random fields (CRFs). The expensive CRFs are replaced by a more efficient domain transform in [2]. As the per-pixel category labeling is expensive, the supervision signals in FCNs have been relaxed to boxes [6], scribbles [23], or weakly supervised image class labels [19, 20].
Additional results
- The mAPr scores of the naïve MNC baseline are 59.1% and 36.0% at IoU thresholds of 0.5 and 0.7, respectively (see the mask-IoU sketch after this list). They are 5.5% and 12.9% lower than those of the original MNC [8], which keeps 10 layers of ResNet-101 in the per-ROI sub-networks
- Multiscale testing improves the accuracy by 2.8%
- By taking the enclosing boxes of the instance masks as detected bounding boxes, the approach achieves an object detection accuracy of 39.7% on the COCO test-dev set, measured by the standard mAPb@[0.5:0.95] score
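The mAPr numbers above count a predicted instance mask as correct only when its overlap with a same-class ground-truth mask reaches the IoU threshold (0.5 or 0.7). A minimal NumPy sketch of the mask-IoU computation (illustrative only, not the official COCO/VOC evaluation code):

```python
import numpy as np

def mask_iou(pred, gt):
    """pred, gt: boolean (H, W) instance masks."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union > 0 else 0.0

pred = np.zeros((4, 4), dtype=bool); pred[:, :3] = True  # 12 foreground pixels
gt   = np.zeros((4, 4), dtype=bool); gt[:, 1:]  = True   # 12 pixels, 8 shared
print(mask_iou(pred, gt))  # 8 / 16 = 0.5 -> a match at the 0.5 threshold, not at 0.7
```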
Study subjects and analysis
cases: 3
The joint formulation fuses the answers to the detection and segmentation questions into two scores per pixel: inside and outside. There are three cases: 1) high inside score and low outside score: detection+, segmentation+; 2) low inside score and high outside score: detection+, segmentation-; 3) both scores are low: detection-, segmentation-. The two scores answer the two questions jointly via softmax and max operations, as sketched below.
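A minimal NumPy sketch of that fusion for one ROI and one category (illustrative only, not the authors' implementation, which first assembles position-sensitive score maps): a per-pixel softmax over (inside, outside) answers the segmentation question, while a per-pixel max followed by averaging over the ROI yields this category's detection score (a softmax across categories would then give the classification probability).

```python
import numpy as np

def fuse_inside_outside(inside, outside):
    """inside, outside: (H, W) score maps assembled for one ROI and one category."""
    # segmentation: probability that each pixel lies inside the instance mask
    fg_prob = np.exp(inside) / (np.exp(inside) + np.exp(outside))
    # detection: a pixel supports the category if either score is high;
    # averaging over the ROI gives this category's (pre-softmax) detection score
    detection_score = np.maximum(inside, outside).mean()
    return fg_prob, detection_score

inside  = np.array([[ 3.0,  3.0], [-2.0, -2.0]])  # top row: inside the instance mask
outside = np.array([[-2.0, -2.0], [ 3.0,  3.0]])  # bottom row: on the object but outside the mask
fg_prob, det = fuse_inside_outside(inside, outside)
print(fg_prob.round(2))  # high foreground probability only on the top row
print(det)               # high value -> case 1 or 2 above: a positive detection
```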
Reference
- [1] P. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
- [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
- [4] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
- [5] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
- [6] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
- [7] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
- [8] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
- [9] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
- [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- [11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
- [13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- [14] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
- [15] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
- [16] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
- [17] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
- [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [19] S. Hong, H. Noh, and B. Han. Decoupled deep neural network for semi-supervised semantic segmentation. In NIPS, 2015.
- [20] S. Hong, J. Oh, B. Han, and H. Lee. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In CVPR, 2016.
- [21] K. Li, B. Hariharan, and J. Malik. Iterative instance segmentation. In CVPR, 2016.
- [22] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan. Proposal-free network for instance-level object segmentation. arXiv preprint, 2015.
- [23] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.
- [24] G. Lin, C. Shen, A. van den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, 2016.
- [25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [26] S. Liu, X. Qi, J. Shi, H. Zhang, and J. Jia. Multi-scale patch aggregation (MPA) for simultaneous detection and segmentation. In CVPR, 2016.
- [27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. In ECCV, 2016.
- [28] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. In ICLR workshop, 2016.
- [29] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- [31] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
- [32] P. O. Pinheiro, R. Collobert, and P. Dollar. Learning to segment object candidates. In NIPS, 2015.
- [33] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollar. Learning to refine object segments. In ECCV, 2016.
- [34] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- [35] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. PAMI, 2016.
- [36] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. arXiv preprint, 2015.
- [37] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv preprint, 2015.
- [38] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
- [39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- [42] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollar. A multipath network for object detection. In ECCV, 2016.
- [43] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.