BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation
Computer Vision – ECCV 2018, Part XIII (2018): 334–349
Abstract
Semantic segmentation requires both rich spatial information and a sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path ...
Introduction
- Semantic segmentation, which amounts to assigning a semantic label to each pixel, is a fundamental task in computer vision.
- Instead of resizing the input image, some works prune the channels of the network, especially in the early stages of the base model, to boost inference speed [1, 8, 25]; this, however, weakens the spatial capacity.
Highlights
- Semantic segmentation, which amounts to assigning a semantic label to each pixel, is a fundamental task in computer vision
- Based on the above observation, we propose the Bilateral Segmentation Network (BiSeNet) with two parts: Spatial Path (SP) and Context Path (CP)
- We propose a novel approach that decouples the functions of spatial information preservation and receptive field enlargement into two separate paths
- We propose a Spatial Path to preserve the spatial size of the original input image and encode rich spatial information
- With the Spatial Path and the Context Path, we propose BiSeNet for real-time semantic segmentation, as illustrated in Figure 2(a); see the sketch after this list
- With rich spatial details and a large receptive field, we achieve 68.4% mean IoU on the Cityscapes [9] test set at 105 FPS
- The Spatial Path is designed to preserve the spatial information from original images
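The two-path design above can be summarized in a short PyTorch-style sketch. This is a minimal illustration under stated assumptions, not the authors' released code: the `backbone` argument stands in for the lightweight Xception39 network, and the channel counts, the plain concatenation fusion (the paper uses a Feature Fusion Module), and the class count are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, stride=2):
    # 3x3 convolution -> batch norm -> ReLU, halving resolution when stride=2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SpatialPath(nn.Module):
    # Three stride-2 conv blocks: output is 1/8 of the input resolution
    # and keeps rich low-level spatial detail.
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(3, 64),     # 1/2
            conv_bn_relu(64, 128),   # 1/4
            conv_bn_relu(128, 256),  # 1/8
        )

    def forward(self, x):
        return self.layers(x)

class BiSeNetSketch(nn.Module):
    # Spatial Path preserves resolution; the Context Path down-samples
    # quickly through a lightweight backbone and adds a global-average-pooled
    # context vector; the two streams are fused at 1/8 resolution.
    def __init__(self, backbone, ctx_channels, num_classes=19):
        super().__init__()
        self.spatial_path = SpatialPath()
        self.backbone = backbone                 # placeholder, e.g. Xception39
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Conv2d(256 + ctx_channels, num_classes, 1)

    def forward(self, x):
        sp = self.spatial_path(x)                # (B, 256, H/8, W/8)
        ctx = self.backbone(x)                   # assumed (B, C, H/32, W/32)
        ctx = ctx + self.global_pool(ctx)        # broadcast global context [21]
        ctx = F.interpolate(ctx, size=sp.shape[2:], mode="bilinear",
                            align_corners=False)
        out = self.classifier(torch.cat([sp, ctx], dim=1))
        return F.interpolate(out, size=x.shape[2:], mode="bilinear",
                             align_corners=False)
```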
Methods
- Table 1 reports FLOPS, parameter counts, and mean IoU for FCN-32s baselines built on the Xception39 (185.5M FLOPS) and Res18 backbones.
- Baseline: The authors use the Xception39 network, pretrained on the ImageNet dataset [28], as the backbone of the Context Path.
- The authors evaluate the performance of the base model as the baseline, as shown in Table 1.
- The authors use a lightweight model, Xception39, as the backbone of the Context Path to down-sample quickly.
- The authors use the U-shape-8s structure, which improves the performance from 60.79% to 66.01%, as shown in Table 2; a rough sketch of this decoder follows the list.
- The authors do not adopt multi-scale testing.
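As a rough illustration of the U-shape-8s idea referenced above (an assumption-laden sketch, not the paper's exact code): the deepest features are progressively upsampled and merged with earlier-stage features until the 1/8-resolution stage. The stage channel widths and the 1×1 projection layers are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class UShape8s(nn.Module):
    # Fuses backbone features taken at strides 32, 16, and 8 back down
    # to 1/8 resolution ("8s"); channel widths are illustrative only.
    def __init__(self, ch32=256, ch16=128, ch8=64, num_classes=19):
        super().__init__()
        self.proj16 = nn.Conv2d(ch16, ch32, 1)   # align channels for fusion
        self.proj8 = nn.Conv2d(ch8, ch32, 1)
        self.classifier = nn.Conv2d(ch32, num_classes, 1)

    def forward(self, f8, f16, f32):
        x = F.interpolate(f32, size=f16.shape[2:], mode="bilinear",
                          align_corners=False)
        x = x + self.proj16(f16)                 # merge the 1/16 stage
        x = F.interpolate(x, size=f8.shape[2:], mode="bilinear",
                          align_corners=False)
        x = x + self.proj8(f8)                   # merge the 1/8 stage
        return self.classifier(x)
```

A U-shape-4s variant, as compared in Table 2, would continue one more stage to 1/4 resolution, trading speed for accuracy.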
Results
- The authors adapt a modified Xception model [8], Xception39, to the real-time semantic segmentation task.
- The authors' implementation code will be made publicly available.
- The authors evaluate the proposed BiSeNet on the Cityscapes [9], CamVid [2] and COCO-Stuff [3] benchmarks.
- The authors first introduce the datasets and the implementation protocol.
- The authors describe the speed strategy in comparison with other methods in detail.
- The authors evaluate all performance results on the Cityscapes validation set.
- The authors report the accuracy and speed results on the Cityscapes, CamVid and COCO-Stuff benchmarks.
Conclusion
- Bilateral Segmentation Network (BiSeNet) is proposed in this paper to improve the speed and accuracy of real-time semantic segmentation simultaneously.
- The authors' proposed BiSeNet contains two paths: Spatial Path (SP) and Context Path (CP).
- The Context Path utilizes a lightweight model and global average pooling [6, 21, 40] to obtain a sizeable receptive field rapidly; a sketch of the associated Attention Refinement Module follows this list.
- With rich spatial details and a large receptive field, the authors achieve 68.4% mean IoU on the Cityscapes [9] test set at 105 FPS.
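The Attention Refinement Module (ARM) listed in Table 3 refines Context Path features with a channel re-weighting step: a global-average-pooled descriptor gates the feature map. The following is a minimal sketch of that description, with the channel size as a placeholder; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AttentionRefinementModule(nn.Module):
    # Global pool -> 1x1 conv -> batch norm -> sigmoid, then scale the
    # input feature map channel-wise.
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        w = torch.sigmoid(self.bn(self.conv(self.pool(x))))  # (B, C, 1, 1)
        return x * w  # re-weight Context Path features per channel
```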
Tables
- Table1: Accuracy and parameter analysis of our baseline models, Xception39 and Res18, on the Cityscapes validation dataset. Here we use FCN-32s as the base structure. FLOPS are estimated for a 3 × 640 × 360 input
- Table2: Speed analysis of the U-shape-8s and the U-shape-4s on one NVIDIA Titan XP card. Image size is W×H
- Table3: Detailed performance comparison of each component in our proposed BiSeNet. CP: Context Path; SP: Spatial Path; GP: global average pooling; ARM: Attention Refinement Module; FFM: Feature Fusion Module
- Table5: Speed comparison of our method against other state-of-the-art methods. Image size is W×H. Ours1 and Ours2 are the BiSeNet based on Xception39 and Res18, respectively
- Table6: Accuracy and speed comparison of our method against other state-of-the-art methods on the Cityscapes test dataset. We train and evaluate on NVIDIA Titan XP with 2048×1024 resolution input. “-” indicates that the methods didn’t report the speed corresponding to the accuracy
- Table7: Accuracy comparison of our method against other state-of-the-art methods on Cityscapes test dataset. “-” indicates that the methods didn’t give the corresponding result
- Table8: Accuracy results on the CamVid test dataset. Ours1 and Ours2 indicate the models based on the Xception39 and Res18 networks
- Table9: Accuracy result on COCO-Stuff validation dataset
Related work
- Recently, many approaches based on FCN [22] have achieved state-of-the-art performance on various benchmarks of the semantic segmentation task. Most of these methods are designed either to encode more spatial information or to enlarge the receptive field.
Spatial information: Convolutional neural networks (CNNs) [16] encode high-level semantic information with consecutive down-sampling operations. In the semantic segmentation task, however, the spatial information of the image is crucial to predicting detailed output. Many existing approaches strive to encode rich spatial information. DUC [32], PSPNet [40], DeepLab v2 [5], and DeepLab v3 [6] use dilated convolution to preserve the spatial size of the feature map. Global Convolution Network [26] utilizes a “large kernel” to enlarge the receptive field.
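For instance, a 3×3 convolution with dilation 2 covers the receptive field of a 5×5 kernel while, with padding equal to the dilation rate, keeping the feature map size unchanged; this is the mechanism those methods rely on. A generic PyTorch example (the tensor shape and channel count are arbitrary, not from any of the cited papers):

```python
import torch
import torch.nn as nn

# A dilated 3x3 convolution enlarges the receptive field; setting
# padding = dilation preserves the spatial size of the feature map.
x = torch.randn(1, 64, 90, 160)  # (B, C, H, W), arbitrary example
conv = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
print(conv(x).shape)             # torch.Size([1, 64, 90, 160])
```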
Funding
- This work was supported by the National Natural Science Foundation of China under Projects No. 61433007 and No. 61401170
References
- 1. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12), 2481–2495 (2017)
- 2. Brostow, G.J., Shotton, J., Fauqueur, J., Cipolla, R.: Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision. pp. 44–57 (2008)
- 3. Caesar, H., Uijlings, J., Ferrari, V.: COCO-Stuff: Thing and stuff classes in context. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
- 4. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICLR (2015)
- 5. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv (2016)
- 6. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv (2017)
- 7. Chen, L.C., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: Scale-aware semantic image segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
- 8. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. IEEE Conference on Computer Vision and Pattern Recognition (2017)
- 9. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
- 10. Ghiasi, G., Fowlkes, C.C.: Laplacian pyramid reconstruction and refinement for semantic segmentation. In: European Conference on Computer Vision (2016)
- 11. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: International Conference on Artificial Intelligence and Statistics. pp. 315–323 (2011)
- 12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
- 13. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv (2017)
- 14. Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv abs/1602.07360 (2016)
- 15. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. pp. 448–456 (2015)
- 16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Neural Information Processing Systems (2012)
- 17. Li, X., Liu, Z., Luo, P., Loy, C.C., Tang, X.: Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. IEEE Conference on Computer Vision and Pattern Recognition (2017)
- 18. Lin, G., Milan, A., Shen, C., Reid, I.: RefineNet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition (2017)
- 19. Lin, G., Shen, C., van den Hengel, A., Reid, I.: Efficient piecewise training of deep structured models for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
- 20. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. Springer (2014)
- 21. Liu, W., Rabinovich, A., Berg, A.C.: ParseNet: Looking wider to see better. ICLR (2016)
- 22. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)
- 23. Mnih, V., Heess, N., Graves, A., et al.: Recurrent models of visual attention. In: Neural Information Processing Systems (2014)
- 24. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: IEEE International Conference on Computer Vision (2015)
- 25. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: ENet: A deep neural network architecture for real-time semantic segmentation. arXiv (2016)
- 26. Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J.: Large kernel matters – improve semantic segmentation by global convolutional network. IEEE Conference on Computer Vision and Pattern Recognition (2017)
- 27. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (2015)
- 28. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
- 29. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR (2015)
- 30. Treml, M., Arjona-Medina, J., Unterthiner, T., Durgesh, R., Friedmann, F., Schuberth, P., Mayr, A., Heusel, M., Hofmarcher, M., Widrich, M., et al.: Speeding up semantic segmentation for autonomous driving. In: Neural Information Processing Systems Workshop (2016)
- 31. Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X.: Residual attention network for image classification. IEEE Conference on Computer Vision and Pattern Recognition (2017)
- 32. Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G.: Understanding convolution for semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition (2017)
- 33. Wu, Z., Shen, C., van den Hengel, A.: High-performance semantic segmentation using very deep fully convolutional networks. arXiv preprint arXiv:1604.04339 (2016)
- 34. Wu, Z., Shen, C., van den Hengel, A.: Real-time semantic image segmentation via spatial sparsity. arXiv (2017)
- 35. Xie, S., Tu, Z.: Holistically-nested edge detection. In: IEEE International Conference on Computer Vision (2015)
- 36. Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Learning a discriminative feature network for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
- 37. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. ICLR (2016)
- 38. Zhang, R., Tang, S., Zhang, Y., Li, J., Yan, S.: Scale-adaptive convolutions for scene parsing. In: IEEE International Conference on Computer Vision. pp. 2031–2039 (2017)
- 39. Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: ICNet for real-time semantic segmentation on high-resolution images. arXiv (2017)
- 40. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. IEEE Conference on Computer Vision and Pattern Recognition (2017)