SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12 (2017): 2481–2495
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network, followed by a pixel-wise classification layer. The architecture of the encoder network is topologi…
- Semantic segmentation is an important step towards understanding and inferring different objects and their arrangements observed in a scene.
- An ad hoc technique is used to upsample the deepest-layer feature map to match the input image dimensions by replicating features within a block, i.e. all pixels within a block (8 × 8 in the example) share the same features.
- This often results in predictions that appear blocky.
- Ablation studies to understand the effects of features can be performed using the decoder stack.
- Deep learning has recently seen huge success in handwritten digit recognition, speech, categorising whole images and detecting objects in images [37, 34], and there has been growing interest in semantic pixel-wise labelling problems [7, 14, 35]. These recent approaches have tried to directly adopt deep architectures designed for category prediction to pixel-wise labelling.
- The deepest-layer representations/feature maps are of a small resolution compared to the input image dimensions due to several pooling layers, e.g. if 2 × 2 non-overlapping max-pooling-subsampling layers are used three times, the resulting feature map is 1/8th of the input dimension.
- Deeper layers, each with pooling-subsampling, can be introduced, which increases the spatial context for pixel labelling.
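The resolution arithmetic and the blocky upsampling described above can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation; the 8 × 8 map and three pooling stages follow the example in the text:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max-pooling on a 2D map (H and W assumed even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# After three 2x2 pooling stages the map is 1/8th of the input dimension.
x = np.arange(64, dtype=float).reshape(8, 8)
f = x
for _ in range(3):
    f = max_pool_2x2(f)
print(f.shape)  # (1, 1) -- an 8x8 input reduced to a single "block" feature

# Ad hoc upsampling back to input size by replicating each feature within
# its 8x8 block: every pixel in the block shares one feature, which is
# why the resulting predictions look blocky.
up = np.repeat(np.repeat(f, 8, axis=0), 8, axis=1)
print(up.shape)         # (8, 8)
print(np.unique(up).size)  # 1
```

Replicating features within blocks is exactly the kind of upsampling the decoder stack is meant to replace with a learned mapping.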
- Methods compared on CamVid (Table 1): SfM+Appearance, Boosting, Dense Depth Maps, Structured Random Forests (not available), Neural Decision Forests (not available), Super Parsing, Boosting+Higher order, and SegNet - 4 layer.
- Classes reported on NYU v2 (Table 2): Bed, Objects, Chair, Furniture, Ceiling, Floor, Decoration, Sofa, Table, Wall, Window, Books, TV, along with the class average and global average for SegNet - layer 4.
- The authors' results show that SegNet achieves state-of-the-art performance even without the use of additional cues such as depth, video frames or post-processing with CRF models.
- The authors presented SegNet, a fully trainable deep architecture for joint feature learning and mapping an input image in a feed-forward manner to its pixel-wise semantic labels.
- A highlight of the proposed architecture is its ability to produce smooth segment labels when compared with local patch-based classifiers.
- This is due to deep layers of feature encoding that employ a large spatial context for pixel-wise labelling.
- To the best of the authors' knowledge, this is the first deep learning method to learn to map low-resolution encoder feature maps to semantic labels.
- Both the qualitative and numerical accuracy of SegNet for outdoor and indoor scenes is very competitive, even without the use of any CRF post-processing.
- The encoder-decoder architecture of SegNet can be trained in an unsupervised manner and can handle missing data in the input at test time.
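The mapping from low-resolution encoder feature maps back to full resolution can be illustrated with a toy NumPy sketch of max-pooling that records per-window argmax indices and an unpooling step that places each value back at its recorded location. This is an illustration of the encoder-decoder idea under assumed 2 × 2 non-overlapping windows, not the authors' code:

```python
import numpy as np

def pool_with_indices(x):
    """2x2 max-pool on a 2D map that also records the argmax position
    (a flat index 0..3) within each window."""
    h, w = x.shape
    blocks = (x.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)
               .reshape(h // 2, w // 2, 4))
    idx = blocks.argmax(axis=2)
    pooled = np.take_along_axis(blocks, idx[..., None], axis=2)[..., 0]
    return pooled, idx

def unpool_with_indices(pooled, idx):
    """Place each pooled value back at its recorded window position,
    zeros elsewhere: a sparse full-resolution map a decoder can densify."""
    h, w = pooled.shape
    blocks = np.zeros((h, w, 4))
    np.put_along_axis(blocks, idx[..., None], pooled[..., None], axis=2)
    return (blocks.reshape(h, w, 2, 2)
                  .transpose(0, 2, 1, 3)
                  .reshape(2 * h, 2 * w))

x = np.array([[1., 2.],
              [4., 3.]])
pooled, idx = pool_with_indices(x)
print(pooled)  # [[4.]]
print(unpool_with_indices(pooled, idx))
# the value 4 is restored at its original (row 1, col 0) location
```

Keeping only the pooled values and their indices preserves where the strongest activations occurred, so the upsampled map retains boundary locations instead of smearing features uniformly across each block.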
- Table1: Quantitative results on CamVid [<a class="ref-link" id="c1" href="#r1">1</a>]. We consider SegNet - 4 layer for comparisons with other methods. SegNet performs best on several challenging classes (cars, pedestrians, poles) while maintaining competitive accuracy on the remaining classes. The class average and global average are the highest even when compared to methods using structure from motion [<a class="ref-link" id="c2" href="#r2">2</a>], CRF [<a class="ref-link" id="c36" href="#r36">36</a>, <a class="ref-link" id="c20" href="#r20">20</a>], dense depth maps [<a class="ref-link" id="c43" href="#r43">43</a>], temporal cues [<a class="ref-link" id="c39" href="#r39">39</a>].
- Table2: Quantitative results on the NYU v2 [<a class="ref-link" id="c33" href="#r33">33</a>]. The SegNet performs better than the multi-scale convnet which uses the same inputs (and post-processing)
- Table3: Quantitative results on the KITTI dataset [<a class="ref-link" id="c9" href="#r9">9</a>, <a class="ref-link" id="c29" href="#r29">29</a>]. The SegNet performance is better globally and comparable among classes. The fence class resembles buildings and needs other cues such as temporal information used in [<a class="ref-link" id="c29" href="#r29">29</a>] for better accuracy
- G. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. PRL, 30(2):88–97, 2009.
- G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, Marseille, 2008.
- S. R. Bulo and P. Kontschieder. Neural decision forests for semantic image labelling. In CVPR, 2014.
- C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, pages 184–199. Springer, 2013.
- D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. arXiv preprint arXiv:1406.2283, 2014.
- C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale feature learning, purity trees, and optimal covers. In ICML, 2012.
- C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE PAMI, 35(8):1915–1929, 2013.
- C. Gatta, A. Romero, and J. van de Weijer. Unrolling loopy top-down semantic feedback in convolutional deep networks. In CVPR Workshop on Deep Vision, 2014.
- A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, pages 3354–3361, 2012.
- S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, pages 1–8. IEEE, 2009.
- D. Grangier, L. Bottou, and R. Collobert. Deep convolutional networks for scene parsing. In ICML Workshop on Deep Learning, 2009.
- S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, pages 564–571. IEEE, 2013.
- A. Hermans, G. Floros, and B. Leibe. Dense 3D semantic mapping of indoor scenes from RGB-D images. In ICRA, 2014.
- N. Höft, H. Schulz, and S. Behnke. Fast semantic segmentation of RGB-D scenes with GPU-accelerated deep neural networks. In C. Lutz and M. Thielscher, editors, KI 2014: Advances in Artificial Intelligence, volume 8736 of Lecture Notes in Computer Science, pages 80–85. Springer International Publishing, 2014.
- K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, pages 2146–2153, 2009.
- K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In NIPS, pages 1090–1098, 2010.
- V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
- P. Kontschieder, S. R. Bulo, H. Bischof, and M. Pelillo. Structured class-labels in random forests for semantic image labelling. In ICCV, pages 2190–2197. IEEE, 2011.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
- L. Ladicky, P. Sturgess, K. Alahari, C. Russell, and P. H. S. Torr. What, where and how many? Combining object detectors and CRFs. In ECCV, pages 424–437, 2010.
- Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for deep learning. In ICML, pages 265–272, 2011.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- S. Lyu and E. P. Simoncelli. Nonlinear image representation using divisive normalization. In CVPR, 2008.
- J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, pages 689–696, 2011.
- J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.
- P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, pages 82–90, 2014.
- M. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
- X. Ren, L. Bo, and D. Fox. RGB-(D) scene labeling: Features and algorithms. In CVPR, pages 2759–2766. IEEE, 2012.
- G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez, and A. Lopez. Vision-based offline-online perception paradigm for autonomous driving. In WACV, 2015.
- B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77(1-3):157–173, 2008.
- M. Schmidt. minFunc: unconstrained differentiable multivariate optimization in Matlab. http://www.di.ens.fr/~mschmidt/Software/minFunc.html, 2012.
- J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In CVPR, 2008.
- N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, pages 746–760. Springer, 2012.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- R. Socher, C. C. Lin, C. Manning, and A. Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In ICML, pages 129–136, 2011.
- P. Sturgess, K. Alahari, L. Ladicky, and P. H. S. Torr. Combining appearance and structure from motion features for road scene understanding. In BMVC, 2009.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
- C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- J. Tighe and S. Lazebnik. Superparsing. IJCV, 101(2):329–349, 2013.
- Y. Yang, Z. Li, L. Zhang, C. Murphy, J. Ver Hoeve, and H. Jiang. Local label descriptor for example based semantic image labeling. In ECCV, pages 361–375. Springer, 2012.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833, 2014.
- M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In CVPR, pages 2528–2535. IEEE, 2010.
- C. Zhang, L. Wang, and R. Yang. Semantic segmentation of urban scenes using dense depth maps. In ECCV, pages 708–721. Springer, 2010.