
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12 (2017): 2481-2495

Cited by 7354 | Views 696

Abstract

We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network.
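The architecture just described lends itself to a compact illustration. Below is a minimal PyTorch sketch, not the authors' exact configuration: one conv-pool encoder stage, a decoder stage that upsamples with the stored max-pooling indices, and a final 1 × 1 pixel-wise classification layer. All layer widths, kernel sizes, and the class count are assumptions for illustration.

```python
# Minimal encoder-decoder sketch in the spirit of SegNet (illustrative only).
import torch
import torch.nn as nn

class MiniSegNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=12):  # 12 classes: an assumption
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 64, 7, padding=3),
                                 nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 64, 7, padding=3),
                                 nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.classify = nn.Conv2d(64, num_classes, 1)  # pixel-wise classifier

    def forward(self, x):
        f = self.enc(x)
        f, idx = self.pool(f)    # encoder: downsample, remember where maxima were
        f = self.unpool(f, idx)  # decoder: upsample back to input resolution
        f = self.dec(f)
        return self.classify(f)  # per-pixel class scores

x = torch.randn(1, 3, 360, 480)   # a CamVid-sized input
print(MiniSegNet()(x).shape)      # torch.Size([1, 12, 360, 480])
```

A full model would stack several such encoder stages and mirror them with decoders; the key point is that the output retains the input's spatial resolution, so every pixel receives a label.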

Introduction
  • Semantic segmentation is an important step towards understanding and inferring different objects and their arrangements observed in a scene.
  • An ad hoc technique is used to upsample the deepest layer feature map to match the input image dimensions by replicating features within a block, i.e., all pixels within a block (8 × 8 in the example) have the same features (see the sketch after this list).
  • This often results in predictions that appear blocky.
  • Ablation studies to understand the effects of features, such as in [41], can be performed using the decoder stack.
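A small NumPy sketch of the two points above, with assumed shapes: three 2 × 2 non-overlapping max-poolings reduce a 64 × 64 feature map to 8 × 8 (1/8th per dimension), and block-replication upsampling restores the original size while leaving every pixel in an 8 × 8 block with identical features, which is what makes the predictions blocky.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling over a 2-D feature map."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feat = np.random.rand(64, 64)   # a single-channel feature map (example size)
for _ in range(3):              # three pooling-subsampling stages
    feat = max_pool_2x2(feat)
print(feat.shape)               # (8, 8): 1/8th of the input dimension

# Ad hoc upsampling: replicate each deep feature over an 8x8 block.
upsampled = np.kron(feat, np.ones((8, 8)))
print(upsampled.shape)          # (64, 64), but piecewise-constant per block
```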
Highlights
  • Semantic segmentation is an important step towards understanding and inferring different objects and their arrangements observed in a scene
  • Deep learning has seen huge success lately in handwritten digit recognition, speech, categorising whole images and detecting objects in images [37, 34], and there has been growing interest in semantic pixel-wise labelling problems [7, 14, 35]. These recent approaches have tried to directly adopt deep architectures designed for category prediction to pixel-wise labelling.
  • The deepest layer representations/feature maps are of a small resolution compared to the input image dimensions due to several pooling layers; e.g., if 2 × 2 non-overlapping max-pooling-subsampling layers are used three times, the resulting feature map is 1/8th of the input dimension.
  • Deeper layers, each with pooling-subsampling, can be introduced, which increases the spatial context for pixel labelling (quantified in the sketch after this list).
  • We presented SegNet, a fully trainable deep architecture for joint feature learning and mapping an input image in a feed-forward manner to its pixel-wise semantic labels
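To make the spatial-context claim above concrete, the sketch below computes the receptive field of a single deep-layer unit after n stacked conv + pool stages, using the standard receptive-field recurrence. The 7 × 7 convolution and 2 × 2 pooling sizes are illustrative assumptions.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each kernel widens the field by the cumulative stride
        jump *= s             # strides compound as we go deeper
    return rf

stage = [(7, 1), (2, 2)]      # one 7x7 conv followed by 2x2 pooling-subsampling
for n in range(1, 5):
    print(n, receptive_field(stage * n))
# 1 -> 8, 2 -> 22, 3 -> 50, 4 -> 106: context grows rapidly with depth
```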
Methods
  • Baseline methods extracted from the comparison tables: SfM+Appearance [2], Boosting [36], Dense Depth Maps [43], Structured Random Forests [18] (results not available), Neural Decision Forests [3] (results not available), Super Parsing [39], Boosting+Higher order [36], and SegNet - 4 layer.
  • The indoor comparison reports per-class accuracy for Bed, Objects, Chair, Furniture, Ceiling, Floor, Decoration, Sofa, Table, Wall, Window, Books, and TV, plus class average and global average.
Results
  • The authors' results show that SegNet achieves state-of-the-art performance even without use of additional cues such as depth, video frames or post-processing with CRF models.
Conclusion
  • The authors presented SegNet, a fully trainable deep architecture for joint feature learning and mapping an input image in a feed-forward manner to its pixel-wise semantic labels.
  • A highlight of the proposed architecture is its ability to produce smooth segment labels when compared with local patch-based classifiers.
  • This is due to deep layers of feature encoding that employ a large spatial context for pixel-wise labelling.
  • To the best of the authors' knowledge, this is the first deep learning method to learn to map low-resolution encoder feature maps to semantic labels.
  • Both the qualitative and numerical accuracy of SegNet for outdoor and indoor scenes are very competitive, even without the use of any CRF post-processing.
  • The encoder-decoder architecture of SegNet can also be trained in an unsupervised manner and can handle missing data in the input at test time.
Tables
  • Table 1: Quantitative results on CamVid [1]. We consider SegNet - 4 layer for comparisons with other methods. SegNet performs best on several challenging classes (cars, pedestrians, poles) while maintaining competitive accuracy on the remaining classes. The class average and global average (both metrics are illustrated in the sketch below) are the highest even when compared to methods using structure from motion [2], CRFs [36, 20], dense depth maps [43], and temporal cues [39].
  • Table 2: Quantitative results on NYU v2 [33]. SegNet performs better than the multi-scale convnet, which uses the same inputs (and post-processing).
  • Table 3: Quantitative results on the KITTI dataset [9, 29]. SegNet's performance is better globally and comparable among classes. The fence class resembles buildings and needs other cues, such as the temporal information used in [29], for better accuracy.
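The tables report two standard metrics: global average, the fraction of all pixels labelled correctly, and class average, the mean of the per-class accuracies. Assuming these conventional definitions, both can be computed from a confusion matrix:

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    global_avg = np.trace(conf) / conf.sum()       # overall pixel accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)   # recall for each class
    return global_avg, per_class.mean()

conf = np.array([[90, 10,  0],   # toy 3-class confusion matrix
                 [ 5, 80, 15],
                 [ 0, 30, 70]])
g, c = segmentation_metrics(conf)
print(f"global avg = {g:.3f}, class avg = {c:.3f}")  # 0.800, 0.800
```

Class average weights rare classes (e.g., pedestrians, poles) equally with common ones, which is why it is the harder of the two metrics on road scenes.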
References
  • G. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. PRL, 30(2):88–97, 2009.
  • G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, Marseille, 2008.
  • S. R. Bulo and P. Kontschieder. Neural decision forests for semantic image labelling. In CVPR, 2014.
  • C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, pages 184–199. Springer, 2014.
  • D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. arXiv preprint arXiv:1406.2283, 2014.
  • C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale feature learning, purity trees, and optimal covers. In ICML, 2012.
  • C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE PAMI, 35(8):1915–1929, 2013.
  • C. Gatta, A. Romero, and J. van de Weijer. Unrolling loopy top-down semantic feedback in convolutional deep networks. In CVPR Workshop on Deep Vision, 2014.
  • A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, pages 3354–3361, 2012.
  • S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, pages 1–8. IEEE, 2009.
  • D. Grangier, L. Bottou, and R. Collobert. Deep convolutional networks for scene parsing. In ICML Workshop on Deep Learning, 2009.
  • S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, pages 564–571. IEEE, 2013.
  • A. Hermans, G. Floros, and B. Leibe. Dense 3D semantic mapping of indoor scenes from RGB-D images. In ICRA, 2014.
  • N. Höft, H. Schulz, and S. Behnke. Fast semantic segmentation of RGB-D scenes with GPU-accelerated deep neural networks. In KI 2014: Advances in Artificial Intelligence, volume 8736 of Lecture Notes in Computer Science, pages 80–85. Springer, 2014.
  • K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, pages 2146–2153, 2009.
  • K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In NIPS, pages 1090–1098, 2010.
  • P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
  • P. Kontschieder, S. R. Bulo, H. Bischof, and M. Pelillo. Structured class-labels in random forests for semantic image labelling. In ICCV, pages 2190–2197. IEEE, 2011.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
  • L. Ladicky, P. Sturgess, K. Alahari, C. Russell, and P. H. S. Torr. What, where and how many? Combining object detectors and CRFs. In ECCV, pages 424–437, 2010.
  • Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for deep learning. In ICML, pages 265–272, 2011.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • S. Lyu and E. P. Simoncelli. Nonlinear image representation using divisive normalization. In CVPR, 2008.
  • J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, pages 689–696, 2011.
  • J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.
  • P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, pages 82–90, 2014.
  • M. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
  • X. Ren, L. Bo, and D. Fox. RGB-(D) scene labeling: Features and algorithms. In CVPR, pages 2759–2766. IEEE, 2012.
  • G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez, and A. Lopez. Vision-based offline-online perception paradigm for autonomous driving. In WACV, 2015.
  • B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77(1-3):157–173, 2008.
  • M. Schmidt. minFunc: unconstrained differentiable multivariate optimization in Matlab. http://www.di.ens.fr/~mschmidt/Software/minFunc.html, 2012.
  • J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In CVPR, 2008.
  • N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, pages 746–760. Springer, 2012.
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • R. Socher, C. C. Lin, C. Manning, and A. Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In ICML, pages 129–136, 2011.
  • P. Sturgess, K. Alahari, L. Ladicky, and P. H. S. Torr. Combining appearance and structure from motion features for road scene understanding. In BMVC, 2009.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • J. Tighe and S. Lazebnik. Superparsing. IJCV, 101(2):329–349, 2013.
  • Y. Yang, Z. Li, L. Zhang, C. Murphy, J. Ver Hoeve, and H. Jiang. Local label descriptor for example based semantic image labeling. In ECCV, pages 361–375. Springer, 2012.
  • M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833, 2014.
  • M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In CVPR, pages 2528–2535. IEEE, 2010.
  • C. Zhang, L. Wang, and R. Yang. Semantic segmentation of urban scenes using dense depth maps. In ECCV, pages 708–721. Springer, 2010.