AI-extracted summary of this paper
The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation.
IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1175–1183
Abstract
State-of-the-art approaches for semantic image segmentation are built on Convolutional Neural Networks (CNNs). The typical segmentation architecture is composed of (a) a downsampling path responsible for extracting coarse semantic features, followed by (b) an upsampling path trained to recover the input image resolution at the output of the model...
Introduction
- Convolutional Neural Networks (CNNs) are driving major advances in many computer vision tasks, such as image classification [29], object detection [25, 24] and semantic image segmentation [20].
- Fully Convolutional Networks (FCNs) [20, 27] were introduced in the literature as a natural extension of CNNs to tackle per-pixel prediction problems such as semantic image segmentation.
- FCNs add upsampling layers to standard CNNs to recover the spatial resolution of the input at the output layer.
- In order to compensate for the resolution loss induced by pooling layers, FCNs introduce skip connections between their downsampling and upsampling paths.
- Skip connections help the upsampling path recover fine-grained information from the downsampling layers (a minimal sketch of this pattern follows this list).
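As an illustration of how the downsampling path, upsampling path, and skip connections fit together, here is a minimal PyTorch-style sketch. It is a toy encoder-decoder, not the authors' architecture; all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Toy fully convolutional network with a single skip connection (illustrative only)."""
    def __init__(self, in_ch=3, n_classes=11):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                         # downsampling path
        self.mid = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)   # upsampling path
        self.out = nn.Conv2d(64, n_classes, 1)              # 32 + 32 channels after the skip

    def forward(self, x):
        skip = self.enc(x)                    # fine-grained, high-resolution features
        x = self.mid(self.down(skip))         # coarse semantic features at half resolution
        x = self.up(x)                        # recover the input spatial resolution
        x = torch.cat([x, skip], dim=1)       # skip connection: reuse fine-grained detail
        return self.out(x)                    # per-pixel class scores

logits = TinyFCN()(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 11, 64, 64]) -- one score map per class
```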
Highlights
- Convolutional Neural Networks (CNNs) are driving major advances in many computer vision tasks, such as image classification [29], object detection [25, 24] and semantic image segmentation [20]
- Fully Convolutional Networks (FCNs) [20, 27] were introduced in the literature as a natural extension of Convolutional Neural Networks to tackle per-pixel prediction problems such as semantic image segmentation
- Our fully convolutional DenseNet implicitly inherits the advantages of DenseNets, namely: (1) parameter efficiency, as our network has substantially fewer parameters than other segmentation architectures published for the CamVid dataset; (2) implicit deep supervision, as we tried adding extra levels of supervision to different layers of our network without any noticeable change in performance; and (3) feature reuse, as all layers can access their preceding layers thanks to the iterative concatenation of feature maps in a dense block and to the skip connections that enforce connectivity between the downsampling and upsampling paths
- We have extended DenseNets and made them fully convolutional to tackle the problem of semantic image segmentation
- The main idea behind DenseNets is captured in dense blocks that perform iterative concatenation of feature maps (see the sketch after this list)
- We designed an upsampling path mitigating the linear growth of feature maps that would appear in a naive extension of DenseNets
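A minimal sketch of a dense block, and of how its outputs can be split so that the upsampling path avoids a linear growth of feature maps (PyTorch-style; the layer count and growth rate below are hypothetical, not the paper's exact settings):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense block sketch: each layer sees the concatenation of all preceding
    feature maps and contributes `growth_rate` new maps."""
    def __init__(self, in_ch, n_layers=4, growth_rate=16):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1),
            )
            for i in range(n_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # iterative concatenation
        new_maps = torch.cat(features[1:], dim=1)  # only the maps produced inside the block
        full = torch.cat(features, dim=1)          # block input + new maps
        return full, new_maps

block = DenseBlock(in_ch=32)
full, new = block(torch.randn(1, 32, 16, 16))
print(full.shape, new.shape)  # (1, 96, 16, 16) and (1, 64, 16, 16)
```

In the downsampling path the full concatenation would be carried forward, whereas in the upsampling path only the newly produced maps would be passed to the transition up; this is what keeps the number of feature maps from growing linearly with depth.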
Methods
- The authors evaluate the method on two urban scene understanding datasets: CamVid [2], and Gatech [22].
- The authors trained the models from scratch without using any extra data or post-processing module.
- The authors report the results using the Intersection over Union (IoU) metric and the global accuracy.
- For a given class c, predictions o_i and targets y_i, the IoU is defined by
  $$\mathrm{IoU}(c) = \frac{\sum_i \left(o_i \wedge y_i\right)}{\sum_i \left(o_i \vee y_i\right)} \quad (4)$$
  where o_i and y_i are binary indicators of class c for pixel i, \wedge denotes a logical and, and \vee a logical or.
- The authors compute IoU by summing over all the pixels i of the dataset.
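A minimal NumPy sketch of this per-class IoU (the label arrays below are hypothetical; in the paper the sums accumulate over every pixel of the dataset):

```python
import numpy as np

def class_iou(preds, targets, c):
    """IoU for class c: |prediction AND target| / |prediction OR target| over all pixels."""
    pred_c = (preds == c)
    target_c = (targets == c)
    intersection = np.logical_and(pred_c, target_c).sum()
    union = np.logical_or(pred_c, target_c).sum()
    return intersection / union if union > 0 else float("nan")

preds   = np.array([[0, 1, 1], [2, 1, 0]])   # predicted labels
targets = np.array([[0, 1, 2], [2, 2, 0]])   # ground-truth labels
print(class_iou(preds, targets, 1))          # 1 / 3 ≈ 0.333
```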
Results
- The authors achieve state-of-the-art results on urban scene benchmark datasets such as CamVid and Gatech, without any further post-processing module or pretraining.
- The authors show that such a network can outperform current state-of-the-art results on standard benchmarks for urban scene understanding without using pretrained parameters or any further post-processing.
- The authors' model is able to outperform such a state-of-the-art model without requiring any temporal smoothing.
Conclusion
- The authors' fully convolutional DenseNet implicitly inherits the advantages of DenseNets, namely: (1) parameter efficiency, as the network has substantially fewer parameters than other segmentation architectures published for the CamVid dataset; (2) implicit deep supervision, as the authors tried adding extra levels of supervision to different layers of the network without any noticeable change in performance; and (3) feature reuse, as all layers can access their preceding layers thanks to the iterative concatenation of feature maps in a dense block and to the skip connections that enforce connectivity between the downsampling and upsampling paths.
- Recent evidence suggests that ResNets behave like ensembles of relatively shallow networks [35]: "Residual networks avoid the vanishing gradient problem by introducing short paths which can carry gradient throughout the extent of very deep networks."
- Thanks to their smart connectivity patterns, FC-DenseNets might represent an ensemble of variable-depth networks.
- This ensemble behavior would be very interesting for semantic segmentation models, where the ensemble of different paths throughout the model would capture the multi-scale appearance of objects in urban scenes.
- In this paper, the authors have extended DenseNets and made them fully convolutional to tackle the problem of semantic image segmentation.
Tables
- Table 1: Building blocks of fully convolutional DenseNets. From left to right: the layer used in the model, Transition Down (TD) and Transition Up (TU). See text for details, and the code sketch after this list
- Table 2: Architecture details of the FC-DenseNet103 model used in the experiments. This model is built from 103 convolutional layers. The following notation is used in the table: DB stands for Dense Block, TD for Transition Down, TU for Transition Up, BN for Batch Normalization, m for the total number of feature maps at the end of a block, and c for the number of classes
- Table 3: Results on the CamVid dataset. Note that we trained our own pretrained FCN8 model
- Table 4: Results on the Gatech dataset
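For reference, a PyTorch-style sketch of the three building blocks summarized in Table 1 (the basic layer, Transition Down, and Transition Up). The composition assumed here (BN, ReLU, 3x3 convolution and dropout 0.2 for the layer; a 1x1 convolution plus 2x2 max pooling for TD; a stride-2 transposed convolution for TU) follows the paper's description but should be checked against Table 1:

```python
import torch.nn as nn

def layer(in_ch, growth_rate):
    """Basic layer: BN -> ReLU -> 3x3 convolution -> dropout (p = 0.2)."""
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(),
        nn.Conv2d(in_ch, growth_rate, 3, padding=1), nn.Dropout2d(0.2))

def transition_down(in_ch):
    """Transition Down (TD): BN -> ReLU -> 1x1 convolution -> dropout -> 2x2 max pooling."""
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(),
        nn.Conv2d(in_ch, in_ch, 1), nn.Dropout2d(0.2), nn.MaxPool2d(2))

def transition_up(in_ch, out_ch):
    """Transition Up (TU): 3x3 transposed convolution with stride 2."""
    return nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1)
```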
Related work
- Recent advances in semantic segmentation have been devoted to improve architectural designs by (1) improving the upsampling path and increasing the connectivity within FCNs [27, 1, 21, 8]; (2) introducing modules to account for broader context understanding [36, 5, 37]; and/or (3) endowing FCN architectures with the ability to provide structured outputs [16, 5, 38].
First, different alternatives have been proposed in the literature to address the resolution recovery in FCN’s upsampling path; from simple bilinear interpolation [10, 20, 1] to more sophisticated operators such as unpooling [1, 21] or transposed convolutions [20]. Skip connections from the downsampling to the upsampling path have also been adopted to allow for a finer information recovery [27]. More recently, [8] presented a thorough analysis on the combination of identity mapping [11] and long skip connections [27] for semantic segmentation.
Second, approaches that introduce larger context to semantic segmentation networks include [10, 36, 5, 37]. In [10], an unsupervised global image descriptor is computed and added to the feature maps for each pixel. In [36], Recurrent Neural Networks (RNNs) are used to retrieve contextual information by sweeping the image horizontally and vertically in both directions. In [5], dilated convolutions are introduced as an alternative to late CNN pooling layers to capture larger context without reducing the image resolution. Following the same spirit, [37] proposes to provide FCNs with a context module built as a stack of dilated convolutional layers to enlarge the field of view of the network.
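As a small illustration of the dilated-convolution idea mentioned above: a 3x3 convolution with dilation 2 covers a 5x5 neighborhood with the same number of parameters, enlarging the field of view without pooling. This is a generic PyTorch snippet, not the specific modules of [5] or [37]:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)              # 3x3 receptive field
dilated  = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # 5x5 receptive field
print(standard(x).shape, dilated(x).shape)  # both preserve the 32x32 resolution
```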
Funding
- We acknowledge the support of the following agencies for research funding and computing support: Imagia Inc., Spanish projects TRA2014-57088-C2-1-R & 2014-SGR-1506, and the TECNIOspring-FP7-ACCI grant.
References
- V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, abs/1511.00561, 2015.
- G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In European Conference on Computer Vision (ECCV), 2008.
- L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. CoRR, abs/1607.07295, 2016.
- L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR, abs/1606.00915, 2016.
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In International Conference of Learning Representations (ICLR), 2015.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- S. Dieleman, J. Schlüter, C. Raffel, E. Olson, et al. Lasagne: First release, Aug. 2015.
- M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal. The importance of skip connections in biomedical image segmentation. CoRR, abs/1608.04117, 2016.
- A. R. F. Visin. Dataset loaders: a python library to load and preprocess datasets. https://github.com/fvisin/dataset_loaders, 2017.
- C. Gatta, A. Romero, and J. van de Weijer. Unrolling loopy top-down semantic feedback in convolutional deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) workshop, 2014.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.
- G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
- S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
- A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoderdecoder architectures for scene understanding. CoRR, abs/1511.02680, 2015.
- P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems (NIPS), 2011.
- A. Kundu, V. Vineet, and V. Koltun. Feature space optimization for semantic video segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- C. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu. Deeplysupervised nets. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
- J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. arXiv preprint arXiv:1505.04366, 2015.
- S. H. Raza, M. Grundmann, and I. Essa. Geometric context from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
- J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
- S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
- S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In European Conference on Computer Vision (ECCV), 2016.
- O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
- G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
- Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv eprints, abs/1605.02688, May 2016.
- T. Tieleman and G. Hinton. rmsprop adaptive learning. In COURSERA: Neural Networks for Machine Learning, 2012.
- D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Deep end2end voxel2voxel prediction. CoRR, abs/1511.06681, 2015.
- A. Veit, M. J. Wilber, and S. J. Belongie. Residual networks are exponential ensembles of relatively shallow networks. CoRR, abs/1605.06431, 2016.
- F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y. Bengio, M. Matteucci, and A. Courville. Reseg: A recurrent neural network-based model for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) workshop, 2016.
- F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference of Learning Representations (ICLR), 2016.
- S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015.