The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes

Germán Ros
Laura Sellart
Joanna Materzynska
David Vázquez
Antonio M. López

CVPR, pp. 3234-3243, 2016.


Abstract:

Vision-based semantic segmentation in urban scenarios is a key functionality for autonomous driving. Recent revolutionary results of deep convolutional neural networks (DCNNs) foreshadow the advent of reliable classifiers to perform such visual tasks. However, DCNNs require learning of many parameters from raw images; thus, having a sufficient amount of diverse images with class annotations is needed. [...]

Introduction
  • Autonomous driving (AD) will be one of the most revolutionary technologies of the near future in terms of its impact on the lives of citizens of industrialized countries [43].
  • The computer vision community, among others, is contributing to the development of ADAS and AD due to the rapidly increasing performance of vision-based tools such as object detection, recognition of traffic signs, road segmentation, etc.
  • Until the end of the first decade of this century, the design of classifiers for recognizing visual phenomena was viewed as a two-fold problem.
  • Enormous effort was invested in the research of discriminative visual descriptors to be fed as features to classifiers; as a result, descriptors such as Haar wavelets, SIFT, LBP, or HOG were born and their use became widespread.
  • Many different machine learning methods were developed, with discriminative algorithms such as SVM, AdaBoost, or Random Forests usually reporting the best classification accuracy due to their inherent focus on searching for reliable class boundaries in feature space.
Highlights
  • Autonomous driving (AD) will be one of the most revolutionary technologies of the near future in terms of its impact on the lives of citizens of industrialized countries [43]
  • In this paper we address the question of how useful realistic synthetic images of virtual-world urban scenarios can be for the task of semantic segmentation, in particular when using a deep convolutional neural network (DCNN) paradigm
  • We explore the benefits of using SYNTHetic collection of Imagery and Annotations (SYNTHIA) in the context of semantic segmentation of urban environments with DCNNs
  • We present the evaluation of the DCNNs for semantic segmentation described in Section 4, training and evaluating on several state-of-the-art datasets of driving scenes
  • We presented SYNTHIA, a new dataset for semantic segmentation of driving scenes with more than 213,400 synthetic images, including both random snapshots and video sequences in a virtual city
  • Our experiments showed that training on SYNTHIA alone already produces good segmentations on real datasets, and that combining it with real data dramatically boosts accuracy (a minimal training sketch follows this list)
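The combined-training setting mentioned above (e.g. KITTI (T) + SYNTHIA-Rand (A)) amounts to simply pooling the real and synthetic training sets before training a per-pixel classifier. Below is a minimal, hypothetical PyTorch sketch of that idea; random tensors stand in for the actual images and label maps, and a 1x1 convolution stands in for T-Net/FCN, since the paper's training code is not reproduced on this page.

```python
# Minimal sketch of the "real + synthetic" training setting: the two training
# sets are pooled and fed to a per-pixel classifier over 11 classes.
# Random tensors stand in for the actual images/label maps.
import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

NUM_CLASSES, H, W = 11, 120, 160

def fake_seg_dataset(n):
    """Stand-in for a Dataset yielding (image, label_map) pairs."""
    images = torch.rand(n, 3, H, W)                     # RGB images
    labels = torch.randint(0, NUM_CLASSES, (n, H, W))   # per-pixel class ids
    return TensorDataset(images, labels)

real_train = fake_seg_dataset(20)       # stand-in for KITTI (T)
synthetic_train = fake_seg_dataset(80)  # stand-in for SYNTHIA-Rand (A)

# Extending the real training set with synthetic images = pooling the two sets.
train_set = ConcatDataset([real_train, synthetic_train])
loader = DataLoader(train_set, batch_size=4, shuffle=True)

# Any per-pixel classifier would do; a single 1x1 conv keeps the sketch runnable.
model = torch.nn.Conv2d(3, NUM_CLASSES, kernel_size=1)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for images, labels in loader:
    optimizer.zero_grad()
    logits = model(images)            # (N, 11, H, W) per-pixel class scores
    loss = criterion(logits, labels)  # labels: (N, H, W) integer class ids
    loss.backward()
    optimizer.step()
```

Replacing the stand-in datasets with real SYNTHIA and KITTI loaders, and the 1x1 convolution with an actual segmentation network, recovers the kind of "real + synthetic" experiments summarised in Table 3.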
Methods
  • Results of training a T-Net [30] and an FCN [20] on the full SYNTHIA-Rand set (A) and evaluating on the KITTI validation split (V): class-wise pixel accuracies (%), followed by per-class and global accuracy (see the sketch after this list for how these summaries are computed).
  • T-Net [30]: sky 73, building 78, road 92, sidewalk 27, fence 0, vegetation 10, pole 0, car 64, sign 0, pedestrian 72, cyclist 14; per-class 39.0, global 61.9.
  • FCN [20]: sky 56, building 65, road 59, sidewalk 26, fence 17, vegetation 65, pole 32, car 52, sign 42, pedestrian 73, cyclist 40; per-class 47.1, global 62.7.
  • Table 3 compares training on real images only, e.g. CamVid (T) or U-LabelMe (T), against extending the training sets with synthetic data, e.g. KITTI (T) + SYNTHIA-Rand (A), for T-Net [30] and FCN [20]; see Table 3 for the corresponding results.
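The per-class and global numbers above are read here as the usual pixel-accuracy summaries: global accuracy is the fraction of all labelled pixels predicted correctly, and per-class accuracy is the mean of the class-wise recalls. The paper's own evaluation code is not shown on this page, so the NumPy sketch below only illustrates these standard definitions over an 11-class confusion matrix.

```python
# Minimal sketch: per-class and global pixel accuracy from a confusion matrix.
# conf[i, j] counts pixels whose ground-truth class is i and predicted class is j.
import numpy as np

def confusion_matrix(gt, pred, num_classes=11):
    """Accumulate a num_classes x num_classes confusion matrix from label maps."""
    mask = (gt >= 0) & (gt < num_classes)  # skip unlabelled / void pixels
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def accuracies(conf):
    """Return (per_class_accuracy, global_accuracy) in percent."""
    correct = np.diag(conf).astype(float)
    totals = conf.sum(axis=1).astype(float)
    valid = totals > 0                      # classes absent from the ground truth are skipped
    per_class = 100.0 * np.mean(correct[valid] / totals[valid])
    global_acc = 100.0 * correct.sum() / conf.sum()
    return per_class, global_acc

# Toy usage with random label maps standing in for ground truth and predictions:
rng = np.random.default_rng(0)
gt = rng.integers(0, 11, size=(4, 120, 160))
pred = rng.integers(0, 11, size=(4, 120, 160))
per_class, global_acc = accuracies(confusion_matrix(gt, pred))
print(f"per-class: {per_class:.1f}%, global: {global_acc:.1f}%")
```

With a confusion matrix accumulated over a whole validation split, these two summaries correspond to the per-class and global columns reported in Tables 2 and 3.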
Conclusion
  • The authors presented SYNTHIA, a new dataset for semantic segmentation of driving scenes with more than 213,400 synthetic images, including both random snapshots and video sequences in a virtual city.
  • SYNTHIA was used to train DCNNs for the semantic segmentation of 11 common classes in driving scenes.
  • The authors' experiments showed that training on SYNTHIA alone already produces good segmentations on real datasets, and that combining it with real data dramatically boosts accuracy.
  • The authors believe that SYNTHIA will help to boost semantic segmentation research.
Summary
  • Objectives:

    The aim of this work is to show that the use of synthetic data helps to improve semantic segmentation results on real imagery.
Tables
  • Table 1: Driving-scene datasets for semantic segmentation. For each dataset we report the number of training images (T), validation images (V), and the total number of images (A)
  • Table 2: Results of training a T-Net and an FCN on SYNTHIA-Rand and evaluating them on state-of-the-art datasets of driving scenes
  • Table 3: Comparison of training a T-Net and an FCN on real images only against extending the training sets with SYNTHIA-Rand
Related work
  • The generation of semantic segmentation datasets with pixel-level annotations is costly in terms of effort and money, factors that are currently slowing down the development of new large-scale collections like ImageNet [15]. Despite these factors, the community has invested great effort to create datasets such as NYU-Depth V2 [23] (1,449 densely labelled images), the PASCAL-Context dataset [22] (10,103 images densely labelled over 540 categories), and MS COCO [19] (more than 300,000 images with annotations for 80 object categories). These datasets have definitely contributed to boosting research on semantic segmentation of indoor scenes and of common objects, but they are not suitable for more specific tasks such as those involved in autonomous navigation scenarios.

    When semantic segmentation is seen in the context of autonomous vehicles, the amount and variety of annotated images of urban scenarios is much lower in terms of the total number of labelled pixels and the number of classes and instances. A good example is the CamVid [4] dataset, a set of monocular images taken in Cambridge, UK. However, only 701 images contain pixel-level annotations over a total of 32 categories (combining objects and architectural scenes), and usually only the 11 largest categories are used. Similarly, the Daimler Urban Segmentation dataset [33] contains 500 fully labelled monochrome frames for 5 categories. The more recent KITTI benchmark suite [9] has provided a large number of images of urban scenes from Karlsruhe, Germany, with ground truth data for several tasks, but it only contains a total of 430 labelled images for semantic segmentation.
Funding
  • The authors want to thank Andrew Bagdanov for his help and proofreading, as well as the following funding bodies: the Spanish MEC Project TRA2014-57088-C2-1-R, the Spanish DGT Project SPIP201401352, the People Programme (Marie Curie Actions) FP7/2007-2013 REA grant agreement no. 600388, the Agency of Competitiveness for Companies of the Government of Catalonia (ACCIO), the Generalitat de Catalunya Project 2014-SGR-1506, and the NVIDIA Corporation for its generous support in the form of different GPU hardware units.
References
  • [1] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [2] V. Badrinarayanan, A. Handa, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint abs/1505.07293, 2015.
  • [3] S. Bileschi. CBCL StreetScenes challenge framework, 2007.
  • [4] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 2009.
  • [5] G. J. Brostow, J. Shotton, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In Eur. Conf. on Computer Vision (ECCV), 2008.
  • [6] P. P. Busto, J. Liebelt, and J. Gall. Adaptation of synthetic data for coarse-to-fine viewpoint refinement. In British Machine Vision Conf. (BMVC), 2015.
  • [7] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional networks. In British Machine Vision Conf. (BMVC), 2014.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset. In CVPR Workshops, 2015.
  • [9] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. Intl. J. of Robotics Research, 2013.
  • [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [11] A. Handa, V. Patraucean, V. Badrinarayanan, S. Stent, and R. Cipolla. SynthCam3D: Semantic understanding with synthetic indoor scenes. arXiv preprint abs/1505.00171, 2015.
  • [12] H. Hattori, V. Naresh Boddeti, K. M. Kitani, and T. Kanade. Learning scene-specific pedestrian detectors without real data. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Intl. Conf. on Computer Vision (ICCV), 2015.
  • [14] B. Kaneva, A. Torralba, and W. T. Freeman. Evaluating image features using a photorealistic virtual world. In Intl. Conf. on Computer Vision (ICCV), 2011.
  • [15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
  • [18] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. M. Rehg. Joint semantic segmentation and 3D reconstruction from monocular video. In Eur. Conf. on Computer Vision (ECCV), 2014.
  • [19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Eur. Conf. on Computer Vision (ECCV), 2014.
  • [20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [21] J. Marín, D. Vazquez, D. Geronimo, and A. Lopez. Learning appearance in virtual scenarios for pedestrian detection. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2010.
  • [22] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [23] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Eur. Conf. on Computer Vision (ECCV), 2012.
  • [24] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Intl. Conf. on Computer Vision (ICCV), 2015.
  • [25] P. Panareda, J. Liebelt, and J. Gall. Adaptation of synthetic data for coarse-to-fine viewpoint refinement. In British Machine Vision Conf. (BMVC), 2015.
  • [26] J. Papon and M. Schoeler. Semantic pose using deep networks trained on synthetic RGB-D. In Intl. Conf. on Computer Vision (ICCV), 2015.
  • [27] X. Peng, B. Sun, K. Ali, and K. Saenko. Learning deep object detectors from 3D models. In Intl. Conf. on Computer Vision (ICCV), 2015.
  • [28] L. Pishchulin, A. Jain, M. Andriluka, T. Thormahlen, and B. Schiele. Articulated people detection and pose estimation: reshaping the future. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [29] G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez, and A. M. Lopez. Vision-based offline-online perception paradigm for autonomous driving. In Winter Conference on Applications of Computer Vision (WACV), 2015.
  • [30] G. Ros, S. Stent, P. F. Alcantarilla, and T. Watanabe. Training constrained deconvolutional networks for road scene semantic segmentation. arXiv preprint abs/1604.01545, 2016.
  • [31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. Intl. J. of Computer Vision, 2015.
  • [32] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. Intl. J. of Computer Vision, 2008.
  • [33] T. Scharwächter, M. Enzweiler, U. Franke, and S. Roth. Efficient multi-cue scene segmentation. In Pattern Recognition (GCPR), 2013.
  • [34] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from a single depth image. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011.
  • [35] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. arXiv preprint abs/1406.2199, 2014.
  • [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
  • [37] B. Sun and K. Saenko. From virtual to reality: Fast adaptation of virtual object detectors to real domains. In British Machine Vision Conf. (BMVC), 2014.
  • [39] T. L. Berg, A. Sorokin, G. Wang, D. A. Forsyth, D. Hoiem, I. Endres, and A. Farhadi. It's all about the data. Proceedings of the IEEE, 2010.
  • [40] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, 2014.
  • [41] A. Torralba and A. Efros. Unbiased look at dataset bias. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011.
  • [42] D. Vazquez, A. Lopez, J. Marín, D. Ponsa, and D. Geronimo. Virtual and real world adaptation for pedestrian detection. IEEE Trans. Pattern Anal. Machine Intell., 2014.
  • [43] L. Woensel and G. Archer. Ten technologies which could change our lives. Technical report, EPRS - European Parliamentary Research Service, January 2015.
  • [44] J. Xu, S. Ramos, D. Vazquez, and A. Lopez. Domain adaptation of deformable part-based models. IEEE Trans. Pattern Anal. Machine Intell., 2014.
  • [45] J. Xu, S. Ramos, D. Vazquez, and A. M. Lopez. Hierarchical adaptive structural SVM for domain adaptation. 2016.