SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving

CVPR, pp. 11115-11124, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.01113
Other Links: arxiv.org|dblp.uni-trier.de|academic.microsoft.com
We present a simple yet effective approach to generate realistic scenario sensor data, based only on a limited amount of lidar and camera data collected by an autonomous vehicle

Abstract:

Autonomous driving system development is critically dependent on the ability to replay complex and diverse traffic scenarios in simulation. In such scenarios, the ability to accurately simulate the vehicle sensors such as cameras, lidar or radar is essential. However, current sensor simulators leverage gaming engines such as Unreal or Unity […]

Introduction
  • Recent advances in deep learning have inspired breakthroughs in multiple areas related to autonomous driving such as perception [16, 27], prediction [5, 7] and planning [13]
  • These recent trends only underscore the increasingly significant role of data-driven system development.
  • One aspect is that deep learning networks benefit from large training datasets
  • Another is that autonomous driving system evaluation requires the ability to realistically replay a large set of complex and diverse traffic scenarios.
  • Simple ray-casting or ray-tracing techniques are often insufficient to generate realistic camera, LiDAR, or radar data for a specific self-driving system, and additional work is required to adapt the simulated sensor statistics to the real sensors
Highlights
  • Recent advances in deep learning have inspired breakthroughs in multiple areas related to autonomous driving such as perception [16, 27], prediction [5, 7] and planning [13]
  • This representation allows us to render novel views in the scene, corresponding to deviations of the self-driving vehicle and the other agents in the environment from their initially captured trajectories (Sec. 3.1). 2) We propose a generative adversarial network (GAN) architecture that takes in the rendered surfel views and synthesizes images with quality and statistics approaching that of real images (Tab. 1). 3) We build the first dataset for reliably evaluating the task of novel view synthesis for autonomous driving, which contains cases in which two self-driving vehicles observe the same scene at the same time.
  • Since each surfel may have a different appearance across frames, due to variations in lighting conditions and changes in relative pose, we propose to enhance the surfel representation by creating a codebook of such k × k grids at n different distances (see the sketch after this list).
  • Since during the reconstruction process we know the category for each surfel, we can derive both semantic and instance segmentation masks by first rendering an index map that associates each pixel with a surfel index and determining the semantic class or instance number through a look-up table
  • We propose a simple yet effective data-driven approach, which can synthesize camera data for autonomous driving simulations
  • Based on the camera and LiDAR data captured by a vehicle passing through a scene, we reconstruct a 3D model using our Enhanced Surfel Map representation
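
A minimal sketch of how such a per-surfel texture codebook could be organized. The patch size k, the number of distance bins n, the bin edges, and all names here are illustrative assumptions rather than the authors' implementation: each surfel accumulates a k × k color patch per camera-distance bin and, at render time, returns the patch from the closest populated bin.

```python
import numpy as np

# Assumed illustrative parameters: k x k texture patches stored at n distance bins.
K, N_BINS = 8, 4
DIST_EDGES = np.array([0.0, 10.0, 20.0, 40.0, np.inf])  # hypothetical bin edges (meters)

class SurfelCodebook:
    """Per-surfel codebook: one k x k color patch per camera-distance bin."""

    def __init__(self, k=K, n_bins=N_BINS):
        self.patches = np.zeros((n_bins, k, k, 3), dtype=np.float32)  # running mean colors
        self.counts = np.zeros(n_bins, dtype=np.int64)

    def add_observation(self, patch, cam_distance):
        """Accumulate an observed k x k x 3 patch into the bin matching the camera distance."""
        b = int(np.searchsorted(DIST_EDGES, cam_distance, side="right") - 1)
        self.patches[b] = (self.patches[b] * self.counts[b] + patch) / (self.counts[b] + 1)
        self.counts[b] += 1

    def texture_for(self, cam_distance):
        """At render time, return the patch from the non-empty bin closest to the query distance."""
        b = int(np.searchsorted(DIST_EDGES, cam_distance, side="right") - 1)
        valid = np.flatnonzero(self.counts)
        if valid.size == 0:
            return None  # surfel was never observed
        b = valid[np.argmin(np.abs(valid - b))]
        return self.patches[b]
```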
Results
  • The authors base the experiments mainly on the Waymo Open Dataset [2], but they also collected two additional datasets in order to obtain a higher-quality model and enable a more extensive evaluation.
  • Waymo Open Dataset (WOD) [2].
  • The dataset consists of 798 training (WOD-TRAIN) and 202 validation (WOD-EVAL) sequences.
  • Each sequence contains 20 seconds of camera and LiDAR data captured at 10 Hz, as well as fully annotated 3D bounding boxes for vehicles, pedestrians, and cyclists.
  • After reconstructing the surfel scenes, the authors can render the surfel images in the same pose as the original camera images, generating surfel-image-to-camera-image pairs that can be used for paired training and evaluation.
  • Since during the reconstruction process the authors know the category for each surfel, they can derive both semantic and instance segmentation masks by first rendering an index map that associates each pixel with a surfel index and then determining the semantic class or instance number through a look-up table (a minimal sketch follows this list).
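
Because every pixel of a rendered surfel image can carry the index of the surfel that produced it, the segmentation masks reduce to a table lookup. A minimal sketch, assuming hypothetical per-surfel arrays `surfel_semantic` and `surfel_instance` built during reconstruction:

```python
import numpy as np

def masks_from_index_map(index_map, surfel_semantic, surfel_instance, background_id=-1):
    """Derive semantic and instance segmentation masks from a rendered surfel index map.

    index_map:        (H, W) int array; each pixel holds a surfel index, or -1 for background.
    surfel_semantic:  (num_surfels,) int array; semantic class per surfel.
    surfel_instance:  (num_surfels,) int array; instance id per surfel (e.g. per vehicle).
    """
    semantic = np.full(index_map.shape, background_id, dtype=np.int32)
    instance = np.full(index_map.shape, background_id, dtype=np.int32)
    hit = index_map >= 0                          # pixels covered by at least one surfel
    semantic[hit] = surfel_semantic[index_map[hit]]
    instance[hit] = surfel_instance[index_map[hit]]
    return semantic, instance
```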
Conclusion
  • The authors propose a simple yet effective data-driven approach, which can synthesize camera data for autonomous driving simulations.
  • Based on the camera and LiDAR data captured by a vehicle passing through a scene, the authors reconstruct a 3D model using the Enhanced Surfel Map representation (a rough reconstruction sketch follows this list).
  • Given this representation, the authors can render novel views and configurations of objects in the environment.
  • The authors plan to enhance camera simulation further by improving the dynamic object modeling process and by investigating temporally consistent video generation.
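
For illustration, a rough sketch of one way a surfel map could be reconstructed from an aggregated, ego-motion-compensated lidar point cloud: fit one disc per occupied voxel, with the normal taken from the smallest principal component of that voxel's points. The voxel size and function names are assumptions; the paper's Enhanced Surfel Map additionally attaches the per-distance texture codebook sketched earlier.

```python
import numpy as np

VOXEL = 0.2  # assumed voxel size in meters

def build_surfels(points):
    """Fit one surfel (center, normal, radius) per occupied voxel of a lidar point cloud.

    points: (N, 3) array of aggregated, ego-motion-compensated lidar points.
    Returns a list of (center, normal, radius) tuples.
    """
    keys = np.floor(points / VOXEL).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    surfels = []
    for v in range(inverse.max() + 1):
        pts = points[inverse == v]
        if len(pts) < 3:
            continue  # too few points to estimate an orientation
        center = pts.mean(axis=0)
        # Normal = eigenvector of the smallest eigenvalue of the point covariance (PCA).
        cov = np.cov((pts - center).T)
        eigval, eigvec = np.linalg.eigh(cov)
        normal = eigvec[:, 0]
        radius = VOXEL * np.sqrt(2.0)  # disc large enough to cover the voxel footprint
        surfels.append((center, normal, radius))
    return surfels
```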
Summary
  • Objectives:

    Overview of the proposed system: the goal of this work is the generation of camera images for autonomous driving simulation.
Tables
  • Table1: Realism w.r.t. an off-the-shelf vehicle object detector. We generated images using the proposed SurfelGAN and ran inference on them using an off-the-shelf object detector (a scoring sketch follows this list of tables). We report the standard COCO object detection metrics [26], including variants of average precision (AP) and recall at 100 (Rec). Surfel is the surfel rendering that is the input to SurfelGAN. SurfelGAN is the proposed model. The S variant is trained with paired supervised learning only. The SA variant adds the adversarial loss, and the SAC variant makes use of additional unpaired data and a cyclic adversarial loss. Real is the real image captured by cameras, which is only available in […]
  • Table2: Detector metric break down at different perturbation levels on WOD-EVAL-NV by SurfelGAN-SAC
  • Table3: Image-pixel realism. We applied the SurfelGAN on the Dual-Camera-Pose Dataset, where it is possible to measure l1 […]
  • Table4: Detector metric on Open Dataset validation set when trained with different combination of data
  • Table5: Object detection metric for different surfel image coverage ratios (r) on WOD-EVAL-NV
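
The realism metric of Table 1 can be reproduced in outline with the standard COCO toolkit: run an off-the-shelf vehicle detector on the synthesized (or surfel-rendered) images and score its boxes against the ground-truth annotations. A minimal sketch assuming COCO-format ground-truth and result files; the file names and the detector itself are placeholders, not the authors' exact setup.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def detector_realism(gt_annotations_json, detections_json):
    """Score detections produced on synthesized images with the standard COCO AP / recall metrics.

    gt_annotations_json: ground-truth boxes for the rendered scenes (COCO annotation format).
    detections_json:     boxes from an off-the-shelf vehicle detector run on the SurfelGAN
                         (or surfel-rendered) images, in COCO results format.
    """
    coco_gt = COCO(gt_annotations_json)
    coco_dt = coco_gt.loadRes(detections_json)
    ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
    ev.evaluate()
    ev.accumulate()
    ev.summarize()      # prints AP, AP50, AP75, AR@100, etc.
    return ev.stats     # numpy array of the 12 standard COCO metrics
```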
Related work
  • Simulated Environments for Driving Agents. There have been many efforts towards building simulated environments for various tasks [6, 12, 40, 41, 42]. Much work has focused on indoor environments [6, 40, 42] based on public indoor datasets such as SUNCG [36] or Matterport3D [8]. In contrast to indoor settings, where the environment is relatively simple and easy to model, simulators for autonomous driving face significant challenges in modeling the complicated and dynamic scenarios of real-world scenes. TORCS [41] is one of the first simulation environments to support multi-agent racing, but it is not tailored for real-world autonomous driving research and development. DeepGTAV [1] provides a plugin that transforms the Grand Theft Auto gaming environment into a vision-based self-driving car research environment. CARLA [12] is a popular open-source simulation engine that supports the training and testing of self-driving vehicles (SDVs). All these simulators rely on the manual creation of synthetic environments, which is a formidable and laborious process. In CARLA [12], the 3D model of the environment, which includes buildings, roads, vegetation, vehicles, and pedestrians, is manually created. The simulator provides one town with 2.9 km of drivable roads for training and another town with 1.4 km of drivable roads for testing. In contrast, our system is easily extendable to new scenes that are driven by an SDV. Furthermore, because the environment we build is a high-quality reconstruction based on the vehicle sensors, it naturally closes the domain gap between synthetic and real content that is present in most traditional simulation environments. Similar to this work, AADS [24] utilizes real sensor data to synthesize novel views. The major difference is that we reconstruct the 3D environment, while AADS uses purely image-based novel view synthesis. Reconstructing the 3D environment gives us the freedom to synthesize novel views that could not easily be captured in the real world. Moreover, once our environment is built, we no longer need to store the images or query the nearest K views at synthesis time, which saves time during deployment.
Funding
  • Presents a simple yet effective approach to generate realistic scenario sensor data, based only on a limited amount of lidar and camera data collected by an autonomous vehicle
  • Demonstrates our approach on the Waymo Open Dataset and show that it can synthesize realistic camera data for simulated scenarios
  • Proposes a simple yet effective data-driven approach for creating realistic scenario sensor data
  • Describes a pipeline that builds a detailed reconstruction of a dynamic scene from real-world sensor data
  • Proposes a GAN architecture that takes in the rendered surfel views and synthesizes images with quality and statistics approaching that of real images; builds the first dataset for reliably evaluating the task of novel view synthesis for autonomous driving, which contains cases in which two self-driving vehicles observe the same scene at the same time
Reference
  • [1] Deepgtav v2.
  • [2] Waymo open dataset: An autonomous driving dataset, 2019.
  • [3] Kara-Ali Aliev, Dmitry Ulyanov, and Victor Lempitsky. Neural point-based graphics. arXiv preprint arXiv:1906.08240, 2019.
  • [4] Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
  • [5] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. In RSS, 2019.
  • [6] Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, and Aaron Courville. HoME: A household multimodal environment. arXiv preprint arXiv:1711.11017, 2017.
  • [7] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449, 2019.
  • [8] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
  • [9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [10] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A. Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 2018.
  • [11] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In SIGGRAPH, 1996.
  • [12] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. arXiv preprint arXiv:1711.03938, 2017.
  • [13] Michael Everett, Yu Fan Chen, and Jonathan P. How. Motion planning among dynamic, decision-making agents with deep reinforcement learning. In IROS, 2018.
  • [14] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. TPAMI, 2010.
  • [15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [16] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [20] Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, and Sanja Fidler. Meta-Sim: Learning to generate synthetic datasets. arXiv preprint arXiv:1904.11621, 2019.
  • [21] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • [22] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing, 2006.
  • [23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [24] Wei Li, Chengwei Pan, Rong Zhang, Jiaping Ren, Yuexin Ma, Jin Fang, Feilong Yan, Qichuan Geng, Xinyu Huang, Huajun Gong, et al. AADS: Augmented autonomous driving simulation using data-driven algorithms. arXiv preprint arXiv:1901.07849, 2019.
  • [25] Jae Hyun Lim and Jong Chul Ye. Geometric GAN. arXiv preprint arXiv:1705.02894, 2017.
  • [26] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [27] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
  • [28] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and Furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In CVPR, 2018.
  • [29] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
  • [30] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In CVPR, 2019.
  • [31] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
  • [32] Hanspeter Pfister, Matthias Zwicker, Jeroen van Baar, and Markus Gross. Surfels: Surface elements as rendering primitives. In SIGGRAPH, 2000.
  • [33] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
  • [34] A. Segal, D. Haehnel, and S. Thrun. Generalized-ICP. In Proceedings of Robotics: Science and Systems, 2009.
  • [35] Shaoshuai Shi, Zhe Wang, Xiaogang Wang, and Hongsheng Li. Part-A2 Net: 3D part-aware and aggregation neural network for object detection from point cloud. arXiv preprint arXiv:1907.03670, 2019.
  • [36] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
  • [37] Shimon Ullman. The interpretation of structure from motion. Proceedings of the Royal Society of London, Series B, 1979.
  • [38] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.
  • [39] Changchang Wu et al. VisualSFM: A visual structure from motion system. 2011.
  • [40] Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209, 2018.
  • [41] Bernhard Wymann, Eric Espie, Christophe Guionneau, Christos Dimitrakakis, Remi Coulom, and Andrew Sumner. TORCS, the open racing car simulator. Software available at http://torcs.sourceforge.net, 2000.
  • [42] Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: Real-world perception for embodied agents. In CVPR, 2018.
  • [43] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
  • [44] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.