Autolabeling 3D Objects with Differentiable Rendering of SDF Shape Priors

CVPR, pp. 12221-12230, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.01224

Abstract:

We present an automatic annotation pipeline to recover 9D cuboids and 3D shapes from pre-trained off-the-shelf 2D detectors and sparse LIDAR data. Our autolabeling method solves this challenging ill-posed inverse problem by relying on learned shape priors and optimization of geometric and physical parameters. To that end, we propose a no…
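A 9D cuboid here conventionally means three degrees of freedom each for translation, rotation, and metric extent. A minimal sketch of this parameterization, assuming the standard corner convention (all names are illustrative, not from the paper's code):

```python
import numpy as np

def cuboid_corners(t, R, s):
    """Return the 8 corners of a 9-DoF cuboid.

    t : (3,) translation, R : (3, 3) rotation matrix, s : (3,) metric extents.
    """
    # Unit-cube corner signs, scaled to the extents (object frame).
    signs = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)])
    local = 0.5 * signs * s
    # Rotate, then translate into the world frame.
    return local @ R.T + t

corners = cuboid_corners(np.zeros(3), np.eye(3), np.array([4.0, 1.8, 1.5]))
print(corners.shape)  # (8, 3)
```

With identity rotation and zero translation, the corner spread along each axis equals the requested extents, which is a quick sanity check for the convention.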

Introduction
  • Deep learning methods require large labeled datasets to achieve state-of-the-art performance.
  • Existing approaches for scaling up annotation pipelines include better tooling, active learning, or a combination thereof [22, 16, 39, 25, 4].
  • Such approaches often rely on heuristics and require human effort to correct the outcomes of semi-automatic labeling for difficult edge cases.
Highlights
  • Deep learning methods require large labeled datasets to achieve state-of-the-art performance
  • Afterwards, we introduce our differentiable rendering approach tailored towards implicit surface representations
  • We find its nearest neighbor based on normalized object coordinate space (NOCS) distances: j* = argmin_j ||p_i^c − l_j^c||_2 (Eq. 9)
  • We present a novel view on parametric 3D instance recovery in the wild based on a self-improving autolabeling pipeline, purely bootstrapped from synthetic data and off-the-shelf detectors
  • Fundamental to our approach is the combination of dense surface coordinates with a shape space, and our contribution towards differentiable rendering of signed distance fields
  • We show that our approach can recover a substantial amount of cuboid labels with high precision, and that these labels can be used to train 3D object detectors with results close to the state of the art
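The nearest-neighbor matching of Eq. (9) can be sketched in plain numpy; `nearest_nocs_neighbors`, `p`, and `l` are illustrative names, not the paper's:

```python
import numpy as np

def nearest_nocs_neighbors(p, l):
    """For each predicted coordinate p_i, return j* = argmin_j ||p_i - l_j||_2.

    p : (N, 3) predicted NOCS coordinates, l : (M, 3) candidate coordinates.
    """
    # Pairwise Euclidean distances via broadcasting: shape (N, M).
    d = np.linalg.norm(p[:, None, :] - l[None, :, :], axis=-1)
    return d.argmin(axis=1)

p = np.array([[0.1, 0.0, 0.0], [0.9, 0.9, 0.9]])
l = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5]])
print(nearest_nocs_neighbors(p, l))  # [0 1]
```

A KD-tree (e.g. `scipy.spatial.cKDTree`) would replace the O(NM) broadcast for large point sets.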
Results
  • The authors measure the fraction of LIDAR points that lie in a narrow band (0.2m) around the surface of an autolabel and reject it if less than 60% of the points fall inside this band.
  • The estimated NS scores are quite high, indicating that most autolabels are within one meter of the real location.
  • The pose-optimized autolabels yield a significant jump in 3D IoU (41.85% vs 63.42%), suggesting that the authors recover substantially better rotations, given that the NS scores are similar.
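The surface-band plausibility check described above can be sketched as follows; only the 0.2 m band and the 60% threshold come from the text, the function and variable names are illustrative:

```python
import numpy as np

def accept_autolabel(surface_distances, band=0.2, min_inlier_ratio=0.6):
    """Keep an autolabel only if enough LIDAR points lie in a narrow band
    around its surface.

    surface_distances : (N,) distances of the object's LIDAR points to the
    autolabel's surface (e.g. |SDF| evaluated at the point locations).
    """
    inliers = np.abs(surface_distances) <= band
    return inliers.mean() >= min_inlier_ratio

d = np.array([0.05, 0.1, -0.15, 0.5, 0.02])  # 4 of 5 points within the 0.2 m band
print(accept_autolabel(d))  # True
```

The check discards geometrically implausible fits cheaply, before any label enters the training pool.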
Conclusion
  • The authors present a novel view on parametric 3D instance recovery in the wild based on a self-improving autolabeling pipeline, purely bootstrapped from synthetic data and off-the-shelf detectors.
  • Fundamental to the approach is the combination of dense surface coordinates with a shape space, and the contribution towards differentiable rendering of SDFs. The authors show that the approach can recover a substantial amount of cuboid labels with high precision, and that these labels can be used to train 3D object detectors with results close to the state of the art.
  • Future work will be focused on investigating additional categories for parametric reconstruction, such as pedestrians or road surfaces
Tables
  • Table1: Cuboid autolabel quality when inputting into the CSS network (a) 2D ground truth boxes, (b) R-CNN detections, and (c) Mask R-CNN detections. We run two self-improving loops to slowly incorporate more labels into the pool
  • Table2: Performance comparison of 3D object detectors trained on the true KITTI labels vs. our autolabels. On the BEV metric, the detectors trained on autolabels alone achieve results equal to the current state of the art. On the 3D AP metric, competitive results are achieved in both considered variants at the 0.5 IoU threshold
  • Table3: Ablation study over each optimization variable and each separate loss
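The 3D IoU metric referenced in the tables compares box volumes. A simplified sketch for axis-aligned boxes (the benchmark itself uses oriented boxes, so this illustrates the computation only):

```python
import numpy as np

def iou_3d_axis_aligned(a, b):
    """3D IoU of two axis-aligned boxes, each given as a (min_xyz, max_xyz) pair."""
    a_min, a_max = a
    b_min, b_max = b
    # Per-axis overlap, clamped at zero when the boxes are disjoint.
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter_vol = inter.prod()
    vol = lambda box: (box[1] - box[0]).prod()
    union = vol(a) + vol(b) - inter_vol
    return inter_vol / union

a = (np.zeros(3), np.ones(3))
b = (np.array([0.5, 0.0, 0.0]), np.array([1.5, 1.0, 1.0]))
print(iou_3d_axis_aligned(a, b))  # ≈ 0.333
```

Oriented-box IoU additionally intersects the rotated footprints (e.g. via polygon clipping) before multiplying by the height overlap.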
Related work
  • In recent years, assisted labeling has gained growing attention as the increasing amount of data hinders the ability to label it manually. In [41], the authors utilize a 2D detector to seed 2D box annotations that are further refined by humans, and report a 60% increase in overall labeling speed. The authors in [2] train a recurrent CNN to predict polygons on an image to accelerate semantic segmentation tasks. A follow-up work [25] further improves the system by predicting all polygon vertices simultaneously, enabling real-time interaction. In [22], the authors propose a 3D labeling interface that lets users select spatial seeds from which segmentation, 3D centroid, orientation, and extent are inferred using pretrained networks. In [10], 2D labels are used to seed a LIDAR-based detector combined with uncertainty-driven human annotation. All mentioned works are active learning frameworks in which a human is assisted by predictive models. Instead, we aim to investigate how well a fully automatic pipeline with geometric verification can perform in this context.
Contributions
  • Presents an automatic annotation pipeline to recover 9D cuboids and 3D shapes from pre-trained off-the-shelf 2D detectors and sparse LIDAR data
  • Demonstrates that differentiable visual alignment, referred to as "analysis-by-synthesis" or "render-and-compare", is a powerful approach towards autolabeling for the purpose of autonomous driving
  • Evaluates the approach on the KITTI3D dataset and shows that the method can accurately recover metric cuboids with structural, differentiable priors
  • Demonstrates that such cuboids can be leveraged to train efficient 3D object detectors
References
  • Parallel Domain: Data generation for autonomy. https://www.paralleldomain.com/. 3
  • David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with Polygon-RNN++. In CVPR, 2018. 2
  • Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv, 2019. 6
  • Wenzheng Chen, Jun Gao, Huan Ling, Edward J. Smith, Jaakko Lehtinen, Alec Jacobson, and Sanja Fidler. Learning to predict 3D objects with an interpolation-based differentiable renderer. In NeurIPS, 2019. 1, 2
  • Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3D object detection network for autonomous driving. In CVPR, 2017. 8
  • Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In SIGGRAPH, 1996
  • Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In CoRL, 2017
  • Francis Engelmann, Jörg Stückler, and Bastian Leibe. Joint object pose estimation and shape reconstruction in urban street scenes using 3D shape priors. In GCPR, 2016. 3
  • Francis Engelmann, Jörg Stückler, and Bastian Leibe. SAMP: Shape and motion priors for 4D vehicle reconstruction. In WACV, 2017. 3
  • Di Feng, Xiao Wei, Lars Rosenbaum, Atsuto Maki, and Klaus Dietmayer. Deep active learning for efficient training of a LIDAR 3D object detector. In IV, 2019. 2
  • Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016. 3
  • Andreas Geiger, Philipp Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012. 6
  • Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In CVPR, 2018. 2
  • Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In ICCV, 2017. 6
  • Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D mesh renderer. In CVPR, 2018. 2
  • Calvin Huang. Adding a dimension: Annotating 3D objects with 2D data. 2018. https://scale.com/blog/3d-cuboids-annotations. 1
  • Omid Hosseini Jafari, Siva Karthik Mustikovela, Karl Pertsch, Eric Brachmann, and Carsten Rother. iPose: Instance-aware 6D pose estimation of partly occluded objects. In ACCV, 2018. 2
  • Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018
  • Nilesh Kulkarni, Abhinav Gupta, and Shubham Tulsiani. Canonical surface mapping via geometric cycle consistency. In ICCV, 2019
  • Abhijit Kundu, Yin Li, and James M. Rehg. 3D-RCNN: Instance-level 3D object reconstruction via render-and-compare. In CVPR, 2018. 2
  • Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. In CVPR, 2019. 7, 8
  • Jungwook Lee, Sean Walsh, Ali Harakeh, and Steven Waslander. Leveraging pre-trained 3D object detection models for fast ground truth generation. In ITSC, 2018. 1, 2
  • Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. Differentiable Monte Carlo ray tracing through edge sampling. In SIGGRAPH Asia, 2018. 2
  • Zhigang Li, Gu Wang, and Xiangyang Ji. CDPN: Coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation. In ICCV, 2019. 2
  • Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with Curve-GCN. In CVPR, 2019. 1, 2
  • Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft Rasterizer: A differentiable renderer for image-based 3D reasoning. In ICCV, 2019. 2, 4
  • Matthew M. Loper and Michael J. Black. OpenDR: An approximate differentiable renderer. In ECCV, 2014. 2
  • William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In SIGGRAPH, 1987. 3
  • Fabian Manhardt, Wadim Kehl, and Adrien Gaidon. ROI-10D: Monocular lifting of 2D detection to 6D pose and metric shape. In CVPR, 2019. 2, 7
  • Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In CVPR, 2019. 2, 3
  • Kiru Park, Timothy Patten, and Markus Vincze. Pix2Pose: Pixel-wise coordinate regression of objects for 6D pose estimation. In ICCV, 2019. 2
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017. 6
  • Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. PVNet: Pixel-wise voting network for 6DoF pose estimation. In CVPR, 2019. 2
  • Hanspeter Pfister, Matthias Zwicker, Jeroen van Baar, and Markus Gross. Surfels: Surface elements as rendering primitives. In SIGGRAPH, 2000. 4
  • Peter H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 1966. 5
  • Andrea Simonelli, Samuel Rota Bulò, Lorenzo Porzi, Manuel Lopez-Antequera, and Peter Kontschieder. Disentangling monocular 3D object detection. In ICCV, 2019. 7, 8
  • David Stutz and Andreas Geiger. Learning 3D shape completion under weak supervision. IJCV, 2018. 3
  • He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object coordinate space for category-level 6D object pose and size estimation. In CVPR, 2019. 2, 3
  • Zian Wang, Huan Ling, David Acuna, Amlan Kar, and Sanja Fidler. Object instance annotation with deep extreme level set evolution. In CVPR, 2019. 1
  • Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019. 6
  • Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv, 2018. 2
  • Alan Yuille and Daniel Kersten. Vision as Bayesian inference: Analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–308, 2006. 2
  • Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. DPOD: Dense 6D pose object detector in RGB images. In ICCV, 2019. 2
  • Jason Y. Zhang, Panna Felsen, Angjoo Kanazawa, and Jitendra Malik. Predicting 3D human dynamics from video. In ICCV, 2019. 2
  • Silvia Zuffi, Angjoo Kanazawa, Tanya Berger-Wolf, and Michael J. Black. Three-D Safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". In ICCV, 2019. 2