PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation

Computer Vision and Pattern Recognition (CVPR), 2018.

DOI: https://doi.org/10.1109/cvpr.2018.00033

Abstract:

We present PointFusion, a generic 3D object detection method that leverages both image and 3D point cloud information. Unlike existing methods that either use multistage pipelines or hold sensor and dataset-specific assumptions, PointFusion is conceptually simple and application-agnostic. The image data and the raw point cloud data are in...

Introduction
  • The authors focus on 3D object detection, which is a fundamental computer vision problem impacting most autonomous robotics systems including self-driving cars and drones.
  • Methods for 3D box regression from a single image, even recent deep learning methods such as [21, 36], still have relatively low accuracy, especially in depth estimates at longer ranges.
  • Many current real-world systems either use stereo or augment their sensor stack with lidar and radar.
  • The lidar-radar mixed-sensor setup is popular in self-driving cars and is typically handled by a multi-stage pipeline.
Highlights
  • We focus on 3D object detection, which is a fundamental computer vision problem impacting most autonomous robotics systems including self-driving cars and drones
  • We show that by combining PointFusion with an off-the-shelf 2D object detector [25], we get comparable or better 3D object detections than the state-of-the-art methods designed for KITTI [3] and SUN-RGBD [16, 32, 26]
  • We present the PointFusion network, which accurately estimates 3D object bounding boxes from image and point cloud information
  • The raw point cloud data is directly handled using a PointNet model, which avoids lossy input preprocessing such as quantization or projection
  • We introduce a novel dense fusion network, which combines the image and point cloud representations. It predicts multiple 3D box hypotheses relative to the input 3D points, which serve as spatial anchors, and automatically learns to select the best hypothesis (a minimal sketch of this spatial-anchor formulation follows this list)
  • Compared with MV3D [3], the final model outperforms this state-of-the-art method on the easy category (3% higher AP3D) and performs similarly on the moderate category (1.5% lower AP3D)
  • We show that with the same architecture and hyper-parameters, our method is able to perform on par with or better than methods that hold dataset- and sensor-specific assumptions on two drastically different datasets
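
  The sketch below illustrates the spatial-anchor idea referenced above: every input 3D point acts as an anchor, the fused image and point-cloud features yield per-point corner offsets plus a confidence score, and the highest-scoring hypothesis is kept. This is a minimal numpy illustration under assumed feature dimensions, not the authors' implementation; the function names and the random arrays standing in for network outputs are illustrative only.

    import numpy as np

    def fuse_features(point_feats, global_point_feat, image_feat):
        # Dense fusion: concatenate each point's feature with the tiled global
        # point-cloud feature and the tiled image feature -> (n, d_fused).
        n = point_feats.shape[0]
        return np.concatenate(
            [point_feats,
             np.repeat(global_point_feat[None, :], n, axis=0),
             np.repeat(image_feat[None, :], n, axis=0)],
            axis=1)

    def select_best_hypothesis(points, corner_offsets, scores):
        # points: (n, 3) spatial anchors; corner_offsets: (n, 8, 3); scores: (n,).
        # Each point proposes a box as offsets from itself; keep the best-scored one.
        best = int(np.argmax(scores))
        return points[best][None, :] + corner_offsets[best]  # (8, 3) absolute corners

    rng = np.random.default_rng(0)
    n = 400                                    # lidar points inside one 2D detection crop
    points      = rng.uniform(-2.0, 2.0, size=(n, 3))
    point_feats = rng.normal(size=(n, 64))     # per-point PointNet features (assumed dim)
    global_feat = rng.normal(size=(1024,))     # global PointNet feature (assumed dim)
    image_feat  = rng.normal(size=(2048,))     # CNN feature of the image crop (assumed dim)

    fused   = fuse_features(point_feats, global_feat, image_feat)  # (n, 3136)
    offsets = rng.normal(scale=0.5, size=(n, 8, 3))  # stand-in for the network's corner offsets
    scores  = rng.uniform(size=(n,))                 # stand-in for the network's per-anchor scores
    corners = select_best_hypothesis(points, offsets, scores)
    print(fused.shape, corners.shape)          # (400, 3136) (8, 3)

  In the paper, a shared prediction head over the fused features produces the offsets and scores, and the score is learned so that the best anchor is selected automatically; here random arrays merely stand in to show the data flow.
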
Methods
  • The authors focus on answering two questions: 1) does PointFusion perform well on different sensor configurations and environments compared to models that hold dataset- or sensor-specific assumptions, and 2) do the dense prediction architectures perform better than architectures that directly regress the spatial locations.
  • The official training set contains 7481 images.
  • The authors follow [3] and split the dataset into training and validation sets, each containing around half of the entire set.
  • The authors report model performance on the validation set for all three object categories
Results
  • Evaluation on KITTI

    Overview: Table 1 shows a comprehensive comparison of models trained and evaluated only on the car category of the KITTI validation set, including all baselines and the state-of-the-art methods 3DOP [2], VeloFCN [18] (lidar), and MV3D [3] (lidar + RGB); the AP3D criterion used in these comparisons is sketched after this list.
  • Comparison with the baselines As shown in Table 3, final is the best model variant and outperforms the rgb-d baseline by 6% mAP.
  • This is a much smaller gap than in the KITTI dataset, which shows that the CNN performs well when it is given dense depth information.
  • Failure modes include errors caused by objects that are only partially visible in the image and cascading errors from the 2D detector.
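
  As a note on the metric behind these comparisons, AP3D is average precision where a detection counts as correct only if its 3D IoU with a ground-truth box exceeds a threshold (0.7 for KITTI cars). The snippet below is a simplified illustration of the 3D IoU test: it assumes axis-aligned boxes for brevity, whereas KITTI boxes are oriented in the ground plane, so the actual benchmark intersects rotated boxes.

    import numpy as np

    def iou_3d(box_a, box_b):
        # Axis-aligned 3D IoU between boxes given as (xmin, ymin, zmin, xmax, ymax, zmax).
        a, b = np.asarray(box_a, float), np.asarray(box_b, float)
        lo = np.maximum(a[:3], b[:3])
        hi = np.minimum(a[3:], b[3:])
        inter = np.prod(np.clip(hi - lo, 0.0, None))   # overlap volume (0 if disjoint)
        vol_a = np.prod(a[3:] - a[:3])
        vol_b = np.prod(b[3:] - b[:3])
        return inter / (vol_a + vol_b - inter)

    # A predicted box overlapping a ground-truth car-sized box:
    gt   = (0.0, 0.0, 0.0, 4.0, 1.6, 1.5)
    pred = (0.3, 0.1, 0.0, 4.2, 1.7, 1.5)
    print(iou_3d(gt, pred) >= 0.7)  # True -> counted as a true positive for AP3D
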
Conclusion
  • Conclusions and Future Work

    The authors present the PointFusion network, which accurately estimates 3D object bounding boxes from image and point cloud information.
  • The authors introduce a novel dense fusion network, which combines the image and point cloud representations.
  • It predicts multiple 3D box hypotheses relative to the input 3D points, which serve as spatial anchors, and automatically learns to select the best hypothesis.
  • Promising directions of future work include combining the 2D detector and the PointFusion network into a single end-to-end 3D detector, as well as extending the model with a temporal component to perform joint detection and tracking in video and point cloud streams.
Tables
  • Table 1: AP3D results for the car category on the KITTI dataset
  • Table 2: AP3D results for models trained on all KITTI classes
Related work
  • We give an overview of the previous work on 6-DoF object pose estimation, which is related to our approach.
  • Geometry-based methods: A number of methods focus on estimating the 6-DoF object pose from a single image or an image sequence. These include keypoint matching between 2D images and their corresponding 3D CAD models [1, 5, 37], or aligning 3D-reconstructed models with ground-truth models to recover the object poses [28, 9]. Gupta et al. [12] propose to predict a semantic segmentation map as well as object pose hypotheses using a CNN and then align the hypotheses with known object CAD models using ICP. These types of methods rely on strong category shape priors or ground-truth object CAD models, which makes them difficult to scale to larger datasets. In contrast, our generic method estimates both the 6-DoF pose and spatial dimensions of an object without object category knowledge or CAD models.
  • 3D box regression from images: The recent advances in deep models have dramatically improved 2D object detection, and some methods propose to extend the objectives with the full 3D object poses. [33] uses R-CNN to propose 2D RoIs and another network to regress the object poses. [21] combines a set of deep-learned 3D object parameters and geometric constraints from 2D RoIs to recover the full 3D box. Xiang et al. [36, 35] jointly learn a viewpoint-dependent detector and a pose estimator by clustering 3D voxel patterns learned from object models. Although these methods excel at estimating object orientations, localizing the objects in 3D from an image is often handled by imposing geometric constraints [21] and remains a challenge due to the lack of direct depth measurements. One of the key contributions of our model is that it learns to effectively combine the complementary image and depth sensor information.
  • 3D box regression from depth data: Newer studies have proposed to directly tackle the 3D object detection problem in discretized 3D spaces. Song et al. [31] learn to classify 3D bounding box proposals generated by a 3D sliding window using synthetically-generated 3D features. A follow-up study [32] uses a 3D variant of the Region Proposal Network [25] to generate 3D proposals and a 3D ConvNet to process the voxelized point cloud. A similar approach by Li et al. [17] focuses on detecting vehicles and processes the voxelized input with a 3D fully convolutional network. However, these methods are often prohibitively expensive because of the discretized volumetric representation; as an example, [32] takes around 20 seconds to process one frame. Other methods, such as VeloFCN [18], focus on a single lidar setup and form a dense depth and intensity image, which is processed with a single 2D CNN. Unlike these methods, our model handles the raw point cloud directly, avoiding lossy preprocessing such as quantization or projection.
Funding
  • Compared with MV3D [3], the final model also outperforms this state-of-the-art method on the easy category (3% higher AP3D) and performs similarly on the moderate category (1.5% lower AP3D)
  • When we train a single model using all 3 KITTI categories, final (all-class), we roughly get a 3% further increase, achieving a 6% gain over MV3D on the easy examples and a 0.5% gain on the moderate ones. This suggests that training a single model across categories further improves performance
References
  • M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3762–3769, 2014. 2
  • X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In Advances in Neural Information Processing Systems, pages 424–432, 2015. 6, 7
  • X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In IEEE CVPR, 2017. 1, 2, 3, 4, 5, 6, 7
  • H. Cho, Y.-W. Seo, B. V. Kumar, and R. R. Rajkumar. A multi-sensor fusion system for moving object detection and tracking in urban driving environments. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 1836–1843. IEEE, 2014
  • A. Collet, M. Martinez, and S. S. Srinivasa. The moped framework: Object recognition and pose estimation for manipulation. The International Journal of Robotics Research, 30(10):1284–1306, 2011. 2
  • N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005. 4
  • M. Enzweiler and D. M. Gavrila. A multilevel mixture-ofexperts framework for pedestrian classification. IEEE Transactions on Image Processing, 20(10):2967–2979, 2011. 1
  • D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2154, 2014. 4
  • V. Ferrari, T. Tuytelaars, and L. Van Gool. Simultaneous object recognition and segmentation from single or multiple model views. International Journal of Computer Vision, 67(2):159–188, 2006. 2
  • A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354–3361. IEEE, 2012. 1, 2, 5
  • M. Giering, V. Venugopalan, and K. Reddy. Multi-modal sensor registration for vehicle perception via deep neural networks. In High Performance Extreme Computing Conference (HPEC), 2015 IEEE, pages 1–6. IEEE, 2015. 2
  • S. Gupta, P. Arbelaez, R. Girshick, and J. Malik. Aligning 3d models to rgb-d images of cluttered scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4731–4740, 2015. 2
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2
  • J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE international conference on computer vision, 2017. 5
  • L. Huang, Y. Yang, Y. Deng, and Y. Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015. 2, 4
  • J. Lahoud and B. Ghanem. 2d-driven 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4622– 4630, 2017. 1, 2, 5, 8
  • B. Li. 3d fully convolutional network for vehicle detection in point cloud. IROS, 2016. 2
  • B. Li, T. Zhang, and T. Xia. Vehicle detection from 3d lidar using fully convolutional network. arXiv preprint arXiv:1608.07916, 2016. 1, 2, 4, 6, 7
  • T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, 2017. 1
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 5
  • A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka. 3d bounding box estimation using deep learning and geometry. IEEE CVPR, 2017. 1, 2
  • T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In IEEE CVPR, 2017. 1
  • C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016. 2, 3, 4
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016. 4
  • S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015. 1, 2, 3, 4, 5
  • Z. Ren and E. B. Sudderth. Three-dimensional object detection and layout prediction using clouds of oriented gradients. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1525–1533, 2016. 2, 5, 8
  • G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2
  • F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce. 3d object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. International Journal of Computer Vision, 66(3):231–259, 2006. 2
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. 5
  • S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgbd scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015. 1, 2, 5
  • S. Song and J. Xiao. Sliding shapes for 3d object detection in depth images. In European conference on computer vision, pages 634–651. Springer, 2014. 2
  • S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016. 1, 2, 5, 8
  • S. Tulsiani and J. Malik. Viewpoints and keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1510–1519, 2015. 2
  • J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant cnns. arXiv preprint arXiv:1708.06500, 2017. 2
  • Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven 3d voxel patterns for object category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1903–1911, 2015. 2
  • Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Subcategoryaware convolutional neural networks for object proposals and detection. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 924–933. IEEE, 2017. 1, 2
  • M. Zhu, K. G. Derpanis, Y. Yang, S. Brahmbhatt, M. Zhang, C. Phillips, M. Lecce, and K. Daniilidis. Single image 3d object detection and pose estimation for grasping. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 3936–3943. IEEE, 2014. 2