Deep Hough Voting for 3D Object Detection in Point Clouds

ICCV, pp. 9276-9285, 2019.

DOI: 10.1109/ICCV.2019.00937
Other Links: dblp.uni-trier.de | arxiv.org

Abstract:

Current 3D object detection methods are heavily influenced by 2D detectors. In order to leverage architectures in 2D detectors, they often convert 3D point clouds to regular grids (i.e., to voxel grids or to bird's eye view images), or rely on detection in 2D images to propose 3D boxes. Few works have attempted to directly detect objects in point clouds.

Introduction
  • The goal of 3D object detection is to localize and recognize objects in a 3D scene. In this work, the authors aim to estimate oriented 3D bounding boxes as well as semantic classes of objects from point clouds.

    Compared to images, 3D point clouds provide accurate geometry and robustness to illumination changes.
  • [42, 12] extend 2D detection frameworks such as the Faster/Mask R-CNN [37, 11] to 3D.
  • They voxelize the irregular point clouds into regular 3D grids and apply 3D CNN detectors, an approach that fails to leverage the sparsity of the data and suffers from high computation cost due to expensive 3D convolutions (see the voxelization sketch after this list).
  • Others [4, 55] project the point clouds to bird's-eye view images and apply 2D detectors, reducing the problem to 2D at the cost of geometric detail.
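To make the grid-conversion step concrete, here is a minimal sketch (Python/NumPy; the 0.05 m voxel size, scene extent, and point count are illustrative assumptions, not settings from [42, 12]) of turning an irregular point cloud into a dense occupancy grid. The overwhelming majority of cells end up empty, which is exactly the sparsity dense 3D CNNs fail to exploit.

```python
import numpy as np

def voxelize(points, voxel_size=0.05):
    """Convert an (N, 3) point cloud into a dense binary occupancy grid.

    points: float array of XYZ coordinates in meters.
    Returns a 3D array with 1.0 in every cell that contains a point.
    """
    mins = points.min(axis=0)
    # Map each point to an integer voxel index relative to the scene minimum.
    idx = np.floor((points - mins) / voxel_size).astype(np.int64)
    dims = idx.max(axis=0) + 1
    grid = np.zeros(dims, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

# Hypothetical example: 20k points in a 6 m x 6 m x 3 m room.
pts = np.random.rand(20000, 3) * np.array([6.0, 6.0, 3.0])
grid = voxelize(pts)
print(grid.shape, f"occupied fraction: {grid.mean():.4f}")  # mostly zeros
```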
Highlights
  • The goal of 3D object detection is to localize and recognize objects in a 3D scene
  • Results are summarized in Tables 1 and 2
  • A per-category evaluation for ScanNet is provided in the appendix
  • The same set of network hyperparameters was used in both datasets
  • In this work we have introduced VoteNet: a simple, yet powerful 3D object detection model inspired by Hough voting
  • We believe that the synergy of Hough voting and deep learning can be generalized to more applications such as 6D pose estimation and template-based detection, and expect to see more future research along this line
Methods
  • Methods in comparison

    The authors compare with a wide range of prior art methods. Deep sliding shapes (DSS) [42] and 3D-SIS [12] are both 3D CNN based detectors that combine geometry and RGB cues in object proposal and classification, based on the Faster R-CNN [37] pipeline.
  • Compared with DSS, 3D-SIS introduces a more sophisticated sensor fusion scheme and is able to use multiple RGB views to improve performance.
  • The evaluation metric is average precision with a 3D IoU threshold of 0.25, as proposed by [40] (a minimal IoU/matching sketch follows this list).
  • Note that both COG [38] and 2D-driven [20] use room layout context to boost performance.
  • For a fair comparison with previous methods, the evaluation is on the SUN RGB-D V1 data.
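As a concrete reference for this metric, here is a minimal sketch (Python/NumPy) of 3D IoU and the 0.25-threshold matching rule that underlies average precision. Simplifying assumption: boxes here are axis-aligned (xmin, ymin, zmin, xmax, ymax, zmax), whereas the official SUN RGB-D protocol evaluates oriented boxes.

```python
import numpy as np

def iou_3d(box_a, box_b):
    """3D IoU for axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])   # intersection lower corner
    hi = np.minimum(box_a[3:], box_b[3:])   # intersection upper corner
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return inter / (vol_a + vol_b - inter)

def match_detections(dets, gts, thresh=0.25):
    """Greedily mark detections (lists of boxes, pre-sorted by descending
    confidence) as true positives if they overlap an unclaimed ground-truth
    box with IoU >= thresh; precision/recall and AP follow from the flags."""
    claimed = [False] * len(gts)
    flags = []
    for d in dets:
        ious = [iou_3d(d, g) for g in gts]
        j = int(np.argmax(ious)) if gts else -1
        hit = j >= 0 and ious[j] >= thresh and not claimed[j]
        if hit:
            claimed[j] = True
        flags.append(hit)
    return flags
```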
Results
  • Results are summarized in Tables 1 and 2.
  • VoteNet outperforms all previous methods by at least 3.7 mAP on SUN RGB-D and 18.4 mAP on ScanNet.
  • The authors achieve these improvements using geometric input only, whereas prior methods used both geometry and RGB images.
  • The scenes are quite diverse and pose multiple challenges including clutter, partiality, and scanning artifacts.
  • Despite these challenges, the network demonstrates quite robust results.
Conclusion
  • In this work the authors have introduced VoteNet: a simple, yet powerful 3D object detection model inspired by Hough voting.
  • In future work the authors intend to explore how to incorporate RGB images into the detection framework and to utilize the detector in downstream applications such as 3D instance segmentation.
  • The authors believe that the synergy of Hough voting and deep learning can be generalized to more applications such as 6D pose estimation and template-based detection.
  • The authors thank Daniel Huber, Justin Johnson, Georgia Gkioxari and Jitendra Malik for valuable discussions and feedback.
Summary
  • Objectives:

    The authors aim to estimate oriented 3D bounding boxes as well as semantic classes of objects from point clouds.
  • From an input point cloud of size N × 3, with a 3D coordinate for each of the N points, the authors aim to generate M votes, where each vote has both a 3D coordinate and a high-dimensional feature vector (a minimal voting-module sketch follows this list).
  • Since ScanNetV2 does not provide amodal or oriented bounding box annotations, the authors predict axis-aligned bounding boxes instead, as in [12].
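To make the vote-generation step concrete, below is a minimal PyTorch sketch of a voting module in the spirit described above: a shared MLP regresses, from each seed's feature, a 3D offset toward a hypothesized object center plus a feature residual. Layer sizes and names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VotingModule(nn.Module):
    """Each of M seeds (3D coordinate + C-dim feature) casts one vote:
    a regressed offset toward the object center and an updated feature."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3 + feat_dim),  # (dx, dy, dz) + feature residual
        )

    def forward(self, seed_xyz, seed_feat):
        # seed_xyz: (B, M, 3); seed_feat: (B, M, C)
        out = self.mlp(seed_feat)
        offset, feat_res = out[..., :3], out[..., 3:]
        vote_xyz = seed_xyz + offset       # votes should cluster near centers
        vote_feat = seed_feat + feat_res
        return vote_xyz, vote_feat

# Hypothetical example: a batch of two scenes with 1024 seeds each.
xyz, feat = torch.rand(2, 1024, 3), torch.rand(2, 1024, 256)
votes, vfeat = VotingModule()(xyz, feat)
print(votes.shape, vfeat.shape)  # (2, 1024, 3) and (2, 1024, 256)
```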
Tables
  • Table1: 3D object detection results on SUN RGB-D
  • Table2: 3D object detection results on ScanNetV2
  • Table3: Comparing VoteNet with a no-vote baseline. Metric is 3D object detection mAP. VoteNet estimates object bounding boxes from vote clusters; BoxNet proposes boxes directly from seed points on object surfaces without voting
  • Table4: Model size and processing time (per frame or scan). Our method is more than 4× more compact in model size than [34] and more than 20× faster than [12]
  • Table5: Backbone network architecture: layer specifications
  • Table6: Table 6
  • Table7: Table 7
  • Table8: Effects of seed context for 3D detection. Evaluation metric is mAP@0.25 on SUN RGB-D
  • Table9: Effects of the number of votes per seed. Evaluation metric is mAP@0.25 on SUN RGB-D. If the random number is on, we concatenate a random number to the seed feature before voting, which helps break symmetry in the case of multiple votes per seed (see the sketch after this list)
  • Table10: Effects of proposal sampling. Evaluation metric is mAP@0.25 on SUN RGB-D. 256 proposals are used for all evaluations. Our method is not sensitive to how we choose centers for vote groups/clusters
  • Table11: Effects of the height feature. Evaluation metric is mAP@0.25 on both datasets
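To clarify the symmetry-breaking trick from Table 9, here is a minimal PyTorch sketch (function name and shapes are illustrative assumptions): when a seed casts K > 1 votes from one feature vector, the K vote branches would otherwise see identical inputs and regress identical centers; appending an independent random scalar to each copy lets them diverge.

```python
import torch

def expand_seeds_with_noise(seed_feat, votes_per_seed=3):
    """Duplicate each seed feature K times and append one random scalar
    per copy, so identical vote branches receive distinct inputs.

    seed_feat: (B, M, C) -> returns (B, M * K, C + 1)
    """
    B, M, C = seed_feat.shape
    rep = seed_feat.unsqueeze(2).expand(B, M, votes_per_seed, C)
    noise = torch.rand(B, M, votes_per_seed, 1)  # the symmetry-breaking channel
    out = torch.cat([rep, noise], dim=-1)
    return out.reshape(B, M * votes_per_seed, C + 1)

feat = torch.rand(2, 1024, 256)
print(expand_seeds_with_noise(feat).shape)  # torch.Size([2, 3072, 257])
```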
Related work
  • Due to the complexity of directly working in 3D, especially in large scenes, many methods resort to some type of projection. For example, in MV3D [4] and VoxelNet [55] the 3D data is first reduced to a bird's-eye view before proceeding to the rest of the pipeline. A reduction in search space by first processing a 2D input was demonstrated in both Frustum PointNets [34] and [20]. Similarly, in [16] a segmentation hypothesis is verified using the 3D map. More recently, GSPN [54] and PointRCNN [39] use deep networks on point clouds to exploit the sparsity of the data.

    Hough voting for object detection. Originally introduced in the late 1950s, the Hough transform [13] translates the problem of detecting simple patterns in point samples into detecting peaks in a parametric space. The Generalized Hough Transform [2] further extends this technique to image patches as indicators for the existence of a complex object. Examples of using Hough voting include the seminal work of [24], which introduced the implicit shape model, plane extraction from 3D point clouds [3], and 6D pose estimation [44], to name a few.
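As a concrete instance of "detecting peaks in a parametric space", here is a minimal sketch of the classic Hough transform for 2D line detection under the standard rho = x·cos(theta) + y·sin(theta) parameterization; the accumulator resolutions are arbitrary choices for illustration.

```python
import numpy as np

def hough_lines(points, n_theta=180, n_rho=200):
    """Each 2D point votes for every line passing through it; collinear
    points pile their votes into a peak of the (theta, rho) accumulator."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    max_rho = np.abs(points).sum(axis=1).max() + 1e-6  # bound on |rho|
    acc = np.zeros((n_theta, n_rho), dtype=np.int64)
    for x, y in points:
        rhos = x * np.cos(thetas) + y * np.sin(thetas)  # one vote per theta bin
        bins = ((rhos / max_rho + 1.0) * 0.5 * (n_rho - 1)).astype(int)
        acc[np.arange(n_theta), bins] += 1
    t, r = np.unravel_index(acc.argmax(), acc.shape)    # strongest peak
    return thetas[t], (2.0 * r / (n_rho - 1) - 1.0) * max_rho

# Points on the line y = 2x + 1, plus uniform noise points.
xs = np.linspace(0.0, 1.0, 50)
pts = np.concatenate([np.stack([xs, 2 * xs + 1], axis=1), np.random.rand(20, 2)])
theta, rho = hough_lines(pts)
print(f"peak at theta={theta:.2f} rad, rho={rho:.2f}")
```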
Funding
  • This work was supported in part by ONR MURI grant N00014-13-1-0341, NSF grant IIS-1763268, and a Vannevar Bush Faculty Fellowship.
References
  • [1] Matan Atzmon, Haggai Maron, and Yaron Lipman. Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091, 2018.
  • [2] Dana H. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122, 1981.
  • [3] Dorit Borrmann, Jan Elseberg, Kai Lingemann, and Andreas Nüchter. The 3D Hough transform for plane detection in point clouds: a review and a new accumulator design. 3D Research, 2(2):3, 2011.
  • [4] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3D object detection network for autonomous driving. In IEEE CVPR, 2017.
  • [5] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
  • [6] Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 605–613, 2017.
  • [7] Juergen Gall and Victor Lempitsky. Class-specific Hough forests for object detection. In Decision Forests for Computer Vision and Medical Image Analysis. Springer, 2013.
  • [8] Juergen Gall, Angela Yao, Nima Razavi, Luc Van Gool, and Victor Lempitsky. Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2188–2202, 2011.
  • [9] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9224–9232, 2018.
  • [10] Paul Guerrero, Yanir Kleiman, Maks Ovsjanikov, and Niloy J. Mitra. PCPNet: learning local shape properties from raw point clouds. In Computer Graphics Forum, volume 37, pages 75–85. Wiley Online Library, 2018.
  • [11] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.
  • [12] Ji Hou, Angela Dai, and Matthias Nießner. 3D-SIS: 3D semantic instance segmentation of RGB-D scans. arXiv preprint arXiv:1812.07003, 2018.
  • [13] Paul V. C. Hough. Machine analysis of bubble chamber pictures. In Conf. Proc., volume 590914, pages 554–558, 1959.
  • [14] Li Huan, Qin Yujian, and Wang Li. Vehicle logo retrieval based on Hough transform and deep learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 967–973, 2017.
  • [15] Wadim Kehl, Fausto Milletari, Federico Tombari, Slobodan Ilic, and Nassir Navab. Deep learning of local RGB-D patches for 3D object detection and 6D pose estimation. In European Conference on Computer Vision, pages 205–220. Springer, 2016.
  • [16] Byung-soo Kim, Shili Xu, and Silvio Savarese. Accurate localization of 3D objects from RGB-D data using segmentation hypotheses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3182–3189, 2013.
  • [17] Roman Klokov and Victor Lempitsky. Escape from cells: deep kd-networks for the recognition of 3D point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017.
  • [18] Jan Knopp, Mukta Prasad, and Luc Van Gool. Orientation invariant 3D object classification using Hough transform based methods. In Proceedings of the ACM Workshop on 3D Object Retrieval, pages 15–20. ACM, 2010.
  • [19] Jan Knopp, Mukta Prasad, and Luc Van Gool. Scene cut: class-specific object detection and segmentation in 3D scenes. In 2011 International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, pages 180–187. IEEE, 2011.
  • [20] Jean Lahoud and Bernard Ghanem. 2D-driven 3D object detection in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4622–4630, 2017.
  • [21] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4558–4567, 2018.
  • [22] Truc Le and Ye Duan. PointGrid: a deep network for 3D shape understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9204–9214, 2018.
  • [23] Bastian Leibe, Ales Leonardis, and Bernt Schiele. Combined object categorization and segmentation with an implicit shape model. In Workshop on Statistical Learning in Computer Vision, ECCV, volume 2, page 7, 2004.
  • [24] Bastian Leibe, Ales Leonardis, and Bernt Schiele. Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision, 77(1-3):259–289, 2008.
  • [25] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. PointCNN: convolution on X-transformed points. In Advances in Neural Information Processing Systems, pages 828–838, 2018.
  • [26] Yangyan Li, Angela Dai, Leonidas Guibas, and Matthias Nießner. Database-assisted object retrieval for real-time 3D reconstruction. In Computer Graphics Forum, volume 34. Wiley Online Library, 2015.
  • [27] Dahua Lin, Sanja Fidler, and Raquel Urtasun. Holistic scene understanding for 3D object detection with RGBD cameras. In Proceedings of the IEEE International Conference on Computer Vision, pages 1417–1424, 2013.
  • [28] Or Litany, Tal Remez, Daniel Freedman, Lior Shapira, Alex Bronstein, and Ran Gal. ASIST: automatic semantically invariant scene transformation. CVIU, 157:284–299, 2017.
  • [29] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
  • [30] Subhransu Maji and Jitendra Malik. Object detection using a max-margin Hough transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [31] Fausto Milletari, Seyed-Ahmad Ahmadi, Christine Kroll, Annika Plate, Verena Rozanski, Juliana Maiostre, Johannes Levin, Olaf Dietrich, Birgit Ertl-Wagner, Kai Botzel, et al. Hough-CNN: deep learning for segmentation of deep brain regions in MRI and ultrasound. Computer Vision and Image Understanding, 164:92–102, 2017.
  • [32] Liangliang Nan, Ke Xie, and Andrei Sharf. A search-classify approach for cluttered indoor scene understanding. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2012), 31(6), 2012.
  • [33] David Novotny, Samuel Albanie, Diane Larlus, and Andrea Vedaldi. Semi-convolutional operators for instance segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 86–102, 2018.
  • [34] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.
  • [35] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: deep learning on point sets for 3D classification and segmentation. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
  • [36] Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
  • [37] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • [38] Zhile Ren and Erik B. Sudderth. Three-dimensional object detection and layout prediction using clouds of oriented gradients. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1525–1533, 2016.
  • [39] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. arXiv preprint arXiv:1812.04244, 2018.
  • [40] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. SUN RGB-D: a RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
  • [41] Shuran Song and Jianxiong Xiao. Sliding shapes for 3D object detection in depth images. In Computer Vision – ECCV 2014, pages 634–651. Springer, 2014.
  • [42] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.
  • [43] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SPLATNet: sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018.
  • [44] Min Sun, Gary Bradski, Bing-Xin Xu, and Silvio Savarese. Depth-encoded Hough voting for joint object detection and shape recovery. In European Conference on Computer Vision, pages 658–671. Springer, 2010.
  • [45] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: efficient convolutional architectures for high-resolution 3D outputs. arXiv preprint arXiv:1703.09438, 2017.
  • [46] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3D. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3887–3896, 2018.
  • [47] Alexander Velizhev, Roman Shapovalov, and Konrad Schindler. Implicit shape models for object detection in 3D point clouds. In International Society of Photogrammetry and Remote Sensing Congress, volume 2, 2012.
  • [48] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-CNN: octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
  • [49] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
  • [50] Oliver J. Woodford, Minh-Tri Pham, Atsuto Maki, Frank Perbet, and Bjorn Stenger. Demisting the Hough transform for 3D shape recognition and registration. International Journal of Computer Vision, 106(3):332–341, 2014.
  • [51] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional ShapeContextNet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4606–4615, 2018.
  • [52] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. SpiderCNN: deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pages 87–102, 2018.
  • [53] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. FoldingNet: point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 206–215, 2018.
  • [54] Li Yi, Wang Zhao, He Wang, Minhyuk Sung, and Leonidas Guibas. GSPN: generative shape proposal network for 3D instance segmentation in point cloud. arXiv preprint arXiv:1812.03320, 2018.
  • [55] Yin Zhou and Oncel Tuzel. VoxelNet: end-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.