3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction

JunYoung Gwak
Kevin Chen

ECCV, 2016.

Abstract:

Inspired by the recent success of methods that employ shape priors to achieve robust 3D reconstructions, we propose a novel recurrent neural network architecture that we call the 3D Recurrent Reconstruction Neural Network (3D-R2N2). The network learns a mapping from images of objects to their underlying 3D shapes from a large collection of synthetic data.

Introduction
  • Rapid and automatic 3D object prototyping has become a game-changing innovation in many applications related to e-commerce, visualization, and architecture, to name a few.
  • This trend has been boosted now that 3D printing is a democratized technology and 3D acquisition methods are accurate and efficient [2].
  • However, such methods are subject to restrictions: i) they typically require a large number of views, which is an issue when users wish to reconstruct the object from just a handful of views or ideally just one view (see Fig. 1(a)); ii) objects' appearances are expected to be Lambertian and their albedos are supposed to be non-uniform.
Highlights
  • Rapid and automatic 3D object prototyping has become a game-changing innovation in many applications related to e-commerce, visualization, and architecture, to name a few
  • Inspired by the success of Long Short-Term Memory (LSTM) [33] networks [34,35] as well as recent progress in single-view 3D reconstruction using Convolutional Neural Networks [36,37], we propose a novel architecture that we call the 3D Recurrent Reconstruction Neural Network (3D-R2N2)
  • Our contributions are as follows:
    – We propose an extension of the standard LSTM framework, the 3D Recurrent Reconstruction Neural Network, which is suitable for accommodating multi-view image feeds in a principled manner.
    – We unify single- and multi-view 3D reconstruction in a single framework.
    – Our approach requires minimal supervision in training and testing.
    – Our extensive experimental analysis shows that our reconstruction framework outperforms the state-of-the-art method for single-view reconstruction [32].
    – Our network enables the 3D reconstruction of objects in situations when traditional SfM/SLAM methods fail.
  • We introduce a novel architecture named the 3D Recurrent Reconstruction Neural Network (3D-R2N2), which builds upon the standard Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU); a minimal sketch of the pipeline follows this list.
  • We proposed a novel architecture that unifies single- and multi-view 3D reconstruction into a single framework.
  • We further tested the network’s ability to perform multi-view reconstruction on the ShapeNet dataset [1] and the Online Products dataset [46], which showed that the network is able to incrementally improve its reconstructions as it sees more views of an object
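At a high level, 3D-R2N2 consists of a 2D convolutional encoder that maps each 127x127 input view to a feature vector, a 4x4x4 grid of recurrent units (3D-LSTM or 3D-GRU, 128 hidden units per cell) that fuses features across views, and a 3D deconvolutional decoder that outputs a 32x32x32 voxel occupancy grid. Below is a minimal sketch of that pipeline in PyTorch; the module names and intermediate layer widths are illustrative assumptions, not the authors' exact (Theano-based [43]) implementation.

```python
# Minimal sketch of the 3D-R2N2 pipeline (hypothetical layer sizes; the
# paper's exact architecture differs). Requires PyTorch.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """2D CNN: each RGB view -> a 1024-d feature vector (assumed size)."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class Conv3DGRU(nn.Module):
    """A 4x4x4 grid of GRU units with 3D-convolutional recurrence."""
    def __init__(self, feat_dim=1024, hidden=128, grid=4):
        super().__init__()
        self.hidden, self.grid = hidden, grid
        # The per-view feature is broadcast to every cell of the hidden grid.
        self.in_proj = nn.Linear(feat_dim, hidden)
        self.update = nn.Conv3d(2 * hidden, hidden, 3, padding=1)
        self.reset = nn.Conv3d(2 * hidden, hidden, 3, padding=1)
        self.cand = nn.Conv3d(2 * hidden, hidden, 3, padding=1)

    def forward(self, feats, h=None):
        # feats: (T, B, feat_dim) -- a sequence of per-view features.
        T, B, _ = feats.shape
        g = self.grid
        if h is None:
            h = feats.new_zeros(B, self.hidden, g, g, g)
        for t in range(T):  # one recurrent update per view
            x = self.in_proj(feats[t]).view(B, self.hidden, 1, 1, 1).expand_as(h)
            xh = torch.cat([x, h], dim=1)
            z = torch.sigmoid(self.update(xh))           # update gate
            r = torch.sigmoid(self.reset(xh))            # reset gate
            n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
            h = (1 - z) * h + z * n
        return h

class Decoder(nn.Module):
    """3D deconvolutions: 4^3 hidden grid -> 32^3 occupancy probabilities."""
    def __init__(self, hidden=128):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(hidden, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, h):
        return torch.sigmoid(self.deconv(h)).squeeze(1)  # (B, 32, 32, 32)

# Usage: 5 views of a batch of 2 objects -> one 32^3 occupancy grid each.
enc, rnn, dec = Encoder(), Conv3DGRU(), Decoder()
views = torch.randn(5, 2, 3, 127, 127)   # the paper uses 127x127 inputs
feats = torch.stack([enc(v) for v in views])
voxels = dec(rnn(feats))
print(voxels.shape)  # torch.Size([2, 32, 32, 32])
```

Because the recurrent grid is updated one view at a time, the same network handles a single image or an arbitrary-length image sequence, which is what unifies single- and multi-view reconstruction in one framework.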
Methods
  • The authors validate and demonstrate the capability of the approach with several experiments using the datasets described in Section 5.1.
  • The authors compare the performance of the network on the PASCAL 3D [45] dataset with that of the state-of-the-art method of Kar et al. [32] for single-view real-world image reconstruction (Section 5.3).
  • The authors show the network’s ability to perform multi-view reconstruction on the ShapeNet dataset [1] and the Online Products dataset [46] (Section 5.4, Section 5.5).
  • The authors compare the approach with a Multi-View Stereo (MVS) method on reconstructing objects with various texture levels and viewpoint sparsity (Section 5.6).
  • The authors split the dataset into training and testing sets, with 4/5 of the models used for training and the remaining 1/5 for testing.
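As a rough illustration of that protocol, the sketch below performs a seeded 4/5-1/5 split. The model_ids list and the seed are hypothetical; the authors' actual per-category split files are not reproduced here.

```python
# Minimal sketch of a 4/5 train / 1/5 test split (hypothetical file list).
import random

def split_dataset(model_ids, train_frac=0.8, seed=0):
    ids = sorted(model_ids)           # deterministic ordering
    random.Random(seed).shuffle(ids)  # seeded shuffle for reproducibility
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

train_ids, test_ids = split_dataset([f"model_{i:04d}" for i in range(1000)])
print(len(train_ids), len(test_ids))  # 800 200
```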
Results
  • As shown in Table 2, the approach outperforms the method of Kar et al. [32] in every category.
  • The authors' network trains and reconstructs without knowing the object category.
  • The authors' network does not require object segmentation masks and keypoint labels as additional inputs.
  • Kar et al. [32] do demonstrate the possibility of testing on an unlabeled image in the wild by estimating the segmentation and keypoints.
  • Even so, the authors' network outperforms their method when it is tested with ground-truth labels.
Conclusion
  • The authors proposed a novel architecture that unifies single- and multi-view 3D reconstruction into a single framework.
  • Even though the network can take variable-length inputs, the authors demonstrated that it outperforms the method of Kar et al. [32] in single-view reconstruction using real-world images.
  • The authors analyzed the network’s performance on multi-view reconstruction, finding that the method can produce accurate reconstructions when techniques such as MVS fail.
  • The authors' network does not require a minimum number of input images to produce a plausible reconstruction, and it overcomes past challenges of dealing with images that have insufficient texture or wide-baseline viewpoints.
Tables
  • Table 1: Reconstruction performance of 3D-LSTM variations according to cross-entropy loss and IoU using 5 views
  • Table 2: Per-category reconstruction on PASCAL VOC compared using voxel Intersection-over-Union (IoU). Note that the experiments were run with the same configuration, except that the method of Kar et al. [32] took ground-truth object segmentation masks and keypoint labels as additional inputs for both training and testing. (Both metrics are sketched below.)
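For reference, the two quantities reported in these tables can be computed as follows. This NumPy sketch uses the mean voxel-wise binary cross-entropy (the paper sums over voxels, which differs only by a constant factor) and thresholds the predicted occupancy probabilities before computing IoU; the threshold value of 0.4 is an illustrative assumption.

```python
# Minimal sketch of voxel-wise cross-entropy loss and voxel IoU.
import numpy as np

def voxel_cross_entropy(p, y, eps=1e-7):
    """Mean binary cross-entropy over all voxels.
    p: predicted occupancy probabilities in [0, 1]; y: ground truth in {0, 1}."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def voxel_iou(p, y, threshold=0.4):
    """IoU between the thresholded prediction and the ground-truth occupancy."""
    pred = p > threshold
    gt = y.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

p = np.random.rand(32, 32, 32)           # fake prediction
y = (np.random.rand(32, 32, 32) > 0.9)   # fake ground truth, ~10% occupied
print(voxel_cross_entropy(p, y.astype(float)), voxel_iou(p, y))
```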
Funding
  • We acknowledge the support of NSF CAREER grant N.1054127 and Toyota Award #122282
References
  • [1] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: An information-rich 3D model repository. Technical report, Stanford University — Princeton University — Toyota Technological Institute at Chicago (2015)
  • [2] Choi, S., Zhou, Q.Y., Miller, S., Koltun, V.: A large dataset of object scans. arXiv preprint arXiv:1602.02481 (2016)
  • [3] Fitzgibbon, A., Zisserman, A.: Automatic 3D model acquisition and generation of new images from video sequences. In: 9th European Signal Processing Conference (EUSIPCO 1998), IEEE (1998) 1–8
  • [4] Lhuillier, M., Quan, L.: A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(3) (2005) 418–433
  • [5] Agarwal, S., Snavely, N., Simon, I., Seitz, S.M., Szeliski, R.: Building Rome in a day. In: 2009 IEEE 12th International Conference on Computer Vision, IEEE (2009) 72–79
  • [6] Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: Large-scale direct monocular SLAM. In: Computer Vision – ECCV 2014, Springer (2014) 834–849
  • [7] Häming, K., Peters, G.: The structure-from-motion reconstruction pipeline – a survey with focus on short image sequences. Kybernetika 46(5) (2010) 926–937
  • [8] Fuentes-Pacheco, J., Ruiz-Ascencio, J., Rendón-Mancha, J.M.: Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review 43(1) (2015) 55–81
  • [9] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2) (2004) 91–110
  • [10] Bhat, D.N., Nayar, S.K.: Ordinal measures for image correspondence. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(4) (1998) 415–423
  • [11] Saponaro, P., Sorensen, S., Rhein, S., Mahoney, A.R., Kambhamettu, C.: Reconstruction of textureless regions using structure from motion and image-based interpolation. In: 2014 IEEE International Conference on Image Processing (ICIP), IEEE (2014) 1847–1851
  • [12] Seitz, S.M., Dyer, C.R.: Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision 35(2) (1999) 151–173
  • [13] Kutulakos, K.N., Seitz, S.M.: A theory of shape by space carving. International Journal of Computer Vision 38(3) (2000) 199–218
  • [14] Slabaugh, G.G., Culbertson, W.B., Malzbender, T., Stevens, M.R., Schafer, R.W.: Methods for volumetric reconstruction of visual scenes. International Journal of Computer Vision 57(3) (2004) 179–199
  • [15] Anwar, Z., Ferrie, F.: Towards robust voxel-coloring: Handling camera calibration errors and partial emptiness of surface voxels. In: Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), Volume 1, IEEE Computer Society (2006) 98–102
  • [16] Broadhurst, A., Drummond, T.W., Cipolla, R.: A probabilistic framework for space carving. In: Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), Volume 1, IEEE (2001) 388–393
  • [17] Dame, A., Prisacariu, V.A., Ren, C.Y., Reid, I.: Dense reconstruction using 3D object shape priors. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013)
  • [18] Bao, Y., Chandraker, M., Lin, Y., Savarese, S.: Dense object reconstruction using semantic priors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • [19] Roberts, L.G.: Machine perception of three-dimensional solids. Ph.D. thesis (1963)
  • [20] Nevatia, R., Binford, T.O.: Description and recognition of curved objects. Artificial Intelligence 8(1) (1977) 77–98
  • [21] Zia, M.Z., Stark, M., Schiele, B., Schindler, K.: Detailed 3D representations for object modeling and recognition. TPAMI (2013)
  • [22] Rock, J., Gupta, T., Thorsen, J., Gwak, J., Shin, D., Hoiem, D.: Completing 3D object shape from one depth image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 2484–2493
  • [23] Choy, C.B., Stark, M., Corbett-Davies, S., Savarese, S.: Enriching object detection with 2D-3D registration and continuous viewpoint estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)
  • [24] Blanz, V., Vetter, T.: Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9) (2003) 1063–1074
  • [25] Matthews, I., Xiao, J., Baker, S.: 2D vs. 3D deformable face models: Representational power, construction, and real-time fitting. International Journal of Computer Vision 75(1) (2007) 93–113
  • [26] Kemelmacher-Shlizerman, I., Basri, R.: 3D face reconstruction from a single image using a single reference face shape. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(2) (2011) 394–405
  • [27] Prisacariu, V.A., Segal, A.V., Reid, I.: Simultaneous monocular 2D segmentation, 3D pose recovery and 3D reconstruction. In: Computer Vision – ACCV 2012, Springer (2012) 593–606
  • [28] Sandhu, R., Dambreville, S., Yezzi, A., Tannenbaum, A.: A nonrigid kernel-based framework for 2D-3D pose estimation and 2D image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(6) (2011) 1098–1115
  • [29] Saxena, A., Sun, M., Ng, A.Y.: Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5) (May 2009) 824–840
  • [30] Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. ACM Transactions on Graphics (TOG) 24(3) (2005) 577–584
  • [31] Vicente, S., Carreira, J., Agapito, L., Batista, J.: Reconstructing PASCAL VOC. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
  • [32] Kar, A., Tulsiani, S., Carreira, J., Malik, J.: Category-specific object reconstruction from a single image. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2015) 1966–1974
  • [33] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8) (November 1997) 1735–1780
  • [34] Sundermeyer, M., Schlüter, R., Ney, H.: LSTM neural networks for language modeling. In: INTERSPEECH (2012) 194–197
  • [35] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems (2014) 3104–3112
  • [36] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems 27 (2014)
  • [37] Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • [38] Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2) (March 1994) 157–166
  • [39] Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv e-prints (2014)
  • [40] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv e-prints (December 2015)
  • [41] Dosovitskiy, A., Springenberg, J.T., Brox, T.: Learning to generate chairs with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  • [42] Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (2011)
  • [43] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy) (June 2010)
  • [44] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv e-prints (2014)
  • [45] Xiang, Y., Mottaghi, R., Savarese, S.: Beyond PASCAL: A benchmark for 3D object detection in the wild. In: 2014 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE (2014) 75–82
  • [46] Song, H.O., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. arXiv e-prints (2015)
  • [47] CGStudio (2016) [Online; accessed 14-March-2016]
  • [48] Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.: PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG) 28(3) (2009) 24
  • [49] OpenMVS: open multi-view stereo reconstruction library (2015) [Online; accessed 14-March-2016]
  • [50] Moulon, P., Monasse, P., Marlet, R.: Global fusion of relative motions for robust, accurate and scalable structure from motion. In: Proceedings of the IEEE International Conference on Computer Vision (2013) 3248–3255