Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations.

Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 1119–1130.

Abstract

Unsupervised learning with generative models has the potential of discovering rich representations of 3D scenes. While geometric deep learning has explored 3D-structure-aware representations of scene geometry, these models typically require explicit 3D supervision. Emerging neural scene representations can be trained only with posed 2D images […]

Introduction
  • A major driver behind recent work on generative models has been the promise of unsupervised discovery of powerful neural scene representations, enabling downstream tasks ranging from robotic manipulation and few-shot 3D reconstruction to navigation.
  • In these models, multi-view geometry and projection operations are performed by a black-box neural renderer, which is expected to learn these operations from data.
  • As a result, such approaches fail to discover 3D structure under limited training data, lack guarantees on multi-view consistency of the rendered images, and learned representations are generally not interpretable.
  • These scene representations are discrete, limiting the achievable spatial resolution, only sparsely sampling the underlying smooth surfaces of a scene, and often requiring explicit 3D supervision (a quick back-of-the-envelope comparison follows this list).
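A quick back-of-the-envelope comparison illustrates the cost of discretization; the grid resolution, feature width, and MLP size below are illustrative choices, not numbers taken from the paper.

```python
# Memory of a dense feature voxel grid vs. a small coordinate MLP (illustrative numbers).
resolution, channels, bytes_per_float = 128, 64, 4
voxel_grid_bytes = resolution ** 3 * channels * bytes_per_float
print(f"voxel grid: {voxel_grid_bytes / 1e9:.2f} GB")  # ~0.54 GB, grows cubically with resolution

mlp_params = 3 * 256 + 256 * 256 + 256 * 64             # weights of a small 3-layer MLP
print(f"coordinate MLP: {mlp_params * bytes_per_float / 1e6:.2f} MB")  # ~0.33 MB, resolution-independent
```

A continuous representation, as introduced below, sidesteps this trade-off because its memory footprint does not grow with the sampling resolution.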
Highlights
  • A major driver behind recent work on generative models has been the promise of unsupervised discovery of powerful neural scene representations, enabling downstream tasks ranging from robotic manipulation and few-shot 3D reconstruction to navigation
  • We introduce Scene Representation Networks (SRNs), a continuous neural scene representation, along with a differentiable rendering algorithm, that model both 3D scene geometry and appearance, enforce 3D structure in a multi-view consistent manner, and naturally allow generalization of shape and appearance priors across scenes
  • We introduce SRNs, a 3D-structured neural scene representation that implicitly represents a scene as a continuous, differentiable function
  • This function maps 3D coordinates to a feature-based representation of the scene and can be trained end-to-end with a differentiable ray marcher to render the feature-based representation into a set of 2D images (a minimal sketch of this idea follows this list)
  • SRNs could be explored in a probabilistic framework [2, 3], enabling sampling of feasible scenes given a set of observations
  • As SRNs are differentiable with respect to camera parameters, future work may integrate them with learned algorithms for camera pose estimation [72]
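To make the “continuous, differentiable function” above concrete, here is a minimal sketch of such a representation as a coordinate MLP, paired with a pixel generator that decodes features into colors. It is written in PyTorch and is purely illustrative: the layer widths, depth, and feature_dim are placeholder choices rather than the architecture reported in the paper, and the per-scene latent codes mentioned later (see the latent-code interpolation under “Study subjects and analysis”) are omitted.

```python
import torch
import torch.nn as nn

class SceneRepresentation(nn.Module):
    """Φ: maps a 3D world coordinate (x, y, z) to a feature vector describing
    the scene at that point. Layer widths and depth are placeholders."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, xyz):          # xyz: (N, 3) world coordinates
        return self.net(xyz)         # (N, feature_dim) scene features


class PixelGenerator(nn.Module):
    """Decodes the feature found along a camera ray into an RGB color."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, 3), nn.Sigmoid(),
        )

    def forward(self, features):     # features: (N, feature_dim)
        return self.net(features)    # (N, 3) colors in [0, 1]
```

Because Φ is queried at arbitrary continuous coordinates, it is not tied to a fixed voxel resolution, and both modules are ordinary differentiable networks, so they can be trained end-to-end from posed 2D images through a differentiable renderer.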
Methods
  • The authors train SRNs on several object classes and evaluate them for novel view synthesis and few-shot reconstruction.
  • Please see the supplement for a comparison on single-scene novel view synthesis performance with DeepVoxels [6].
  • Hyperparameters, computational complexity, and full network architectures for SRNs and all baselines are in the supplement.
  • Training of the presented models takes on the order of 6 days.
  • A single forward pass takes around 120 ms and 3 GB of GPU memory per batch item.
Conclusion
  • The authors introduce SRNs, a 3D-structured neural scene representation that implicitly represents a scene as a continuous, differentiable function
  • This function maps 3D coordinates to a feature-based representation of the scene and can be trained end-to-end with a differentiable ray marcher to render the feature-based representation into a set of 2D images.
  • SRNs could be extended to model view- and lighting-dependent effects, translucency, and participating media.
  • As SRNs are differentiable with respect to camera parameters, future work may integrate them with learned algorithms for camera pose estimation [72] (a sketch of what such differentiability enables follows this list).
  • Please see the supplemental material for further details on directions for future work
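Differentiability with respect to camera parameters means that, in principle, an initial pose estimate could be refined by gradient descent through the renderer. The sketch below only illustrates that general idea; it is not a method from the paper, and render, target_image, and the 6-DoF pose parameterization are hypothetical placeholders.

```python
import torch

def refine_pose(render, target_image, init_pose, steps=100, lr=1e-2):
    """Refine a camera pose by gradient descent through a differentiable renderer.
    `render(pose)` must return an image tensor differentiable w.r.t. `pose`;
    the pose parameterization (e.g. a 6-vector) is an assumed placeholder."""
    pose = init_pose.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(render(pose), target_image)
        loss.backward()
        optimizer.step()
    return pose.detach()
```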
Tables
  • Table 1: PSNR (in dB) and SSIM of images reconstructed with our method, the deterministic variant of the GQN [2] (dGQN), the model proposed by Tatarchenko et al. [1] (TCO), and the method proposed by Worrall et al. [4] (WRL). We compare novel-view synthesis performance on objects in the training set (containing 50 images of each object), as well as reconstruction from 1 or 2 images on the held-out test set (PSNR as defined in the sketch below).
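For reference, PSNR here is the standard peak signal-to-noise ratio; a minimal way to compute it for images scaled to [0, 1] is sketched below. This is a generic metric implementation, not code from the paper.

```python
import numpy as np

def psnr(prediction, ground_truth, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((prediction - ground_truth) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# SSIM is typically taken from an existing library,
# e.g. skimage.metrics.structural_similarity.
```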
Related work
  • Our approach lies at the intersection of multiple fields. In the following, we review related work.

    Geometric Deep Learning. Geometric deep learning has explored various representations to reason about scene geometry. Discretization-based techniques use voxel grids [7, 16,17,18,19,20,21,22], octree hierarchies [23,24,25], point clouds [11, 26, 27], multiplane images [28], patches [29], or meshes [15, 21, 30, 31]. Methods based on function spaces continuously represent space as the decision boundary of a learned binary classifier [32] or a continuous signed distance field [33,34,35]. While these techniques are successful at modeling geometry, they often require 3D supervision, and it is unclear how to efficiently infer and represent appearance. Our proposed method encapsulates both scene geometry and appearance, and can be trained end-to-end via learned differentiable rendering, supervised only with posed 2D images.
Funding
  • Vincent Sitzmann was supported by a Stanford Graduate Fellowship
  • Michael Zollhöfer was supported by the Max Planck Center for Visual Computing and Communication (MPC-VCC)
  • Gordon Wetzstein was supported by NSF awards (IIS 1553333, CMMI 1839974), by a Sloan Fellowship, by an Okawa Research Grant, and by a PECASE
Study subjects and analysis
observations: 15
We evaluate our approach on 7-element Shepard-Metzler objects in a limited-data setting. We render 15 observations of 1k objects at a resolution of 64 × 64. We train both SRNs and a deterministic variant of the Generative Query Network [2] (dGQN; see the supplement for an extended discussion).

observations: 50
We consider the “chair” and “car” classes of Shapenet v.2 [39] with 4.5k and 2.5k model instances respectively. We disable transparencies and specularities, and train on 50 observations of each instance at a resolution of 128 × 128 pixels. Camera poses are randomly generated on a sphere with the object at the origin
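A hedged sketch of how such poses could be generated is given below: a camera center drawn uniformly on a sphere, combined with a look-at matrix pointing at the object at the origin. The radius, up vector, and OpenGL-style camera convention are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def random_pose_on_sphere(radius=1.3, up=np.array([0.0, 1.0, 0.0])):
    """Sample a camera center uniformly on a sphere and build a camera-to-world
    matrix that looks at the origin (camera looks down its -z axis)."""
    v = np.random.normal(size=3)
    eye = radius * v / np.linalg.norm(v)   # camera center on the sphere
    forward = -eye / np.linalg.norm(eye)   # unit vector toward the origin
    right = np.cross(forward, up)          # degenerate if forward is parallel to up (ignored here)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    cam2world = np.eye(4)
    cam2world[:3, 0] = right
    cam2world[:3, 1] = true_up
    cam2world[:3, 2] = -forward            # camera -z axis points at the object
    cam2world[:3, 3] = eye
    return cam2world
```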

observations: 500
We demonstrate reconstruction of a room-scale scene with SRNs. We train a single SRN on 500 observations of a Minecraft room. The room contains multiple objects as well as four columns, such that parts of the scene are occluded in most observations.

observations: 15
Overview: at the heart of SRNs lies a continuous, 3D-aware neural scene representation, Φ, which represents a scene as a function that maps (x, y, z) world coordinates to a feature representation of the scene at those coordinates (see Sec. 3.1). A neural renderer Θ, consisting of a learned ray marcher and a pixel generator, can render the scene from arbitrary novel viewpoints (see Sec. 3.2).
Figure captions: Shepard-Metzler object from the 1k-object training set, 15 observations each; SRNs (right) outperform dGQN (left) on this small dataset. Non-rigid animation of a face; note that mouth movement is directly reflected in the normal maps.
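The renderer Θ can be pictured as a loop that repeatedly queries Φ along each camera ray, lets a learned module decide how far to step next, and finally decodes the feature at the ray's end point into a color. Below is a heavily simplified sketch of that loop; the stateless step predictor (the paper describes an LSTM-based ray marcher), the fixed step count, and the tensor shapes are assumptions for illustration rather than the paper's exact algorithm.

```python
import torch

def render_rays(phi, pixel_generator, step_predictor,
                ray_origins, ray_dirs, num_steps=10):
    """Simplified differentiable ray-marching loop.

    phi:             (N, 3) coordinates -> (N, F) features (the scene representation)
    pixel_generator: (N, F) features    -> (N, 3) colors
    step_predictor:  (N, F) features    -> (N, 1) positive step lengths (assumed stateless)
    ray_origins, ray_dirs: (N, 3) per-pixel camera rays in world coordinates
    """
    depth = torch.zeros(ray_origins.shape[0], 1)   # distance travelled along each ray
    for _ in range(num_steps):
        points = ray_origins + depth * ray_dirs    # current 3D sample point per ray
        features = phi(points)                     # query the scene representation
        depth = depth + step_predictor(features)   # learned step toward the surface
    final_features = phi(ray_origins + depth * ray_dirs)
    return pixel_generator(final_features)         # (N, 3) rendered colors
```

Since every operation in this loop is differentiable, the image-space reconstruction loss can be backpropagated into Φ, the step predictor, and the pixel generator, which is what allows training from posed 2D images alone.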

observations: 50
Figure captions: Interpolating latent code vectors of cars and chairs in the Shapenet dataset while rotating the camera around the model; features smoothly transition from one model to another. Qualitative comparison with Tatarchenko et al. [1] and the deterministic variant of the GQN [2] for novel view synthesis on the Shapenet v2 “cars” and “chairs” classes: we compare novel views for objects reconstructed from 50 observations in the training set (top row) and from two observations or a single observation (second and third rows) from a test set; SRNs consistently outperform these baselines with multi-view-consistent novel views, while also reconstructing geometry; please see the supplemental video for more comparisons, smooth camera trajectories, and reconstructed geometry. Single-shot (left) and two-shot (both) novel-view synthesis on objects in the held-out, official Shapenet v2 test set, reconstructed from one or two reference views, as discussed in Sec. 3.4.

References
  • [1] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Single-view to multi-view: Reconstructing unseen views with a convolutional network,” CoRR, abs/1511.06702, 2015.
  • [2] S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor et al., “Neural scene representation and rendering,” Science, vol. 360, no. 6394, pp. 1204–1210, 2018.
  • [3] A. Kumar, S. A. Eslami, D. Rezende, M. Garnelo, F. Viola, E. Lockhart, and M. Shanahan, “Consistent jumpy predictions for videos and scenes,” 2018.
  • [4] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow, “Interpretable transformations with encoder-decoder networks,” in Proc. ICCV, 2017.
  • [5] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Proc. IROS, 2015, pp. 922–928.
  • [6] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhöfer, “Deepvoxels: Learning persistent 3d feature embeddings,” in Proc. CVPR, 2019.
  • [7] A. Kar, C. Häne, and J. Malik, “Learning a multi-view stereo machine,” in Proc. NIPS, 2017, pp. 365–376.
  • [8] H.-Y. F. Tung, R. Cheng, and K. Fragkiadaki, “Learning spatial common sense with geometry-aware recurrent networks,” in Proc. CVPR, 2019.
  • [9] T. H. Nguyen-Phuoc, C. Li, S. Balaban, and Y. Yang, “Rendernet: A deep convolutional network for differentiable rendering from 3d shapes,” in Proc. NIPS, 2018.
  • [10] J.-Y. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. Tenenbaum, and B. Freeman, “Visual object networks: image generation with disentangled 3d representations,” in Proc. NIPS, 2018, pp. 118–129.
  • [11] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proc. CVPR, 2017.
  • [12] E. Insafutdinov and A. Dosovitskiy, “Unsupervised learning of shape and pose with differentiable point clouds,” in Proc. NIPS, 2018, pp. 2802–2812.
  • [13] M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely, and R. Martin-Brualla, “Neural rerendering in the wild,” in Proc. CVPR, 2019.
  • [14] C.-H. Lin, C. Kong, and S. Lucey, “Learning efficient point cloud generation for dense 3d object reconstruction,” in Proc. AAAI, 2018.
  • [15] D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shirazi, F. Maire, and A. Eriksson, “Learning free-form deformations for 3d object reconstruction,” CoRR, 2018.
  • [16] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik, “Multi-view supervision for single-view reconstruction via differentiable ray consistency,” in Proc. CVPR, 2017.
  • [17] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,” in Proc. NIPS, 2016, pp. 82–90.
  • [18] M. Gadelha, S. Maji, and R. Wang, “3d shape induction from 2d views of multiple objects,” in Proc. 3DV, 2017, pp. 402–411.
  • [19] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. Guibas, “Volumetric and multi-view cnns for object classification on 3d data,” in Proc. CVPR, 2016.
  • [20] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman, “Pix3d: Dataset and methods for single-image 3d shape modeling,” in Proc. CVPR, 2018.
  • [21] D. Jimenez Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess, “Unsupervised learning of 3d structure from images,” in Proc. NIPS, 2016.
  • [22] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3d-r2n2: A unified approach for single and multi-view 3d object reconstruction,” in Proc. ECCV, 2016.
  • [23] G. Riegler, A. O. Ulusoy, and A. Geiger, “Octnet: Learning deep 3d representations at high resolutions,” in Proc. CVPR, 2017.
  • [24] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs,” in Proc. ICCV, 2017, pp. 2107–2115.
  • [25] C. Haene, S. Tulsiani, and J. Malik, “Hierarchical surface prediction,” IEEE TPAMI, 2019.
  • [26] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas, “Learning representations and generative models for 3D point clouds,” in Proc. ICML, 2018, pp. 40–49.
  • [27] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Multi-view 3d models from single images with a convolutional network,” in Proc. ECCV, 2016.
  • [28] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: learning view synthesis using multiplane images,” ACM Trans. Graph., vol. 37, no. 4, pp. 65:1–65:12, 2018.
  • [29] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “Atlasnet: A papier-mâché approach to learning 3d surface generation,” in Proc. CVPR, 2018.
  • [30] H. Kato, Y. Ushiku, and T. Harada, “Neural 3d mesh renderer,” in Proc. CVPR, 2018, pp. 3907–3916.
  • [31] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik, “Learning category-specific mesh reconstruction from image collections,” in Proc. ECCV, 2018.
  • [32] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger, “Occupancy networks: Learning 3d reconstruction in function space,” in Proc. CVPR, 2019.
  • [33] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” arXiv preprint arXiv:1901.05103, 2019.
  • [34] K. Genova, F. Cole, D. Vlasic, A. Sarna, W. T. Freeman, and T. Funkhouser, “Learning shape templates with structured implicit functions,” in Proc. ICCV, 2019.
  • [35] B. Deng, K. Genova, S. Yazdani, S. Bouaziz, G. Hinton, and A. Tagliasacchi, “Cvxnets: Learnable convex decomposition,” arXiv preprint arXiv:1909.05736, 2019.
  • [36] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y. Yang, “Hologan: Unsupervised learning of 3d representations from natural images,” in Proc. ICCV, 2019.
  • [37] F. Alet, A. K. Jeewajee, M. Bauza, A. Rodriguez, T. Lozano-Perez, and L. P. Kaelbling, “Graph element networks: adaptive, structured computation and memory,” in Proc. ICML, 2019.
  • [38] Y. Liu, Z. Wu, D. Ritchie, W. T. Freeman, J. B. Tenenbaum, and J. Wu, “Learning to describe scenes with programs,” in Proc. ICLR, 2019.
  • [39] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
  • [40] G. E. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [41] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in Proc. ICLR, 2013.
  • [42] L. Dinh, D. Krueger, and Y. Bengio, “NICE: non-linear independent components estimation,” in Proc. ICLR Workshops, 2015.
  • [43] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in Proc. NeurIPS, 2018, pp. 10236–10245.
  • [44] A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, “Conditional image generation with pixelcnn decoders,” in Proc. NIPS, 2016, pp. 4797–4805.
  • [45] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in Proc. ICML, 2016.
  • [46] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NIPS, 2014.
  • [47] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proc. ICML, 2017.
  • [48] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” in Proc. ICLR, 2018.
  • [49] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, “Generative visual manipulation on the natural image manifold,” in Proc. ECCV, 2016.
  • [50] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in Proc. ICLR, 2016.
  • [51] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
  • [52] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. CVPR, 2017, pp. 5967–5976.
  • [53] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017.
  • [54] K. O. Stanley, “Compositional pattern producing networks: A novel abstraction of development,” Genetic Programming and Evolvable Machines, vol. 8, no. 2, pp. 131–162, 2007.
  • [55] A. Mordvintsev, N. Pezzotti, L. Schubert, and C. Olah, “Differentiable image parameterizations,” Distill, vol. 3, no. 7, p. e12, 2018.
  • [56] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision,” in Proc. NIPS, 2016.
  • [57] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Proc. NIPS, 2015.
  • [58] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in Proc. ICANN, 2011.
  • [59] A. Yuille and D. Kersten, “Vision as Bayesian inference: analysis by synthesis?” Trends in Cognitive Sciences, vol. 10, pp. 301–308, 2006.
  • [60] T. Bever and D. Poeppel, “Analysis by synthesis: A (re-)emerging program of research for language and vision,” Biolinguistics, vol. 4, no. 2, pp. 174–200, 2010.
  • [61] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, “Deep convolutional inverse graphics network,” in Proc. NIPS, 2015.
  • [62] J. Yang, S. Reed, M.-H. Yang, and H. Lee, “Weakly-supervised disentangling with recurrent transformations for 3d view synthesis,” in Proc. NIPS, 2015.
  • [63] T. D. Kulkarni, P. Kohli, J. B. Tenenbaum, and V. K. Mansinghka, “Picture: A probabilistic programming language for scene perception,” in Proc. CVPR, 2015.
  • [64] H. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki, “Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired supervision,” in Proc. ICCV, 2017.
  • [65] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras, “Neural face editing with intrinsic image disentangling,” in Proc. CVPR, 2017.
  • [66] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., Cambridge University Press.
  • [67] J. C. Hart, “Sphere tracing: A geometric method for the antialiased ray tracing of implicit surfaces,” The Visual Computer, vol. 12, no. 10, pp. 527–545, 1996.
  • [68] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with pixelcnn decoders,” in Proc. NIPS, 2016.
  • [69] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [70] D. Ha, A. Dai, and Q. V. Le, “Hypernetworks,” in Proc. ICLR, 2017.
  • [71] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, “A 3d face model for pose and illumination invariant face recognition,” in Proc. AVSS, 2009, pp. 296–301.
  • [72] C. Tang and P. Tan, “Ba-net: Dense bundle adjustment network,” in Proc. ICLR, 2019.
  • [73] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, 2017, pp. 1126–1135.
Authors
Vincent Sitzmann, Michael Zollhöfer, Gordon Wetzstein