Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations.
Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, pp. 1119–1130.
- A major driver behind recent work on generative models has been the promise of unsupervised discovery of powerful neural scene representations, enabling downstream tasks ranging from robotic manipulation and few-shot 3D reconstruction to navigation.
- In prior work, multi-view geometry and projection operations are performed by a black-box neural renderer that is expected to learn these operations from data.
- As a result, such approaches fail to discover 3D structure under limited training data, lack guarantees on multi-view consistency of the rendered images, and yield learned representations that are generally not interpretable.
- These scene representations are discrete: they limit the achievable spatial resolution, only sparsely sample the underlying smooth surfaces of a scene, and often require explicit 3D supervision.
- We introduce Scene Representation Networks (SRNs), a continuous neural scene representation, along with a differentiable rendering algorithm, that model both 3D scene geometry and appearance, enforce 3D structure in a multi-view consistent manner, and naturally allow generalization of shape and appearance priors across scenes.
- SRNs are a 3D-structured neural scene representation that implicitly represents a scene as a continuous, differentiable function.
- This function maps 3D coordinates to a feature-based representation of the scene and can be trained end-to-end with a differentiable ray marcher that renders the feature-based representation into a set of 2D images; a toy sketch of this pipeline follows these highlights.
- SRNs could be explored in a probabilistic framework [2, 3], enabling sampling of feasible scenes given a set of observations.
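To make the pipeline in the highlights concrete, here is a minimal, illustrative sketch in PyTorch of a coordinate-to-feature scene function with a naive fixed-step ray marcher. This is not the authors' implementation: the paper conditions the scene function on a per-scene latent code via a hypernetwork and uses a learned LSTM-based ray marcher, and every name and hyperparameter below is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class SceneFunction(nn.Module):
    """Toy continuous scene representation: maps 3D coordinates to a feature
    vector. (Illustrative only -- in the paper, these weights are produced
    from a per-scene latent code by a hypernetwork.)"""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, xyz):      # xyz: (num_rays, 3) world coordinates
        return self.net(xyz)     # (num_rays, feature_dim)

def march_rays(scene, origins, dirs, n_steps=10, step=0.1):
    """Naive fixed-step ray marcher; the paper instead predicts adaptive
    step lengths with a learned (LSTM) marcher."""
    depth = torch.full((origins.shape[0], 1), 0.05)  # initial depth per ray
    points = origins + depth * dirs
    for _ in range(n_steps):
        depth = depth + step                         # fixed increment per step
        points = origins + depth * dirs              # current sample on each ray
    return scene(points)                             # features at final points
```

A per-pixel generator network would then map each feature vector to an RGB value, so a photometric loss against posed 2D images can train the scene function, the marcher, and the generator end-to-end.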
- The authors train SRNs on several object classes and evaluate them for novel view synthesis and few-shot reconstruction.
- Please see the supplement for a comparison on single-scene novel view synthesis performance with DeepVoxels.
- Hyperparameters, computational complexity, and full network architectures for SRNs and all baselines are in the supplement.
- Training of the presented models takes on the order of 6 days.
- A single forward pass takes around 120 ms and 3 GB of GPU memory per batch item.
- SRNs could be extended to model view- and lighting-dependent effects, translucency, and participating media.
- As SRNs are differentiable with respect to camera parameters, future work may alternatively integrate them with learned algorithms for camera pose estimation; a toy sketch of gradient-based pose refinement follows these conclusion bullets.
- Please see the supplemental material for further details on directions for future work.
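As a toy illustration of the camera-differentiability point above (a sketch of the general idea, not an experiment from the paper): given any differentiable renderer mapping a pose to an image, a camera pose can in principle be refined by gradient descent against an observed image. Here `render`, `target`, and the 6-DoF pose parameterization are all assumed.

```python
import torch

def refine_pose(render, target, pose_init, steps=200, lr=1e-2):
    """Gradient-based camera pose refinement through a differentiable renderer.
    `render` maps a pose vector to an image; `target` is the observed image."""
    pose = pose_init.detach().clone().requires_grad_(True)  # e.g. a 6-DoF vector
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((render(pose) - target) ** 2)  # photometric L2 error
        loss.backward()                                  # grads flow through the renderer
        opt.step()
    return pose.detach()
```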
- Table 1: PSNR (in dB) and SSIM of images reconstructed with our method, the deterministic variant of the GQN [2] (dGQN), the model proposed by Tatarchenko et al. [1] (TCO), and the method proposed by Worrall et al. [4] (WRL). We compare novel-view synthesis performance on objects in the training set (containing 50 images of each object), as well as reconstruction from 1 or 2 images on the held-out test set.
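For reference, the PSNR values in Table 1 follow the standard definition from mean squared error; a minimal sketch, assuming images scaled to [0, 1]:

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB (standard definition)."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```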
- Our approach lies at the intersection of multiple fields. In the following, we review related work.
Geometric Deep Learning. Geometric deep learning has explored various representations to reason about scene geometry. Discretization-based techniques use voxel grids [7, 16,17,18,19,20,21,22], octree hierarchies [23,24,25], point clouds [11, 26, 27], multiplane images, patches, or meshes [15, 21, 30, 31]. Methods based on function spaces continuously represent space as the decision boundary of a learned binary classifier or as a continuous signed distance field [33,34,35]. While these techniques are successful at modeling geometry, they often require 3D supervision, and it is unclear how to efficiently infer and represent appearance. Our proposed method encapsulates both scene geometry and appearance, and can be trained end-to-end via learned differentiable rendering, supervised only with posed 2D images.
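To make the function-space idea concrete: a learned signed distance field is any network f(x) approximating the distance from x to the nearest surface, and its zero level set can be rendered by sphere tracing (Hart's method, cited in the references below). A minimal sketch, where `sdf` stands in for any trained distance network (an assumption, not a specific model from the paper):

```python
import torch

def sphere_trace(sdf, origins, dirs, n_steps=50, eps=1e-3):
    """Sphere tracing against a learned signed distance field. `sdf` is any
    callable mapping (N, 3) points to (N,) signed distances; each step
    advances every ray by its current distance to the surface, which is the
    largest step guaranteed not to overshoot."""
    t = torch.zeros(origins.shape[0])               # distance traveled per ray
    for _ in range(n_steps):
        points = origins + t.unsqueeze(-1) * dirs   # current point on each ray
        d = sdf(points)                             # signed distance to surface
        t = t + d                                   # safe step of length d
        if torch.all(d.abs() < eps):                # all rays converged
            break
    return origins + t.unsqueeze(-1) * dirs         # approximate surface points
```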
- Vincent Sitzmann was supported by a Stanford Graduate Fellowship
- Michael Zollhöfer was supported by the Max Planck Center for Visual Computing and Communication (MPC-VCC)
- Gordon Wetzstein was supported by NSF awards (IIS 1553333, CMMI 1839974), by a Sloan Fellowship, by an Okawa Research Grant, and by a PECASE.
- M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Single-view to multi-view: Reconstructing unseen views with a convolutional network,” CoRR, vol. abs/1511.06702, 2015.
- S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor et al., “Neural scene representation and rendering,” Science, vol. 360, no. 6394, pp. 1204–1210, 2018.
- A. Kumar, S. A. Eslami, D. Rezende, M. Garnelo, F. Viola, E. Lockhart, and M. Shanahan, “Consistent jumpy predictions for videos and scenes,” 2018.
- D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow, “Interpretable transformations with encoder-decoder networks,” in Proc. ICCV, vol. 4, 2017.
- D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Proc. IROS, 2015, pp. 922–928.
- V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhöfer, “Deepvoxels: Learning persistent 3d feature embeddings,” in Proc. CVPR, 2019.
- A. Kar, C. Häne, and J. Malik, “Learning a multi-view stereo machine,” in Proc. NIPS, 2017, pp. 365–376.
- H.-Y. F. Tung, R. Cheng, and K. Fragkiadaki, “Learning spatial common sense with geometry-aware recurrent networks,” Proc. CVPR, 2019.
- T. H. Nguyen-Phuoc, C. Li, S. Balaban, and Y. Yang, “Rendernet: A deep convolutional network for differentiable rendering from 3d shapes,” in Proc. NIPS, 2018.
- J.-Y. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. Tenenbaum, and B. Freeman, “Visual object networks: image generation with disentangled 3d representations,” in Proc. NIPS, 2018, pp. 118–129.
- C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” Proc. CVPR, 2017.
- E. Insafutdinov and A. Dosovitskiy, “Unsupervised learning of shape and pose with differentiable point clouds,” in Proc. NIPS, 2018, pp. 2802–2812.
- M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely, and R. Martin-Brualla, “Neural rerendering in the wild,” Proc. CVPR, 2019.
- C.-H. Lin, C. Kong, and S. Lucey, “Learning efficient point cloud generation for dense 3d object reconstruction,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- D. Jack, J. K. Pontes, S. Sridharan, C. Fookes, S. Shirazi, F. Maire, and A. Eriksson, “Learning free-form deformations for 3d object reconstruction,” CoRR, 2018.
- S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik, “Multi-view supervision for single-view reconstruction via differentiable ray consistency,” in Proc. CVPR, 2017.
- J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,” in Proc. NIPS, 2016, pp. 82–90.
- M. Gadelha, S. Maji, and R. Wang, “3d shape induction from 2d views of multiple objects,” in Proc. 3DV, 2017, pp. 402–411.
- C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. Guibas, “Volumetric and multi-view cnns for object classification on 3d data,” in Proc. CVPR, 2016.
- X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman, “Pix3d: Dataset and methods for single-image 3d shape modeling,” in Proc. CVPR, 2018.
- D. Jimenez Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess, “Unsupervised learning of 3d structure from images,” in Proc. NIPS, 2016.
- C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3d-r2n2: A unified approach for single and multi-view 3d object reconstruction,” in Proc. ECCV, 2016.
- G. Riegler, A. O. Ulusoy, and A. Geiger, “Octnet: Learning deep 3d representations at high resolutions,” in Proc. CVPR, 2017.
- M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs,” in Proc. ICCV, 2017, pp. 2107–2115.
- C. Haene, S. Tulsiani, and J. Malik, “Hierarchical surface prediction,” IEEE Trans. PAMI, 2019.
- P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas, “Learning representations and generative models for 3D point clouds,” in Proc. ICML, 2018, pp. 40–49.
- M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Multi-view 3d models from single images with a convolutional network,” in Proc. ECCV, 2016.
- T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: learning view synthesis using multiplane images,” ACM Trans. Graph., vol. 37, no. 4, pp. 65:1–65:12, 2018.
- T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “Atlasnet: A papier-mâché approach to learning 3d surface generation,” in Proc. CVPR, 2018.
- H. Kato, Y. Ushiku, and T. Harada, “Neural 3d mesh renderer,” in Proc. CVPR, 2018, pp. 3907–3916.
- A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik, “Learning category-specific mesh reconstruction from image collections,” in ECCV, 2018.
- L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger, “Occupancy networks: Learning 3d reconstruction in function space,” in Proc. CVPR, 2019.
- J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” arXiv preprint arXiv:1901.05103, 2019.
- K. Genova, F. Cole, D. Vlasic, A. Sarna, W. T. Freeman, and T. Funkhouser, “Learning shape templates with structured implicit functions,” Proc. ICCV, 2019.
- B. Deng, K. Genova, S. Yazdani, S. Bouaziz, G. Hinton, and A. Tagliasacchi, “Cvxnets: Learnable convex decomposition,” arXiv preprint arXiv:1909.05736, 2019.
- T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y. Yang, “Hologan: Unsupervised learning of 3d representations from natural images,” in Proc. ICCV, 2019.
- F. Alet, A. K. Jeewajee, M. Bauza, A. Rodriguez, T. Lozano-Perez, and L. P. Kaelbling, “Graph element networks: adaptive, structured computation and memory,” in Proc. ICML, 2019.
- Y. Liu, Z. Wu, D. Ritchie, W. T. Freeman, J. B. Tenenbaum, and J. Wu, “Learning to describe scenes with programs,” in Proc. ICLR, 2019.
- A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
- G. E. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.
- D. P. Kingma and M. Welling, “Auto-encoding variational bayes.” in Proc. ICLR, 2013.
- L. Dinh, D. Krueger, and Y. Bengio, “NICE: non-linear independent components estimation,” in Proc. ICLR Workshops, 2015.
- D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in Proc. NeurIPS, 2018, pp. 10236–10245.
- A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, “Conditional image generation with pixelcnn decoders,” in Proc. NIPS, 2016, pp. 4797–4805.
- A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in Proc. ICML, 2016.
- I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NIPS, 2014.
- M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proc. ICML, 2017.
- T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” in Proc. ICLR, 2018.
- J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, “Generative visual manipulation on the natural image manifold,” in Proc. ECCV, 2016.
- A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in Proc. ICLR, 2016.
- M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
- P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. CVPR, 2017, pp. 5967–5976.
- J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017.
- K. O. Stanley, “Compositional pattern producing networks: A novel abstraction of development,” Genetic programming and evolvable machines, vol. 8, no. 2, pp. 131–162, 2007.
- A. Mordvintsev, N. Pezzotti, L. Schubert, and C. Olah, “Differentiable image parameterizations,” Distill, vol. 3, no. 7, p. e12, 2018.
- X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision,” in Proc. NIPS, 2016.
- M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Proc. NIPS, 2015.
- G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in Proc. ICANN, 2011.
- A. Yuille and D. Kersten, “Vision as Bayesian inference: analysis by synthesis?” Trends in Cognitive Sciences, vol. 10, pp. 301–308, 2006.
- T. Bever and D. Poeppel, “Analysis by synthesis: A (re-)emerging program of research for language and vision,” Biolinguistics, vol. 4, no. 2, pp. 174–200, 2010.
- T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, “Deep convolutional inverse graphics network,” in Proc. NIPS, 2015.
- J. Yang, S. Reed, M.-H. Yang, and H. Lee, “Weakly-supervised disentangling with recurrent transformations for 3d view synthesis,” in Proc. NIPS, 2015.
- T. D. Kulkarni, P. Kohli, J. B. Tenenbaum, and V. K. Mansinghka, “Picture: A probabilistic programming language for scene perception,” in Proc. CVPR, 2015.
- H. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki, “Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired supervision,” in Proc. ICCV, 2017.
- Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras, “Neural face editing with intrinsic image disentangling,” in Proc. CVPR, 2017.
- R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2003.
- J. C. Hart, “Sphere tracing: A geometric method for the antialiased ray tracing of implicit surfaces,” The Visual Computer, vol. 12, no. 10, pp. 527–545, 1996.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- D. Ha, A. Dai, and Q. V. Le, “Hypernetworks,” in Proc. ICLR, 2017.
- P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, “A 3d face model for pose and illumination invariant face recognition,” in Proc. AVSS, 2009, pp. 296–301.
- C. Tang and P. Tan, “Ba-net: Dense bundle adjustment network,” in Proc. ICLR, 2019.
- C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, 2017, pp. 1126–1135.