Neural Sparse Voxel Fields

NeurIPS 2020.

We present Neural Sparse Voxel Fields (NSVF), which consist of a set of voxel-bounded implicit fields where, for each voxel, voxel embeddings are learned to encode local properties for high-quality rendering.

Abstract:

Photo-realistic free-viewpoint rendering of real-world scenes using classical computer graphics techniques is challenging, because it requires the difficult step of capturing detailed appearance and geometry models. Recent studies have demonstrated promising results by learning scene representations that implicitly encode both geometry …
Introduction
  • Realistic rendering in computer graphics has a wide range of applications including mixed reality, visual effects, visualization, and even training data generation in computer vision and robot navigation.
  • Researchers have developed image-based rendering (IBR) approaches that combine vision-based scene geometry modeling with image-based view interpolation (Shum and Kang, 2000; Zhang and Chen, 2004; Szeliski, 2010).
  • Despite their significant progress, IBR approaches still have sub-optimal rendering quality and limited control over the results, and are often scene-type specific.
Highlights
  • Realistic rendering in computer graphics has a wide range of applications including mixed reality, visual effects, visualization, and even training data generation in computer vision and robot navigation.
  • We present Neural Sparse Voxel Fields (NSVF), which consist of a set of voxel-bounded implicit fields where, for each voxel, voxel embeddings are learned to encode local properties for high-quality rendering.
  • Existing neural scene representations and neural rendering methods commonly aim to learn a function that maps a spatial location to a feature representation that implicitly describes the local geometry and appearance of the scene, from which novel views of that scene can be synthesized using rendering techniques from computer graphics.
  • In this paper we show how hierarchical sparse volume representations can be used in a neural network-encoded implicit field of a 3D scene to enable detailed encoding and efficient, high-quality differentiable volumetric rendering, even of large-scale scenes (a minimal ray-voxel intersection sketch follows this list).
  • Our experiments show that 10k ∼ 100k sparse voxels in the NSVF representation are enough for photo-realistic rendering of complex scenes.
  • Extensive experiments show that NSVF is over 10 times faster than the state of the art while achieving better quality.
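The sketch below illustrates the ray-voxel intersection idea behind this efficiency claim. It is a minimal NumPy slab test over a set of axis-aligned voxels, not the authors' octree-based implementation; the voxel centres and sizes are illustrative placeholders.

```python
# Minimal sketch (not the authors' code): a slab test that intersects one ray
# with a set of sparse, axis-aligned voxels. Only samples inside intersected
# voxels would be passed to the implicit field, so empty space is skipped.
import numpy as np

def ray_voxel_intersections(origin, direction, voxel_centers, voxel_size):
    """Return (hit_mask, t_near, t_far) of one ray against N axis-aligned voxels."""
    half = voxel_size / 2.0
    safe_dir = np.where(np.abs(direction) > 1e-9, direction, 1e-9)
    t0 = (voxel_centers - half - origin) / safe_dir   # per-axis entry, (N, 3)
    t1 = (voxel_centers + half - origin) / safe_dir   # per-axis exit,  (N, 3)
    t_near = np.minimum(t0, t1).max(axis=-1)          # latest entry over axes
    t_far = np.maximum(t0, t1).min(axis=-1)           # earliest exit over axes
    hit = t_far > np.maximum(t_near, 0.0)             # non-empty overlap in front
    return hit, t_near, t_far

# Toy usage: a single ray against 1000 randomly placed voxels of edge 0.1.
origin = np.zeros(3)
direction = np.array([0.0, 0.0, 1.0])
centers = np.random.uniform(-1.0, 1.0, size=(1000, 3)) + np.array([0.0, 0.0, 2.0])
hit, t_near, t_far = ray_voxel_intersections(origin, direction, centers, 0.1)
print(f"{int(hit.sum())} of {len(centers)} voxels intersected")
```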
Methods
  • The authors evaluate the proposed NSVF on six datasets. The authors provide qualitative and quantitative comparisons to three recent methods on four datasets and show the results on several challenging tasks including multi-scene learning, rendering of dynamic and large-scale indoor scenes, and scene editing and composition.
  • For the BlendedMVS data, the rendered images are blended with the real images to obtain realistic ambient lighting, at a resolution of 768 × 576.
Results
  • Quality Comparison: The authors show the qualitative comparisons in Figure 4.
  • NSVF can achieve photo-realistic results on various kinds of scenes with complex geometry, thin structures and lighting effects.
  • Note that NSVF with early termination (ε = 0.01) produces almost the same quality as NSVF without early termination.
  • This indicates that early termination does not cause noticeable quality degradation while significantly accelerating computation, as the speed comparison shows (a minimal early-termination sketch follows this list).
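As a rough illustration (a sketch under assumed inputs, not the released implementation), the loop below composites colour front to back along a ray and stops querying further samples once the accumulated transmittance drops below the threshold ε = 0.01:

```python
# Minimal sketch of front-to-back volume compositing with early termination.
# sigmas/rgbs stand in for densities and colours predicted by the field at the
# sample points of one ray; deltas are the distances between adjacent samples.
import torch

def composite_with_early_termination(sigmas, rgbs, deltas, eps=0.01):
    """sigmas: (S,), rgbs: (S, 3), deltas: (S,). Returns the composited colour."""
    color = torch.zeros(3)
    transmittance = torch.tensor(1.0)
    for sigma, rgb, delta in zip(sigmas, rgbs, deltas):
        alpha = 1.0 - torch.exp(-sigma * delta)        # opacity of this segment
        color = color + transmittance * alpha * rgb    # front-to-back compositing
        transmittance = transmittance * (1.0 - alpha)
        if transmittance < eps:                        # remaining contribution is
            break                                      # negligible: stop early
    return color

# Toy usage with 64 random samples along a single ray.
sigmas = torch.rand(64) * 5.0
rgbs = torch.rand(64, 3)
deltas = torch.full((64,), 0.05)
print(composite_with_early_termination(sigmas, rgbs, deltas))
```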
Conclusion
  • The authors propose NSVF, a hybrid neural scene representation for fast and high-quality free-viewpoint rendering.
  • The proposed representation enables much faster rendering than the state of the art, as well as more convenient scene editing and compositing.
  • This new approach to 3D scene modeling and rendering from images complements and partially improves over established computer graphics concepts, and opens up new possibilities in many applications, such as mixed reality, visual effects, and training data generation for computer vision tasks.
  • At the same time it shows new ways to learn spatially-aware scene representations of potential relevance in other domains, such as scene understanding, object recognition, robot navigation, or training data generation for image-based reconstruction.
Tables
  • Table 1: The quantitative comparisons on test sets of four datasets. We use three metrics: PSNR (↑), SSIM (↑) and LPIPS (↓) (Zhang et al, 2018) to evaluate the rendering quality. Scores are averaged over the testing images of all scenes, and we present the per-scene breakdown results in the Appendix. By default, NSVF is executed with early termination (ε = 0.01). We also show results without early termination (ε = 0), denoted as NSVF0. (A short sketch of computing these metrics follows this list.)
  • Table 2: Ablation for progressive training.
  • Table 3: Detailed breakdown of quantitative metrics of individual scenes for all 4 datasets for our method and 3 baselines. All scores are averaged over the testing images.
  • Table 4: Effect of voxel size on the Wineholder test set.
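For reference, the three metrics in Table 1 can be reproduced with standard packages. The snippet below is a sketch assuming float RGB images in [0, 1]; it uses scikit-image for PSNR/SSIM and the lpips package for LPIPS, not the authors' evaluation code.

```python
# Minimal sketch: PSNR / SSIM / LPIPS between a rendered and a ground-truth
# image, both float32 arrays of shape (H, W, 3) with values in [0, 1].
import numpy as np
import torch
import lpips                                  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")            # AlexNet-based LPIPS model

def evaluate(pred, gt):
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None] * 2.0 - 1.0
    lp = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return psnr, ssim, lp

# Toy usage with random images; higher PSNR/SSIM and lower LPIPS are better.
pred = np.random.rand(64, 64, 3).astype(np.float32)
gt = np.random.rand(64, 64, 3).astype(np.float32)
print(evaluate(pred, gt))
```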
Related work
  • Neural Rendering: Recent works have shown impressive results by replacing or augmenting traditional graphics rendering with neural networks, which is typically referred to as neural rendering. We refer the reader to recent surveys on neural rendering (Tewari et al, 2020; Kato et al, 2020).

    • Novel View Synthesis with 3D inputs: DeepBlending (Hedman et al, 2018) predicts blending weights for the image-based rendering on a geometric proxy. Other methods (Thies et al, 2019; Kim et al, 2018; Liu et al, 2019a, 2020; Meshry et al, 2019; Martin Brualla et al, 2018; Aliev et al, 2019) first render a given geometry with explicit or neural textures into coarse RGB images or feature maps which are then translated into high-quality images. However, these works need 3D geometry as input and the performance would be affected by the quality of the geometry.

    • Novel View Synthesis without 3D inputs: Other approaches learn scene representations for novel-view synthesis from 2D images. Generative Query Networks (GQN) (Eslami et al, 2018) learn a vectorized embedding of a 3D scene and render it from novel views. However, they do not learn geometric scene structure as explicitly as NSVF, and their renderings are rather coarse. Follow-up works learned more 3D-structure-aware representations and accompanying renderers (Flynn et al, 2016; Zhou et al, 2018; Mildenhall et al, 2019) with Multiplane Images (MPIs) as proxies, which only render a restricted range of novel views interpolating the input views. RenderNet (Nguyen-Phuoc et al, 2018) and its follow-up works (Nguyen-Phuoc et al, 2019; Liu et al, 2019c) use a CNN-based decoder for differentiable rendering of a scene represented as coarse-grained voxel grids. However, this CNN-based decoder cannot ensure view consistency due to its 2D convolution kernels. To enforce view consistency, another line of research (Lombardi et al, 2019; Sitzmann et al, 2019a) uses classical rendering techniques for differentiable rendering of the learnt scene represented as fine-grained voxel grids, which makes the scene structure more explicit but limits the achievable spatial resolution. SRN (Sitzmann et al, 2019b) and NeRF (Mildenhall et al, 2020) introduce a neural implicit function to model the entire scene. However, their results are either blurry or suffer from a slow rendering process. In addition, these approaches do not easily permit scene editing and composition. The proposed NSVF allows efficient and higher-quality novel view synthesis even of larger scenes, and enables scene editing and composition.
Funding
  • This work was partially supported by ERC Consolidator Grant 770784 and Lise Meitner Postdoctoral Fellowship
Study subjects and analysis
  • Datasets (6): We evaluate the proposed NSVF on six datasets. We provide qualitative and quantitative comparisons to three recent methods on four of the datasets, and show our results on several challenging tasks including multi-scene learning, free-viewpoint rendering of a moving human, large-scale indoor scene rendering, and scene editing and composition. An illustration of self-pruning and progressive training is shown in Figure 3.
  • Speed comparison (4 datasets): We provide speed comparisons on the models of four datasets in Figure 5, where we merge the results of Synthetic-NeRF and Synthetic-NSVF into the same plot since their image sizes are the same. For our method, the average rendering time is correlated with the average ratio of foreground to background, as shown in Figure 5 (a)-(c); a small sketch of computing this ratio follows this list. Early termination (ε = 0.01) does not cause noticeable quality degradation while significantly accelerating computation.
  • Quantitative evaluation (4 datasets): The quantitative comparisons on the test sets of four datasets use PSNR (↑), SSIM (↑) and LPIPS (↓) (Zhang et al, 2018), with scores averaged over the testing images of all scenes; see Table 1, the per-scene breakdown in Table 3, and the ablations on progressive training and voxel size in Tables 2 and 4.
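A small sketch (assumed inputs, not the authors' evaluation code) of the foreground-to-background ratio that the rendering time is reported to correlate with, computed from binary object masks:

```python
# Minimal sketch: average foreground-to-background ratio of a test set,
# estimated from binary object masks (True where the object is visible).
import numpy as np

def average_fg_bg_ratio(masks):
    ratios = []
    for mask in masks:
        fg = mask.mean()                       # fraction of foreground pixels
        ratios.append(fg / max(1.0 - fg, 1e-8))
    return float(np.mean(ratios))

# Toy usage with random masks standing in for the labelled object masks.
masks = [np.random.rand(480, 640) > 0.7 for _ in range(8)]
print(f"average foreground/background ratio: {average_fg_bg_ratio(masks):.3f}")
```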

Reference
  • Kara-Ali Aliev, Dmitry Ulyanov, and Victor Lempitsky. 2019. Neural point-based graphics. arXiv preprint arXiv:1906.08240.
  • Z. Chen and H. Zhang. 2019. Learning implicit fields for generative shape modeling. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5932–5941.
  • Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE.
  • SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. 2018. Neural scene representation and rendering. Science, 360(6394):1204–1210.
  • John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. 2019. Deepview: View synthesis with learned gradient descent. International Conference on Computer Vision and Pattern Recognition (CVPR).
  • John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. 2016. Deepstereo: Learning to predict new views from the world’s imagery. In Computer Vision and Pattern Recognition (CVPR).
  • David Ha, Andrew Dai, and Quoc Le. 2016. Hypernetworks.
  • Eric Haines. 1989. Essential Ray Tracing Algorithms, page 33–77. Academic Press Ltd., GBR.
  • Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. 2018. Deep blending for free-viewpoint image-based rendering. ACM Trans. Graph., 37(6):257:1– 257:15.
  • Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. 2020. Sdfdiff: Differentiable rendering of signed distance fields for 3d shape optimization. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Hiroharu Kato, Deniz Beker, Mihai Morariu, Takahiro Ando, Toru Matsuoka, Wadim Kehl, and Adrien Gaidon. 2020. Differentiable rendering: A survey. arXiv preprint arXiv:2006.12057.
  • Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollöfer, and Christian Theobalt. 2018. Deep video portraits. ACM Transactions on Graphics (TOG), 37.
  • Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. 2017. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4).
  • Samuli Laine and Tero Karras. 2010. Efficient sparse voxel octrees–analysis, extensions, and implementation.
  • Lingjie Liu, Weipeng Xu, Marc Habermann, Michael Zollhöfer, Florian Bernard, Hyeongwoo Kim, Wenping Wang, and Christian Theobalt. 2020. Neural human video rendering by learning dynamic textures and rendering-to-video translation. IEEE Transactions on Visualization and Computer Graphics, PP:1–1.
  • Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Hyeongwoo Kim, Florian Bernard, Marc Habermann, Wenping Wang, and Christian Theobalt. 2019a. Neural rendering and reenactment of human actor videos. ACM Transactions on Graphics (TOG).
  • Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. 2019b. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. arXiv preprint arXiv:1911.13225.
  • Shichen Liu, Weikai Chen, Tianye Li, and Hao Li. 2019c. Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction. arXiv preprint arXiv:1901.05567.
  • Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. 2019d. Learning to infer implicit surfaces without 3d supervision. In Advances in Neural Information Processing Systems, pages 8295–8306.
  • Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. 2019. Neural volumes: Learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4):65.
  • Ricardo Martin-Brualla, Peter Lincoln, Adarsh Kowdle, Christoph Rhemann, Dan Goldman, Cem Keskin, Steve Seitz, Shahram Izadi, Sean Fanello, Rohit Pandey, Shuoran Yang, Pavel Pidlypenskyi, Jonathan Taylor, Julien Valentin, Sameh Khamis, Philip Davidson, and Anastasia Tkach. 2018. LookinGood: Enhancing performance capture with real-time neural re-rendering. ACM Transactions on Graphics (TOG), 37.
  • Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. 2019. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
  • Moustafa Meshry, Dan B. Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. 2019. Neural rerendering in the wild. In Computer Vision and Pattern Recognition (CVPR).
  • Mateusz Michalkiewicz, Jhony K. Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders Eriksson. 2019. Implicit surface representations as layers in neural networks. In The IEEE International Conference on Computer Vision (ICCV).
  • Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. 2019. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14.
  • Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2020. Nerf: Representing scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934.
  • Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. 2019. Hologan: Unsupervised learning of 3d representations from natural images. In Proceedings of the IEEE International Conference on Computer Vision, pages 7588–7597.
  • Thu H Nguyen-Phuoc, Chuan Li, Stephen Balaban, and Yongliang Yang. 2018. Rendernet: A deep convolutional network for differentiable rendering from 3d shapes. In Advances in Neural Information Processing Systems (NIPS).
  • Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. 2019. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. arXiv preprint arXiv:1912.07372.
  • Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019. Deepsdf: Learning continuous signed distance functions for shape representation. International Conference on Computer Vision and Pattern Recognition (CVPR).
  • Songyou Peng, Michael Niemeyer, Lars M. Mescheder, Marc Pollefeys, and Andreas Geiger. 2020. Convolutional occupancy networks. ArXiv, abs/2003.04618.
  • Steven M Rubin and Turner Whitted. 1980. A 3-dimensional representation for fast rendering of complex scenes. In Proceedings of the 7th annual conference on Computer graphics and interactive techniques, pages 110–116.
  • Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. 2019. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2304–2314.
  • Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. 2020. Pifuhd: Multi-level pixelaligned implicit function for high-resolution 3d human digitization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Harry Shum and Sing Bing Kang. 2000. Review of image-based rendering techniques. In Visual Communications and Image Processing 2000, volume 4067, pages 2 – 13. International Society for Optics and Photonics, SPIE.
  • Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Niessner, Gordon Wetzstein, and Michael Zollhofer. 2019a. Deepvoxels: Learning persistent 3d feature embeddings. In Computer Vision and Pattern Recognition (CVPR).
  • Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. 2019b. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, pages 1119–1130.
  • Richard Szeliski. 2010. Computer vision: algorithms and applications. Springer Science & Business Media.
  • A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi, K. Sunkavalli, R. Martin-Brualla, T. Simon, J. Saragih, M. Nießner, R. Pandey, S. Fanello, G. Wetzstein, J.-Y. Zhu, C. Theobalt, M. Agrawala, E. Shechtman, D. B Goldman, and M. Zollhöfer. 2020. State of the Art on Neural Rendering. Computer Graphics Forum (EG STAR 2020).
  • Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics, 38.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. 2016. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in neural information processing systems, pages 1696–1704.
  • Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. 2020. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. Computer Vision and Pattern Recognition (CVPR).
  • Cha Zhang and Tsuhan Chen. 2004. A survey on image-based rendering—representation, sampling and compression. Signal Processing: Image Communication, 19(1):1–28.
  • Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.
  • Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. 2018. Stereo magnification: Learning view synthesis using multiplane images. In SIGGRAPH.
  • Synthetic-NeRF. We use the NeRF (Mildenhall et al., 2020) synthetic dataset which includes eight objects rendered with path tracing. Each object is rendered to produce 100 views for training and 200 for testing at 800 × 800 pixels.
  • Scene assets (Blendswap):
    – Robot (CC-BY-SA): https://www.blendswap.com/blend/10597
    – Bike (CC-BY): https://www.blendswap.com/blend/8850
    – Palace (CC-BY-NC-SA): https://www.blendswap.com/blend/14878
    – Spaceship (CC-BY): https://www.blendswap.com/blend/5349
    – Lifestyle (CC-BY): https://www.blendswap.com/blend/8909
  • BlendedMVS. We test on four objects of a recent synthetic MVS dataset, BlendedMVS (Yao et al., 2020) 3. The rendered images are blended with the real images to have realistic ambient lighting. The image resolution is 768 × 576. One eighth of the images are held out as test sets.
  • Tanks & Temples. We evaluate on five objects of the Tanks and Temples (Knapitsch et al., 2017) real-scene dataset 4. We label the object masks ourselves with the Altizure software 5, and sample one eighth of the images for testing. The image resolution is 1920 × 1080.
  • ScanNet. We use two real scenes of ScanNet (Dai et al., 2017) 6, an RGB-D video dataset of large-scale indoor scenes. We extract both the RGB and depth images, of which we randomly sample 20% as the training set and use the rest for testing. Images are scaled to 640 × 480.
  • Architecture. The proposed model assigns a 32-dimensional learnable voxel embedding to each vertex, and applies positional encoding with maximum frequency L = 6 (Mildenhall et al., 2020) to the feature embedding aggregated from the eight voxel embeddings of the corresponding voxel via trilinear interpolation. As a comparison, we also train our model without positional encoding, where we set the voxel embedding dimension to d = 416 in order to have feature vectors comparable to the complete model. We use around 1000 initial voxels for each scene. The final number of voxels after pruning and progressive training varies from 10k to 100k (the exact number differs scene by scene due to varying sizes and shapes), with an effective number of 0.32 ∼ 3.2M learnable parameters in our default voxel embedding settings.
  • The overall network architecture of our default model is illustrated in Figure 10, with ∼ 0.5M parameters, not including voxel embeddings. Note that our implementation of the MLP is slightly shallower than in many existing works (Sitzmann et al., 2019b; Niemeyer et al., 2019; Mildenhall et al., 2020). By utilizing the voxel embeddings to store local information in a distributed way, we argue that a small MLP is sufficient to gather voxel information and make accurate predictions (a minimal sketch of this voxel-embedding query follows the footnote links below).
  • Footnote links:
    3 https://github.com/YoYo000/BlendedMVS
    4 https://tanksandtemples.org/download/
    5 https://github.com/altizure/altizure-sdk-offline
    6 http://www.scan-net.org/
    7 https://github.com/pytorch/fairseq
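A minimal sketch of the voxel-embedding query described above, assuming one voxel with its eight corner embeddings given explicitly; the sparse octree bookkeeping, the view-direction handling, and the exact layer widths are simplified placeholders rather than the released architecture.

```python
# Hedged sketch: gather the eight corner embeddings of the voxel containing a
# query point, trilinearly interpolate them, apply positional encoding with
# maximum frequency L = 6, and feed the result to a small MLP predicting
# density and colour. Shapes and widths are illustrative.
import torch
import torch.nn as nn

D_EMB, L_FREQ = 32, 6

def positional_encoding(x, num_freqs=L_FREQ):
    """Concatenate x with [sin(2^k x), cos(2^k x)] for k = 0..num_freqs-1."""
    out = [x]
    for k in range(num_freqs):
        out += [torch.sin((2.0 ** k) * x), torch.cos((2.0 ** k) * x)]
    return torch.cat(out, dim=-1)

def query_voxel_feature(p, voxel_origin, voxel_size, corner_embeddings):
    """Trilinear interpolation of the 8 corner embeddings of one voxel.

    p: (3,) point inside the voxel; corner_embeddings: (2, 2, 2, D_EMB)."""
    w = (p - voxel_origin) / voxel_size                # local coords in [0, 1]^3
    feat = torch.zeros(D_EMB)
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                weight = ((w[0] if i else 1 - w[0]) *
                          (w[1] if j else 1 - w[1]) *
                          (w[2] if k else 1 - w[2]))
                feat = feat + weight * corner_embeddings[i, j, k]
    return feat

in_dim = D_EMB * (2 * L_FREQ + 1) + 3                  # encoded feature + view dir
mlp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                    nn.Linear(128, 4))                 # -> (sigma, r, g, b)

# Toy usage for a single sample point and viewing direction.
corners = torch.randn(2, 2, 2, D_EMB)                  # learnable in practice
p = torch.tensor([0.3, 0.7, 0.1])
view_dir = torch.tensor([0.0, 0.0, 1.0])
feat = positional_encoding(query_voxel_feature(p, torch.zeros(3), 1.0, corners))
sigma_rgb = mlp(torch.cat([feat, view_dir]))
print(sigma_rgb.shape)                                 # torch.Size([4])
```

Note that with D_EMB = 32 and L = 6 the encoded feature has 32 × (1 + 2·6) = 416 dimensions, which matches the d = 416 used above for the no-positional-encoding comparison.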