Vid2Actor: Free-viewpoint Animatable Person Synthesis from Video in the Wild

Abstract:

Given an "in-the-wild" video of a person, we reconstruct an animatable model of the person in the video. The output model can be rendered in any body pose to any camera view, via the learned controls, without explicit 3D mesh reconstruction. At the core of our method is a volumetric 3D human representation reconstructed with a deep network...

Introduction
  • Modeling and rendering humans is a core technology needed to enable applications in sports visualization, telepresence, shopping, and many others.
  • 3D reconstruction approaches require training on synthetic data, ground-truth meshes, or 3D scans [30, 31, 3, 12, 17, 42], while image-to-image translation methods focus only on 2D synthesis [6, 38].
  • The authors' method approaches both challenges by reconstructing a volumetric representation of a person that can be animated and rendered in any pose for any viewpoint (a minimal rendering sketch follows this list).
  • An important contribution of the paper is the design of a representation and a corresponding deep network that enable generalization to arbitrary poses and views given these images.
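The rendering side of such a volumetric representation can be illustrated with standard emission-absorption compositing along camera rays, as used by volume-based renderers such as Neural Volumes [22] and NeRF [29]. The sketch below is a hypothetical minimal illustration (the function name render_ray and its arguments are ours, not the paper's); the paper's exact sampling and compositing details may differ.

```python
import numpy as np

def render_ray(rgb_sigma, deltas):
    """Emission-absorption compositing of one camera ray (hypothetical sketch).

    rgb_sigma: (S, 4) color + density samples, e.g. obtained by querying the
               posed person volume at points along the ray.
    deltas:    (S,) distances between consecutive samples along the ray.
    Returns the composited RGB color of the pixel this ray belongs to.
    """
    rgb, sigma = rgb_sigma[:, :3], rgb_sigma[:, 3]
    alpha = 1.0 - np.exp(-sigma * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance to each sample
    weights = alpha * trans                                        # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)
```

Changing the body pose changes the volume the rays sample, while the viewpoint controls the rays themselves; this is what makes the representation renderable in any pose from any view.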
Highlights
  • Modeling and rendering humans is a core technology needed to enable applications in sports visualization, telepresence, shopping, and many others
  • While a lot of research has been done for calibrated data, e.g., multi-view laboratory setups [32, 21, 20], synthesizing any person just from data “in the wild” is still a challenge. 3D reconstruction approaches require training on synthetic data, ground-truth meshes, or 3D scans [30, 31, 3, 12, 17, 42], while image-to-image translation methods focus only on 2D synthesis [6, 38].
  • Our method approaches both challenges by reconstructing a volumetric representation of a person that can be animated and rendered in any pose for any viewpoint
  • We find that the canonical volume is similar to what we get when training on multi-view data of just that pose, and the distribution of the motion weight volume aligns well with the body parts (see the sketch after this list).
  • The learned model is able to generalize to unseen poses and views, and the learned canonical and motion weight volumes perform well for both Internet and synthetic data
  • More frames lead to sharper results
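A minimal sketch of how such a canonical-volume / motion-weight decomposition could be queried is shown below. This is an assumption-laden illustration rather than the paper's implementation: the names (sample_posed_volume, grid_to_idx) are hypothetical, nearest-neighbour lookups are used for brevity, and the space in which the weights are looked up and normalized may differ from the paper.

```python
import numpy as np

def sample_posed_volume(x_posed, canonical_vol, weight_vol, bone_transforms, grid_to_idx):
    """Query color/density for points in the posed space (hypothetical sketch).

    x_posed:         (N, 3) query points in the posed (observation) space.
    canonical_vol:   (D, H, W, 4) RGB + density volume in the canonical pose.
    weight_vol:      (D, H, W, K) per-bone motion (skinning) weights.
    bone_transforms: (K, 4, 4) canonical-to-posed rigid transform per bone (assumed given).
    grid_to_idx:     hypothetical helper mapping 3D points to integer voxel indices.
    """
    # Look up motion weights at the query points (nearest neighbour for brevity;
    # the paper may define and normalize the weights in a different space).
    idx = grid_to_idx(x_posed)                                # (N, 3) voxel indices
    w = weight_vol[idx[:, 0], idx[:, 1], idx[:, 2]]           # (N, K)
    w = w / np.clip(w.sum(-1, keepdims=True), 1e-8, None)

    # Warp each point back to canonical space with a weight-blended inverse bone
    # transform, then sample color and density from the canonical volume.
    x_h = np.concatenate([x_posed, np.ones((len(x_posed), 1))], axis=-1)   # (N, 4)
    inv_T = np.linalg.inv(bone_transforms)                                  # (K, 4, 4)
    x_can = np.einsum('nk,kij,nj->ni', w, inv_T, x_h)[:, :3]                # (N, 3)
    idx_c = grid_to_idx(x_can)
    return canonical_vol[idx_c[:, 0], idx_c[:, 1], idx_c[:, 2]]             # (N, 4) RGB + density
```

Because appearance lives in a single canonical volume and only the motion weights and bone transforms depend on pose, the same person can be re-posed without re-learning appearance.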
Methods
  • The authors compare their results to related baseline methods, present ablation studies, and show additional results.
  • Comparison with baselines: The authors compare with two baseline methods to justify the design choices in Sec. 3.
  • The authors use synthetic data for the comparison, as it provides clean training signals to all the methods.
  • The authors use a skeleton map, κ, as an input, obtained by projecting the 3D body pose ρi onto the image plane of camera ei.
  • The authors represent κ as a stack of layers, where a layer is an image representing a body bone.
  • To keep 3D information in κ, the authors rasterize each bone into its layer by linearly interpolating the z-values of its endpoints (a minimal sketch follows this list).
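A minimal sketch of this layered skeleton-map rasterization is given below, under the assumption that joints are expressed in camera coordinates and a pinhole intrinsics matrix K is available; the function name and its arguments are hypothetical.

```python
import numpy as np

def rasterize_skeleton_map(joints_3d, bones, K, H, W, n_samples=256):
    """Rasterize each bone into its own layer with depth values (hypothetical sketch).

    joints_3d: (J, 3) joint positions in camera coordinates (z > 0 in front of the camera).
    bones:     list of (parent, child) joint-index pairs; one layer per bone.
    K:         (3, 3) pinhole camera intrinsics.
    Returns kappa of shape (num_bones, H, W); each layer holds the bone's z-values,
    linearly interpolated between its two endpoints.
    """
    kappa = np.zeros((len(bones), H, W), dtype=np.float32)
    for layer, (a, b) in enumerate(bones):
        p0, p1 = joints_3d[a], joints_3d[b]
        # Sample points along the bone; their z-coordinates are exactly the
        # linear interpolation of the endpoint depths.
        ts = np.linspace(0.0, 1.0, n_samples)[:, None]
        pts = (1.0 - ts) * p0 + ts * p1                      # (n_samples, 3)
        uvw = (K @ pts.T).T                                   # project with the pinhole model
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        kappa[layer, v[inside], u[inside]] = pts[inside, 2]   # write interpolated depth
    return kappa
```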
Results
  • The authors show the results for Internet videos in several figures. The learned model is able to generalize to unseen poses and views, and the learned canonical and motion weight volumes (see Fig. 3) perform well for both Internet and synthetic data.
  • Even trained with a relatively small number of frames (<1,000), the model still performs reasonably well, aided by the decomposition into canonical volume and motion weights.
  • The re-targeting results can be found in Fig. 1, 3, and 4.
  • Fig. 4 shows the results of retargeting the same jumping sequence to all of the models.
Conclusion
  • Limitation: The authors' model quality depends on the diversity of poses and views in the training data.
  • Progress with MLP networks for implicit scene reconstruction [29, 33] suggests a path toward increased resolution beyond voxel grids. The authors propose a novel but simple animatable human representation, learned from only image observations, that enables person synthesis from any view with any pose.
  • The authors validate it on both synthetic data and Internet videos, and demonstrate applications such as motion retargeting and bullet-time rendering.
  • The authors' approach offers a new way to enable free-viewpoint animatable person reconstruction and rendering, providing an alternative to mesh-based representations.
Summary
  • Introduction:

    Modeling and rendering humans is a core technology needed to enable applications in sports visualization, telepresence, shopping, and many others.
  • 3D reconstruction approaches require training on synthetic data, ground-truth meshes, or 3D scans [30, 31, 3, 12, 17, 42], while image-to-image translation methods focus only on 2D synthesis [6, 38].
  • The authors' method approaches both challenges by reconstructing a volumetric representation of a person that can be animated and rendered in any pose for any viewpoint.
  • An important contribution of the paper is the design of a representation and a corresponding deep network that enable generalization to arbitrary poses and views given these images
  • Objectives:

    The authors' goal is to build an image synthesis function Irender(ρ, e) that renders the person in any pose ρ from any viewpoint e.
  • Concretely, the authors aim to solve an optimization problem that fits Irender to the observed video frames (a plausible form is sketched below).
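A plausible form of this objective, written under the assumption that the training signal is per-frame image reconstruction against the observed frames I_i with estimated poses ρ_i and cameras e_i, and that θ collects the learned volumes and network weights (the paper may add masks or additional loss terms):

```latex
\min_{\theta} \; \sum_{i} \left\| I_{\mathrm{render}}(\rho_i, e_i;\, \theta) - I_i \right\|^{2}
```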
  • Methods:

    The authors compare their results to related baseline methods, present ablation studies, and show additional results.
  • Comparison with baselines: The authors compare with two baseline methods to justify the design choices in Sec. 3.
  • The authors use synthetic data for the comparison, as it provides clean training signals to all the methods.
  • The authors use a skeleton map, κ, as an input, obtained by projecting the 3D body pose ρi onto the image plane of camera ei.
  • The authors represent κ as a stack of layers, where a layer is an image representing a body bone.
  • To keep 3D information in κ, the authors rasterize each bone into its layer by linearly interpolating the z-values of its endpoints.
  • Results:

    The authors show the results for Internet videos in several figures. The learned model is able to generalize to unseen poses and views, and the learned canonical and motion weight volumes (see Fig. 3) perform well for both Internet and synthetic data.
  • Even trained with a relatively small number of frames (<1,000), the model still performs reasonably well, aided by the decomposition into canonical volume and motion weights.
  • The re-targeting results can be found in Fig. 1, 3, and 4.
  • Fig. 4 shows the results of retargeting the same jumping sequence to all of the models.
  • Conclusion:

    Limitation: The authors' model quality depends on the diversity of poses and views in the training data.
  • Progress with MLP networks for implicit scene reconstruction [29, 33] suggests a path toward increased resolution beyond voxel grids. The authors propose a novel but simple animatable human representation, learned from only image observations, that enables person synthesis from any view with any pose.
  • The authors validate it on both synthetic data and Internet videos, and demonstrate applications such as motion retargeting and bullet-time rendering.
  • The authors' approach offers a new way to enable free-viewpoint animatable person reconstruction and rendering, providing an alternative to mesh-based representations.
Tables
  • Table 1: Quantitative results compared to two baselines.
Related work
  • Image-to-image translation: Recent advances in image-to-image translation [13] have shown convincing rendering performance for motion retargeting in 2D [6, 39, 38, 24, 9, 4]. The idea is to directly learn a mapping function from 2D skeleton images to the rendered output. These works are usually not robust to significant view changes due to the lack of 3D reasoning. Follow-up works [32] and [20, 21] try to overcome this by introducing 3D representations, such as texture maps or pre-built, rigged character models, but both require a multi-view laboratory setup. Another interesting direction [41] enables retargeting via video sprite rearrangement, but still lacks explicit view controls.

    Novel view synthesis: Novel view synthesis (NVS) is an active area that has recently achieved exceptional rendering quality [36, 28, 29, 35, 10, 22, 34, 27, 19]. In particular, our work is closely related to methods that use volumes as intermediate representations [29, 34, 22]. Most of these works focus on static scenes [29, 34]; [22] provides a way to control dynamic scenes implicitly via learning and traversing a latent space. Our work can be thought of as an extension of volume-based approaches to NVS from static scenes to poseable humans with explicit controls. Our representation builds on [22]: they use volume warping fields to improve volume resolution when reconstructing scenes captured with a multi-view rig, while we leverage this idea to constrain the solution space for reconstructing poseable humans in single-view videos.
Funding
  • This work was supported by the UW Reality Lab, Facebook, Google, Futurewei, and Amazon
References
  • Thiemo Alldieck, Marcus Magnor, Bharat Lal Bhatnagar, Christian Theobalt, and Gerard Pons-Moll. Learning to reconstruct people in clothing from a single rgb camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1175–1186, 2019. 2
  • Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Video based reconstruction of 3d people models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8387–8397, 2018. 2, 7
  • Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, and Marcus Magnor. Tex2shape: Detailed full human body geometry from a single image. In Proceedings of the IEEE International Conference on Computer Vision, pages 2293–2303, 2019. 1, 2
  • Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. Synthesizing images of humans in unseen poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8340–8348, 2018. 2
  • Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: realtime multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008, 2018. 5
  • Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. In Proceedings of the IEEE International Conference on Computer Vision, pages 5933–5942, 2019. 1, 2
  • Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE international conference on computer vision, pages 1511–1520, 2017.
  • CMU. CMU Graphics Lab Motion Capture Database, 2007. 7
  • Patrick Esser, Ekaterina Sutter, and Bjorn Ommer. A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8857–8866, 2018. 2
  • John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient descent. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2367–2376, 2019. 2
  • Kaiwen Guo, Peter Lincoln, Philip Davidson, Jay Busch, Xueming Yu, Matt Whalen, Geoff Harvey, Sergio Orts-Escolano, Rohit Pandey, Jason Dourgarian, et al. The relightables: Volumetric performance capture of humans with realistic relighting. ACM Transactions on Graphics (TOG), 38(6):1–19, 2019. 8
  • Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, and Tony Tung. Arch: Animatable reconstruction of clothed humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3093–3102, 2020. 1, 2
  • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017. 2
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9799–9808, 2020. 5
  • Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In Proceedings of the IEEE International Conference on Computer Vision, pages 2252–2261, 2019. 5
  • Verica Lazova, Eldar Insafutdinov, and Gerard Pons-Moll. 360-degree textures of people in clothing from a single image. In 2019 International Conference on 3D Vision (3DV), pages 643–653. IEEE, 2019. 1, 2
  • John P Lewis, Matt Cordner, and Nickson Fong. Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 165–172, 2000. 3
  • Zhengqi Li, Wenqi Xian, Abe Davis, and Noah Snavely. Crowdsampling the plenoptic function. arXiv preprint arXiv:2007.15194, 2020. 2
  • Lingjie Liu, Weipeng Xu, Marc Habermann, Michael Zollhoefer, Florian Bernard, Hyeongwoo Kim, Wenping Wang, and Christian Theobalt. Neural human video rendering by learning dynamic textures and rendering-to-video translation. IEEE Transactions on Visualization and Computer Graphics, 2020. 1, 2
  • Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Hyeongwoo Kim, Florian Bernard, Marc Habermann, Wenping Wang, and Christian Theobalt. Neural rendering and reenactment of human actor videos. ACM Transactions on Graphics (TOG), 38(5):1–14, 2019. 1, 2
  • Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019. 1, 2, 3, 5
  • Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multiperson linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015. 4
  • Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In Advances in neural information processing systems, pages 406–416, 2017. 2
  • Abhimitra Meka, Christian Haene, Rohit Pandey, Michael Zollhofer, Sean Fanello, Graham Fyffe, Adarsh Kowdle, Xueming Yu, Jay Busch, Jason Dourgarian, et al. Deep reflectance fields: high-quality facial reflectance field inference from color gradient illumination. ACM Transactions on Graphics (TOG), 38(4):1–12, 2019. 8
  • Abhimitra Meka, Rohit Pandey, Christian Haene, Sergio Orts-Escolano, Peter Barnum, Philip Davidson, Daniel Erickson, Yinda Zhang, Jonathan Taylor, Sofien Bouaziz, Chloe Legendre, Wan-Chun Ma, Ryan Overbeck, Thabo Beeler, Paul Debevec, Shahram Izadi, Christian Theobalt, Christoph Rhemann, and Sean Fanello. Deep relightable textures - volumetric performance capture with neural rendering. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia), 39(6), December 2020. 8
  • Moustafa Meshry, Dan B Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. Neural rerendering in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6878–6887, 2019. 2
  • Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019. 2
  • Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934, 2020. 2, 8
  • Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE International Conference on Computer Vision, pages 2304–2314, 2019. 1, 2, 8
  • Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 84–93, 2020. 1, 2, 8
  • Aliaksandra Shysheya, Egor Zakharov, Kara-Ali Aliev, Renat Bashirov, Egor Burkov, Karim Iskakov, Aleksei Ivakhnenko, Yury Malkov, Igor Pasechnik, Dmitry Ulyanov, et al. Textured neural avatars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2397, 2019. 1, 2
  • Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33, 2020. 8
  • Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. Deepvoxels: Learning persistent 3d feature embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019. 2
  • Pratul P Srinivasan, Richard Tucker, Jonathan T Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 175–184, 2019. 2
  • Justus Thies, Michael Zollhofer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 38(4):1–12, 2019. 2
  • Gul Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. Bodynet: Volumetric inference of 3d human body shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–36, 2018. 2
  • Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-tovideo synthesis. arXiv preprint arXiv:1808.06601, 2018. 1, 2
  • Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018. 2, 6, 7
  • Chung-Yi Weng, Brian Curless, and Ira Kemelmacher-Shlizerman. Photo wake-up: 3d character animation from a single photo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5908–5917, 2019. 2
  • Haotian Zhang, Cristobal Sciutto, Maneesh Agrawala, and Kayvon Fatahalian. Vid2player: Controllable video sprites that behave and appear like professional tennis players. arXiv preprint arXiv:2008.04524, 2020. 2
  • Luyang Zhu, Konstantinos Rematas, Brian Curless, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Reconstructing nba players. In European Conference on Computer Vision, pages 177–194. Springer, 2020. 1, 2