Layered Neural Rendering for Retiming People in Video

ACM Trans. Graph., pp. 1-14, 2020.


Abstract:

We present a method for retiming people in an ordinary, natural video---manipulating and editing the time in which different motions of individuals in the video occur. We can temporally align different motions, change the speed of certain actions (speeding up/slowing down, or entirely "freezing" people), or "erase" selected people from the video altogether.

Introduction
  • By manipulating the timing of people’s movements, the authors can achieve a variety of effects that can change the perception of an event recorded in a video.
  • The authors' method can “erase” selected people from the video
  • All these effects are achieved via a novel deep neural network-based model that learns a layered decomposition of the input video, which is the pillar of the method.
  • The input for layer i at time t is the sampled deep texture map T_t^i, which consists of person i’s sampled texture placed over the sampled background texture.
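    A minimal sketch of how such a per-layer input could be assembled, assuming per-frame UV coordinate maps and a learned deep-texture atlas for each person plus one for the background; all names, shapes, and the grid_sample-based lookup are illustrative assumptions, not the paper's code:

        # Sketch: build the input T_t^i for layer i at time t by sampling person i's learned
        # texture with its UV map and placing the result over the sampled background texture.
        # Assumed shapes: textures (1, C, Ht, Wt), uv maps (1, H, W, 2) in [-1, 1],
        # person_mask (1, 1, H, W) with 1 where person i's UVs are valid.
        import torch
        import torch.nn.functional as F

        def layer_input(person_texture, person_uv, person_mask, bg_texture, bg_uv):
            person_sampled = F.grid_sample(person_texture, person_uv, align_corners=False)
            bg_sampled = F.grid_sample(bg_texture, bg_uv, align_corners=False)
            # Person texture over background texture, selected by the person's UV validity mask.
            return person_mask * person_sampled + (1.0 - person_mask) * bg_sampled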
Highlights
  • By manipulating the timing of people’s movements, we can achieve a variety of effects that can change our perception of an event recorded in a video
  • We can temporally align different motions, change the speed of certain actions, or “erase” selected people from the video altogether. We achieve these effects computationally via a dedicated learning-based layered video representation, where each frame in the video is decomposed into separate RGBA layers, representing the appearance of different people in the video (a sketch of how per-person time warps operate on such layers appears after this list)
  • Our method can “erase” selected people from the video. All these effects are achieved via a novel deep neural network-based model that learns a layered decomposition of the input video, which is the pillar of our method
  • All the videos depict multiple people moving simultaneously and span a wide range of human actions in complex natural environments. Representative frames from these videos are shown in Figs. 1, 3 and 8, and the full input and output sequences are available in the supplementary material
  • We have presented a system for retiming people in video, and demonstrated various effects, including speeding up or slowing down different individuals’ motions, or removing or freezing the motion of a single individual in the scene
  • The core of our technique is a learned layer decomposition in which each layer represents the full appearance of an individual in the video: not just the person themselves, but also all space-time visual effects correlated with them, including the movement of the individual’s clothing and even challenging semi-transparent effects such as shadows and reflections
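    As referenced above, here is a minimal sketch of how per-person time warps could be applied once the RGBA layers are available. It simply recomposites already-predicted layers at retimed frame indices; the actual system re-renders retimed layers with its neural renderer, so treat this purely as an illustration of the retiming idea, with all names and shapes assumed:

        # Sketch: retime people by sampling each person's RGBA layer at a warped time index
        # and compositing back to front with the standard "over" operator.
        # Assumed data: layers[i] is a per-frame list of (rgb, alpha) numpy arrays, where
        # rgb is (H, W, 3), alpha is (H, W, 1), and layers[0] is the background layer.
        import numpy as np

        def retime(layers, time_warps, num_frames, order=None):
            order = order if order is not None else list(range(1, len(layers)))
            output = []
            for t in range(num_frames):
                bg_rgb, _ = layers[0][time_warps[0](t)]
                comp = bg_rgb.astype(np.float64)          # start from the (opaque) background
                for i in order:                           # composite people back to front
                    src_t = time_warps[i](t)
                    if src_t is None:                     # "erase" this person entirely
                        continue
                    rgb_i, alpha_i = layers[i][src_t]
                    comp = alpha_i * rgb_i + (1.0 - alpha_i) * comp   # "over" operator
                output.append(comp)
            return output

        # Example warps: keep the background as-is, freeze person 1 at frame 40,
        # and play person 2 at half speed.
        # warps = {0: lambda t: t, 1: lambda t: 40, 2: lambda t: t // 2}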
Methods
  • The authors add an additional background layer L_t^0, not associated with any person, that learns the background color.
  • Given this layered representation and a back-to-front ordering for the layers, denoted by o_t, each frame of the video can be rendered using the standard “over” operator [Porter and Duff 1984].
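    In LaTeX notation (standard Porter-Duff compositing written in our own symbols, not quoted from the paper), rendering proceeds back to front, starting from the opaque background layer:

        C_t^{(0)} = C_t^{0}, \qquad
        C_t^{(k)} = \alpha_t^{o_t(k)}\, C_t^{o_t(k)} + \bigl(1 - \alpha_t^{o_t(k)}\bigr)\, C_t^{(k-1)}, \qquad k = 1, \dots, N,

    so the final composite C_t^{(N)} is the rendered frame, which should reproduce the input frame I_t.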
Results
  • The authors tested the method on a number of real-world videos, most of which are captured by hand-held cellphone cameras.
  • All the videos depict multiple people moving simultaneously and span a wide range of human actions in complex natural environments.
  • Representative frames from these videos are shown in Figs. 1, 3 and 8, and the full input and output sequences are available in the supplementary material.
  • Depending on video length and the number of predicted layers, total training time was between 5 and 12 hours on two NVIDIA Tesla P100 GPUs. See Appendix A.1 for further implementation details.
Conclusion
  • The authors have presented a system for retiming people in video, and demonstrated various effects, including speeding up or slowing down different individuals’ motions, or removing or freezing the motion of a single individual in the scene.
  • The authors believe that the layered neural rendering approach holds great promise for additional types of synthesis techniques; they plan to generalize it to objects other than people and to extend it to other non-trivial post-processing effects, such as stylized rendering of different video components.
  • The core of the technique is a learned layer decomposition in which each layer represents the full appearance of an individual in the video: not just the person themselves, but also all space-time visual effects correlated with them, including the movement of the individual’s clothing and even challenging semi-transparent effects such as shadows and reflections.
Summary
  • Introduction:

    By manipulating the timing of people’s movements, the authors can achieve a variety of effects that can change the perception of an event recorded in a video.
  • The authors' method can “erase” selected people from the video
  • All these effects are achieved via a novel deep neural network-based model that learns a layered decomposition of the input video, which is the pillar of the method.
  • The input for layer i at time t is the sampled deep texture map T_t^i, which consists of person i’s sampled texture placed over the sampled background texture.
  • Objectives:

    The authors aim to achieve such effects computationally by retiming people in everyday videos.
  • Given an input video V, the goal is to decompose each frame I_t ∈ V into a set of RGBA layers:
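    (notation paraphrased in LaTeX from the surrounding text, not quoted verbatim from the paper)

        \bigl\{\, L_t^i = (C_t^i, \alpha_t^i) \,\bigr\}_{i=0}^{N},

    where C_t^i is the color image and \alpha_t^i the opacity (matte) of layer i at time t, layer i = 0 is the background layer, layers i = 1, \dots, N correspond to the N people in the scene, and compositing the layers in the ordering o_t should reproduce the original frame I_t.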
  • Methods:

    The authors add an additional background layer L_t^0, not associated with any person, that learns the background color.
  • Given this layered representation and a back-to-front ordering for the layers, denoted by o_t, each frame of the video can be rendered using the standard “over” operator [Porter and Duff 1984].
  • Results:

    The authors tested the method on a number of real-world videos, most of which are captured by hand-held cellphone cameras.
  • All the videos depict multiple people moving simultaneously and span a wide range of human actions in complex natural environments.
  • Representative frames from these videos are shown in Figs. 1, 3 and 8, and the full input and output sequences are available in the supplementary material.
  • Depending on video length and the number of predicted layers, total training time was between 5 and 12 hours on two NVIDIA Tesla P100 GPUs. See Appendix A.1 for further implementation details.
  • Conclusion:

    The authors have presented a system for retiming people in video, and demonstrated various effects, including speeding up or slowing down different individuals’ motions, or removing or freezing the motion of a single individual in the scene.
  • The authors believe that the layered neural rendering approach holds great promise for additional types of synthesis techniques; they plan to generalize it to objects other than people and to extend it to other non-trivial post-processing effects, such as stylized rendering of different video components.
  • The core of the technique is a learned layer decomposition in which each layer represents the full appearance of an individual in the video: not just the person themselves, but also all space-time visual effects correlated with them, including the movement of the individual’s clothing and even challenging semi-transparent effects such as shadows and reflections.
Related work
  • Video retiming. Our technique applies time warps (either designed manually or produced algorithmically) to people in the video, and re-renders the video to match the desired retiming. As such, it is related to a large body of work in computer vision and graphics that performs temporal remapping of videos for a variety of tasks. For example, [Bennett and McMillan 2007] sample the frames of an input video non-uniformly to produce computational time-lapse videos with desired objectives, such as minimizing or maximizing the resemblance of consecutive frames. [Zhou et al. 2014] use motion-based saliency to nonlinearly retime a video such that more “important” events in it occupy more time. [Davis and Agrawala 2018] retime a video such that the motions (visual rhythm) in the time-warped video match the beat of a target music track.

    Other important tasks related to video retiming are video summarization (e.g., [Lan et al. 2018]) and fast-forwarding [Joshi et al. 2015; Poleg et al. 2015; Silva et al. 2018], where frames are sampled from an input video to produce shorter summaries or videos with reduced camera motion or shake, as well as interactive manipulation of objects in a video (e.g., using tracked 2D object motion to pose different objects from a video into a still frame that never actually occurred [Goldman et al. 2008]).

    Most of these papers retime entire video frames by dropping or sampling frames. In contrast, we focus on people and their motions, and our effect is applied at the person/layer level. While many methods exist for processing video at the sub-frame or patch level, both for retiming (e.g., [Goldman et al. 2008; Pritch et al. 2008]) and for various other video manipulation tasks such as object removal and infinite video looping (e.g., [Agarwala et al. 2005; Barnes et al. 2010; Wexler et al. 2007]), none of these works operates at the level of individual people and the space-time visual effects correlated with them.
Funding
  • This work was funded in part by the EPSRC Programme Grant Seebibyte EP/M013774/1
Reference
  • Kfir Aberman, Mingyi Shi, Jing Liao, Dani Lischinski, Baoquan Chen, and Daniel Cohen-Or. 2019. Deep Video-Based Performance Cloning. In Computer Graphics Forum, Vol. 38. Wiley Online Library, 219–233.
  • Aseem Agarwala, Colin Zheng, Chris Pal, Maneesh Agrawala, Michael Cohen, Brian Curless, David Salesin, and Richard Szeliski. 2005. Panoramic Video Textures. In SIGGRAPH.
  • Jean-Baptiste Alayrac, João Carreira, and Andrew Zisserman. 2019a. The Visual Centrifuge: Model-Free Layered Video Representations. In CVPR.
  • Jean-Baptiste Alayrac, João Carreira, Relja Arandjelović, and Andrew Zisserman. 2019b. Controllable Attention for Structured Layered Video Decomposition. In ICCV.
  • Xue Bai, Jue Wang, David Simons, and Guillermo Sapiro. 2009. Video SnapCut: robust video object cutout using localized classifiers. TOG (2009).
  • Connelly Barnes, Dan B Goldman, Eli Shechtman, and Adam Finkelstein. 2010. Video Tapestries with Continuous Temporal Zoom. SIGGRAPH (2010).
  • Eric P Bennett and Leonard McMillan. 2007. Computational time-lapse video. In ACM SIGGRAPH 2007 Papers. 102–es.
  • Daniel Castro, Steven Hickson, Patsorn Sangkloy, Bhavishya Mittal, Sean Dai, James Hays, and Irfan Essa. 2018. Let’s Dance: Learning From Online Dance Videos. arXiv preprint.
  • Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. 2019. Everybody Dance Now. In Proceedings of the IEEE International Conference on Computer Vision. 5933–5942.
  • Yung-Yu Chuang, Aseem Agarwala, Brian Curless, David Salesin, and Richard Szeliski. 2002. Video matting of complex scenes. In SIGGRAPH.
  • Abe Davis and Maneesh Agrawala. 2018. Visual Rhythm and Beat. ACM Trans. Graph. 37, 4 (2018), Article 122.
  • Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. 2017. RMPE: Regional Multi-Person Pose Estimation. In ICCV.
  • Oran Gafni, Lior Wolf, and Yaniv Taigman. 2020. Vid2Game: Controllable Characters Extracted from Real-World Videos. In ICLR.
  • Yossi Gandelsman, Assaf Shocher, and Michal Irani. 2019. “Double-DIP”: Unsupervised Image Decomposition via Coupled Deep-Image-Priors. In CVPR.
  • Dan B. Goldman, Chris Gonterman, Brian Curless, David Salesin, and Steven M. Seitz. 2008. Video Object Annotation, Navigation, and Composition. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology (Monterey, CA, USA) (UIST '08). Association for Computing Machinery, New York, NY, USA, 3–12. https://doi.org/10.1145/1449715.1449719
  • Matthias Grundmann, Vivek Kwatra, and Irfan Essa. 2011. Auto-directed video stabilization with robust L1 optimal camera paths. In CVPR 2011. IEEE, 225–232.
  • Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. 2018. DensePose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7297–7306.
  • Qiqi Hou and Feng Liu. 2019. Context-Aware Image Matting for Simultaneous Foreground and Alpha Estimation. In ICCV.
  • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In CVPR.
  • Nebojsa Jojic and B. J. Frey. 2001. Learning flexible sprites in video layers. In CVPR.
  • Neel Joshi, Wolf Kienzle, Mike Toelle, Matt Uyttendaele, and Michael F Cohen. 2015. Real-time hyperlapse creation via optimal frame selection. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–9.
  • Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2015. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34, 6 (Oct. 2015), 248:1–248:16.
  • Ricardo Martin-Brualla, Rohit Pandey, Shuoran Yang, Pavel Pidlypenskyi, Jonathan Taylor, Julien Valentin, Sameh Khamis, Philip Davidson, Anastasia Tkach, Peter Lincoln, et al. 2018. LookinGood: Enhancing Performance Capture with Real-Time Neural Re-Rendering. ACM Trans. Graph. 37, 6, Article 255 (Dec. 2018), 14 pages. https://doi.org/10.1145/3272127.3275099
  • James McCann, Nancy S Pollard, and Siddhartha Srinivasa. 2006. Physics-based motion retiming. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 205–214.
  • Moustafa Meshry, Dan B Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. 2019. Neural Rerendering in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6878–6887.
  • Ajay Nandoriya, Elgharib Mohamed, Changil Kim, Mohamed Hefeeda, and Wojciech Matusik. 2017. Video Reflection Removal Through Spatio-Temporal Optimization. In ICCV.
  • Yair Poleg, Tavi Halperin, Chetan Arora, and Shmuel Peleg. 2015. EgoSampling: Fast-forward and stereo for egocentric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4768–4776.
  • Thomas Porter and Tom Duff. 1984. Compositing Digital Images. SIGGRAPH Comput. Graph. 18, 3 (Jan. 1984), 253–259. https://doi.org/10.1145/964965.808606
  • Yael Pritch, Alex Rav-Acha, and Shmuel Peleg. 2008. Nonchronological video synopsis and indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 11 (2008), 1971–1984.
  • Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision. IEEE, 2564–2571.
  • Michel Silva, Washington Ramos, Joao Ferreira, Felipe Chamone, Mario Campos, and Erickson R Nascimento. 2018. A weighted sparse sampling and smoothing frame transition approach for semantic fast-forward first-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2383–2392.
  • A.2.1 Neural renderer. The neural renderer architecture is a modified pix2pix network [Isola et al. 2017].
  • A.2.3 Keypoints-to-UVs. The keypoint-to-UV network is a fully convolutional network that takes in an RGB image of a skeleton and outputs a UV map of the same size. The architecture is the same as the neural renderer architecture, with the exception of the final layer, which is replaced by two heads: 1) a final convolutional layer with 25 output channels to predict body part and background classification, and 2) a convolutional layer with 48 output channels to regress UV coordinates for each of the 24 body parts. As in the DensePose work [Güler et al. 2018], we train the body part classifier with cross-entropy loss and train the predicted UV coordinates with L1 loss. The regression loss on the UV coordinates is only taken into account for a body part if the pixel lies within the specific part, as defined by the ground-truth UV map.
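    A minimal sketch of the two-headed output and losses described above, on top of an unspecified fully convolutional backbone; all names, shapes, and the loss normalization are illustrative assumptions, not the paper's code:

        # Sketch: 25-way part/background classification head plus a 48-channel UV regression
        # head (2 coordinates x 24 body parts), with the UV L1 loss counted only where the
        # ground-truth part labels say that part is present.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class KeypointsToUV(nn.Module):
            def __init__(self, backbone, feat_channels):
                super().__init__()
                self.backbone = backbone                                       # (B, 3, H, W) -> (B, C, H, W)
                self.part_head = nn.Conv2d(feat_channels, 25, kernel_size=1)   # background + 24 parts
                self.uv_head = nn.Conv2d(feat_channels, 48, kernel_size=1)     # (u, v) per part

            def forward(self, skeleton_rgb):
                feats = self.backbone(skeleton_rgb)
                return self.part_head(feats), self.uv_head(feats)

        def keypoints_to_uv_loss(part_logits, uv_pred, part_gt, uv_gt):
            # part_gt: (B, H, W) with labels in 0..24 (0 = background); uv_*: (B, 48, H, W).
            cls_loss = F.cross_entropy(part_logits, part_gt)
            B, _, H, W = uv_pred.shape
            uv_pred = uv_pred.view(B, 24, 2, H, W)
            uv_gt = uv_gt.view(B, 24, 2, H, W)
            part_ids = torch.arange(1, 25, device=part_gt.device).view(1, 24, 1, 1, 1)
            mask = (part_gt.unsqueeze(1).unsqueeze(2) == part_ids).float()     # (B, 24, 1, H, W)
            reg_loss = (mask * (uv_pred - uv_gt).abs()).sum() / mask.sum().clamp(min=1.0)
            return cls_loss + reg_loss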