Text-based Editing of Talking-head Video

ACM Transactions on Graphics (TOG) 38, 4, Article 68 (2019), 14 pages.


Abstract:

Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts).

Introduction
  • Talking-head video – framed to focus on the face and upper body of a speaker – is ubiquitous in movies, TV shows, commercials, YouTube video logs, and online lectures.
  • Using current video editing tools, like Adobe Premiere, skilled editors typically scrub through raw video footage to find relevant segments and assemble them into the desired story
  • They must carefully consider where to place cuts so as to minimize disruptions to the overall audio-visual flow.
  • An edit operation W will often change the length of the original video, so the authors retime a background sequence B to span the new duration.
  • The retimed sequence B matches neither the original nor the new audio, but it provides realistic background pixels and pose parameters that blend seamlessly into the rest of the video (a minimal retiming sketch follows this list).
  • In a later step, the authors synthesize frames based on the retimed background and on expression parameters that do match the audio.
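A minimal, hypothetical Python sketch of such retiming by nearest-neighbor frame resampling is shown below. The function name and the simple index resampling are assumptions for illustration; the paper's actual retiming is designed to be seamless rather than this naive resampling.

    import numpy as np

    def retime_background(frames, target_len):
        """Resample a background frame sequence to a new length.

        `frames` holds per-frame data (e.g. background pixels or head-pose
        parameters). Nearest-neighbor resampling is only an illustrative
        stand-in for the paper's seamless retiming step.
        """
        src_len = len(frames)
        # Map each target index to the nearest source index.
        idx = np.round(np.linspace(0, src_len - 1, target_len)).astype(int)
        return [frames[i] for i in idx]

    # Example: an edit shortens a 300-frame region to 240 frames.
    background = list(range(300))          # placeholder per-frame data
    retimed = retime_background(background, 240)
    assert len(retimed) == 240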
Highlights
  • Talking-head video – framed to focus on the face and upper body of a speaker – is ubiquitous in movies, TV shows, commercials, YouTube video logs, and online lectures
  • This paper presents a method that completes the suite of operations necessary for transcript-based editing of talking-head video
  • Viseme search depends on the size of the input video and the new edit
  • For a 1 hour recording with continuous speech, viseme search takes between 10 minutes and 2 hours for all word insertion operations in this paper
  • We presented the first approach that enables text-based editing of talking-head video by modifying the corresponding transcript
  • We obtain errors of 0.018 using 100%, 0.019 using 50% and 0.021 using only 5% of the data (R,G,B ∈ [0, 1])
  • Our approach enables a large variety of edits, such as addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis
Methods
  • The authors' system takes as input a video recording of a talking head with a transcript of the speech and any number of edit operations specified on the transcript.
  • Deleting the word “wonderful” in the sequence “hello wonderful world” is specified as (‘hello’, ‘world’), and adding the word “big” is specified as (‘hello’, ‘big’, ‘world’); a toy sketch of applying such word-tuple edits follows this list.
  • The authors' system processes these inputs in five main stages (Figure 2).
  • In the phoneme alignment stage (Section 3.1), the authors align the transcript to the video at the level of phonemes; in the tracking and reconstruction stage (Section 3.2), they register a 3D parametric head model with the video.
  • These are pre-processing steps performed once per input video.
  • The authors' viseme search, and their approach of combining shorter subsequences with parameter blending, are motivated by the phoneme/viseme distribution of the English language (Appendix A).
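As a toy illustration of the word-tuple edit operations described above, the hypothetical Python sketch below applies a delete and an insert to a transcript given as a word list. The function name and data structures are assumptions for exposition, not the system's actual interface.

    def apply_edit(words, edit):
        """Apply a word-tuple edit such as ('hello', 'world') or
        ('hello', 'big', 'world') to a transcript given as a word list.

        The tuple's first and last words anchor the edit region; anything
        between the anchors in the transcript is replaced by the words
        between the anchors in the tuple. Purely illustrative.
        """
        first, last = edit[0], edit[-1]
        i = words.index(first)
        j = words.index(last, i + 1)
        return words[:i + 1] + list(edit[1:-1]) + words[j:]

    print(apply_edit(["hello", "wonderful", "world"], ("hello", "world")))
    # -> ['hello', 'world']            (delete "wonderful")
    print(apply_edit(["hello", "world"], ("hello", "big", "world")))
    # -> ['hello', 'big', 'world']     (insert "big")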
Results
  • The authors show results for the full approach on a variety of videos, both recorded by themselves and downloaded from YouTube (Section 4).
  • 3D face reconstruction takes 110ms per frame.
  • Phoneme alignment takes 20 minutes for a 1 hour speech video.
  • Network training takes 42 hours.
  • The authors train for 600K iteration steps with a batch size of 1.
  • Viseme search depends on the size of the input video and the new edit.
  • For a 1 hour recording with continuous speech, viseme search takes between 10 minutes and 2 hours for all word insertion operations in this paper.
  • Neural face rendering takes 132ms per frame.
  • All other steps of the pipeline incur a negligible time penalty (a rough per-frame cost estimate follows this list).
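A rough back-of-the-envelope estimate combining the reported per-frame timings, under the assumption that 3D reconstruction and neural rendering dominate the per-frame cost:

    # Approximate per-frame processing cost from the reported timings
    # (assuming reconstruction and rendering are the dominant per-frame steps).
    reconstruction_ms = 110   # 3D face reconstruction per frame
    rendering_ms = 132        # neural face rendering per frame

    per_frame_ms = reconstruction_ms + rendering_ms
    print(f"~{per_frame_ms} ms per frame, i.e. about {1000 / per_frame_ms:.1f} fps")

    # One hour of 60 fps video, reconstruction only (preprocessing pass):
    frames = 60 * 60 * 60
    print(f"reconstruction over 1 h of 60 fps video: ~{frames * reconstruction_ms / 3.6e6:.1f} h")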
Conclusion
  • The authors presented the first approach that enables text-based editing of talking-head video by modifying the corresponding transcript.
  • The authors' approach enables a large variety of edits, such as addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.
  • The authors believe the approach is an important first step toward the goal of fully text-based editing and synthesis of general audio-visual content.
Objectives
  • Given an edit operation specified as a sequence of words W, the goal is to find matching sequences of phonemes in the video that can be combined to produce W.
  • Facial expressions δ ∈ R^64 are the most important parameters for this task, as they capture mouth and face movement, i.e., the visemes the authors aim to reproduce.
  • The authors' goal is to preserve the retrieved expression parameters as much as possible while smoothing out the transitions between them (a blending sketch follows this list).
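One simple way to smooth the transition between retrieved expression-parameter subsequences is a short linear cross-fade of the 64-dimensional vectors around each boundary. The Python sketch below is an illustrative assumption of that idea (function name, overlap length, and linear weights are all hypothetical), not the paper's exact blending scheme.

    import numpy as np

    def blend_transition(seq_a, seq_b, overlap=5):
        """Concatenate two expression-parameter sequences (frames x 64),
        linearly cross-fading the last `overlap` frames of seq_a with the
        first `overlap` frames of seq_b. Illustrative sketch only.
        """
        seq_a, seq_b = np.asarray(seq_a, float), np.asarray(seq_b, float)
        w = np.linspace(0.0, 1.0, overlap)[:, None]          # blend weights
        blended = (1.0 - w) * seq_a[-overlap:] + w * seq_b[:overlap]
        return np.concatenate([seq_a[:-overlap], blended, seq_b[overlap:]])

    a = np.zeros((30, 64))          # retrieved subsequence A
    b = np.ones((25, 64))           # retrieved subsequence B
    out = blend_transition(a, b)
    print(out.shape)                # (50, 64): 30 + 25 - 5 overlapping frames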
Tables
  • Table 1: Grouping phonemes (listed as ARPABET codes) into visemes. We use the viseme grouping of Annosoft’s lipsync tool [Annosoft 2008]. More viseme groups may lead to better visual matches (each group is more specific in its appearance), but require more data because the chance of finding a viseme match decreases. We did not perform an extensive evaluation of different viseme groupings, of which there are many.
  • Table 2: Input sequences. We recorded three sequences, each about 1 hour long. The sequences contain ground truth sentences and test sentences we edit, and also the first 500 sentences from the TIMIT dataset. We also downloaded a 1.5-hour-long interview from YouTube that contains camera and hand motion, and an erroneous transcript. Seq2 and Seq3 are both 60 fps. Seq1 was recorded at 240 fps, but since our method produces reasonable results with lower frame rates, we discarded frames and effectively used 60 fps. Seq4 is 25 fps, and still produces good results.
  • Table 3: We performed a user study with N = 138 participants and collected in total 2993 responses to evaluate the quality of our approach. Participants were asked to respond to the statement “This video clip looks real to me” on a 5-point Likert scale from 1 (strongly disagree) to 5 (strongly agree). We give the percentage for each score, the average score, and the percentage of cases the video was rated as ‘real’ (a score of 4 or higher). The difference between conditions is statistically significant (Kruskal-Wallis test, p < 10^-30). Our results are different from both GT-base and from GT-target (Tukey’s honest significant difference procedure, p < 10^-9 for both tests). This suggests that while our results are often rated as real, they are still not on par with real video. (A sketch of this style of analysis follows this list.)
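For readers who want to run the same style of analysis on their own Likert-scale ratings, the hypothetical Python sketch below applies the tests named in the caption (Kruskal-Wallis across conditions, then Tukey's HSD) to placeholder score arrays; it is not the authors' analysis code and uses random data in place of the study's responses.

    import numpy as np
    from scipy.stats import kruskal
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Hypothetical 1-5 Likert ratings per condition (placeholders, not study data).
    ours      = np.random.randint(1, 6, 200)
    gt_base   = np.random.randint(1, 6, 200)
    gt_target = np.random.randint(1, 6, 200)

    # Omnibus test across the three conditions.
    h, p = kruskal(ours, gt_base, gt_target)
    print(f"Kruskal-Wallis: H={h:.2f}, p={p:.3g}")

    # Pairwise post-hoc comparison (Tukey's honest significant difference).
    scores = np.concatenate([ours, gt_base, gt_target])
    labels = ["ours"] * 200 + ["GT-base"] * 200 + ["GT-target"] * 200
    print(pairwise_tukeyhsd(scores, labels))

    # Fraction of clips rated as 'real' (score of 4 or higher).
    print("rated real:", np.mean(ours >= 4))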
Funding
  • This work was supported by the Brown Institute for Media Innovation, the Max Planck Center for Visual Computing and Communications, ERC Consolidator Grant 4DRepLy (770784), Adobe Systems, and the Office of the Dean for Research at Princeton University
Study subjects and analysis
participants: 138
5.7 User Study. To quantitatively evaluate the quality of videos generated by our text-based editing system, we performed a web-based user study with N = 138 participants and collected 2993 individual responses (see Table 3). The study includes videos of two different talking heads, Set 1 and Set 2, where each set contains 6 different base sentences.

References
  • Annosoft. 2008. Lipsync Tool. http://www.annosoft.com/docs/Visemes17.html
  • Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F. Cohen. 2017. Bringing Portraits to Life. ACM Transactions on Graphics (SIGGRAPH Asia) 36, 6 (November 2017), 196:1–13. https://doi.org/10.1145/3130800.3130818
  • Aayush Bansal, Shugao Ma, Deva Ramanan, and Yaser Sheikh. 2018. Recycle-GAN: Unsupervised Video Retargeting. In ECCV.
  • Floraine Berthouzoz, Wilmot Li, and Maneesh Agrawala. 2012. Tools for Placing Cuts and Transitions in Interview Video. ACM Trans. Graph. 31, 4, Article 67 (July 2012), 8 pages. https://doi.org/10.1145/2185520.2185563
  • Volker Blanz, Kristina Scherbaum, Thomas Vetter, and Hans-Peter Seidel. 2004. Exchanging Faces in Images. Computer Graphics Forum (Eurographics) 23, 3 (September 2004), 669–676. https://doi.org/10.1111/j.1467-8659.2004.00799.x
  • Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH). 187–194. https://doi.org/10.1145/311535.311556
  • James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou. 2018. Large Scale 3D Morphable Models. International Journal of Computer Vision 126, 2 (April 2018), 233–254. https://doi.org/10.1007/s11263-017-1009-7
  • Christoph Bregler, Michele Covell, and Malcolm Slaney. 1997. Video Rewrite: Driving Visual Speech with Audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’97). ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 353–360. https://doi.org/10.1145/258734.258880
  • Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. 2015. Real-time High-fidelity Facial Performance Capture. ACM Transactions on Graphics (SIGGRAPH) 34, 4 (July 2015), 46:1–9. https://doi.org/10.1145/2766943
  • Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. 2018. Everybody Dance Now. arXiv e-prints (August 2018). arXiv:1808.07371
  • Yao-Jen Chang and Tony Ezzat. 2005. Transferable Videorealistic Speech Animation. In Symposium on Computer Animation (SCA). 143–151. https://doi.org/10.1145/1073368.1073388
  • Qifeng Chen and Vladlen Koltun. 2017. Photographic Image Synthesis with Cascaded Refinement Networks. In International Conference on Computer Vision (ICCV). 1520–1529. https://doi.org/10.1109/ICCV.2017.168
  • Pengfei Dou, Shishir K. Shah, and Ioannis A. Kakadiaris. 2017. End-To-End 3D Face Reconstruction With Deep Neural Networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. 2016. JALI: An Animator-centric Viseme Model for Expressive Lip Synchronization. ACM Trans. Graph. 35, 4, Article 127 (July 2016), 11 pages. https://doi.org/10.1145/2897824.2925984
  • Tony Ezzat, Gadi Geiger, and Tomaso Poggio. 2002. Trainable Videorealistic Speech Animation. ACM Transactions on Graphics (SIGGRAPH) 21, 3 (July 2002), 388–398. https://doi.org/10.1145/566654.566594
  • Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, and Paul Debevec. 2014. Driving High-Resolution Facial Scans with Video Performance Capture. ACM Transactions on Graphics 34, 1 (December 2014), 8:1–14.
  • J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. 1993. DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM. http://www.ldc.upenn.edu/Catalog/LDC93S1.html
  • Pablo Garrido, Levi Valgaerts, Ole Rehmsen, Thorsten Thormaehlen, Patrick Pérez, and Christian Theobalt. 2014. Automatic Face Reenactment. In CVPR. 4217–4224. https://doi.org/10.1109/CVPR.2014.537
  • Pablo Garrido, Levi Valgaerts, Hamid Sarmadi, Ingmar Steiner, Kiran Varanasi, Patrick Pérez, and Christian Theobalt. 2015. VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track. Computer Graphics Forum (Eurographics) 34, 2 (May 2015), 193–204. https://doi.org/10.1111/cgf.12552
  • Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Pérez, and Christian Theobalt. 2016. Reconstruction of Personalized 3D Face Rigs from Monocular Video. ACM Transactions on Graphics 35, 3 (June 2016), 28:1–15. https://doi.org/10.1145/2890493
  • Jiahao Geng, Tianjia Shao, Youyi Zheng, Yanlin Weng, and Kun Zhou. 2018. Warp-guided GANs for Single-photo Facial Animation. In SIGGRAPH Asia 2018 Technical Papers (SIGGRAPH Asia ’18). ACM, New York, NY, USA, Article 231, 231:1–231:12. http://doi.acm.org/10.1145/3272127.3275043
  • Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and William T. Freeman. 2018. Unsupervised Training for 3D Morphable Model Regression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems.
  • Y. Guo, J. Zhang, J. Cai, B. Jiang, and J. Zheng. 2018. CNN-based Real-time Dense Face Reconstruction with Inverse-rendered Photo-realistic Face Images. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), 1–1. https://doi.org/10.1109/TPAMI.2018.2837742
  • Andrew J Hunt and Alan W Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on, Vol. 1. IEEE, 373–376.
  • IBM. 2016. IBM Speech to Text Service. https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/speech-to-text/. Accessed 2016-12-17.
  • Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. 2015. Dynamic 3D Avatar Creation from Hand-held Video Input. ACM Transactions on Graphics (SIGGRAPH) 34, 4 (July 2015), 45:1–14. https://doi.org/10.1145/2766974
  • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Conference on Computer Vision and Pattern Recognition (CVPR). 5967–5976. https://doi.org/10.1109/CVPR.2017.632
  • Zeyu Jin, Gautham J Mysore, Stephen Diverdi, Jingwan Lu, and Adam Finkelstein. 2017. VoCo: text-based insertion and replacement in audio narration. ACM Transactions on Graphics (TOG) 36, 4 (2017), 96.
  • Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations (ICLR).
  • Ira Kemelmacher-Shlizerman. 2013. Internet-Based Morphable Model. In International Conference on Computer Vision (ICCV). 3256–3263.
  • Ira Kemelmacher-Shlizerman, Aditya Sankar, Eli Shechtman, and Steven M. Seitz. 2010. Being John Malkovich. In European Conference on Computer Vision (ECCV). 341–353. https://doi.org/10.1007/978-3-642-15549-9_25
  • Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. 2018a. Deep Video Portraits. ACM Transactions on Graphics (TOG) 37, 4 (2018), 163.
  • H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. 2018b. Deep Video Portraits. ACM Transactions on Graphics (TOG) (2018).
  • Mackenzie Leake, Abe Davis, Anh Truong, and Maneesh Agrawala. 2017. Computational Video Editing for Dialogue-driven Scenes. ACM Trans. Graph. 36, 4, Article 130 (July 2017), 14 pages. https://doi.org/10.1145/3072959.3073653
  • Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, Vol. 10. 707–710.
  • Kai Li, Qionghai Dai, Ruiping Wang, Yebin Liu, Feng Xu, and Jue Wang. 2014. A DataDriven Approach for Facial Expression Retargeting in Video. IEEE Transactions on Multimedia 16, 2 (February 2014), 299–310.
  • Kang Liu and Joern Ostermann. 2011. Realistic facial expression synthesis for an imagebased talking head. In International Conference on Multimedia and Expo (ICME). https://doi.org/10.1109/ICME.2011.6011835
  • L. Liu, W. Xu, M. Zollhoefer, H. Kim, F. Bernard, M. Habermann, W. Wang, and C. Theobalt. 2018. Neural Animation and Reenactment of Human Actor Videos. ArXiv e-prints (September 2018). arXiv:1809.03658
  • Zicheng Liu, Ying Shan, and Zhengyou Zhang. 2001. Expressive Expression Mapping with Ratio Images. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH). 271–276. https://doi.org/10.1145/383259.383289
  • Ricardo Martin-Brualla, Rohit Pandey, Shuoran Yang, Pavel Pidlypenskyi, Jonathan Taylor, Julien Valentin, Sameh Khamis, Philip Davidson, Anastasia Tkach, Peter Lincoln, Adarsh Kowdle, Christoph Rhemann, Dan B Goldman, Cem Keskin, Steve Seitz, Shahram Izadi, and Sean Fanello. 2018. LookinGood: Enhancing Performance Capture with Real-time Neural Re-rendering. ACM Trans. Graph. 37, 6, Article 255 (December 2018), 14 pages.
  • Wesley Mattheyses, Lukas Latacz, and Werner Verhelst. 2010. Optimized photorealistic audiovisual speech synthesis using active appearance modeling. In Auditory-Visual Speech Processing. 8–1.
  • Mehdi Mirza and Simon Osindero. 2014. Conditional Generative Adversarial Nets. (2014). https://arxiv.org/abs/1411.1784 arXiv:1411.1784.
  • Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agarwal, Jens Fursund, and Hao Li. 2018. paGAN: Real-time Avatars Using Dynamic Textures. In SIGGRAPH Asia 2018 Technical Papers (SIGGRAPH Asia ’18). ACM, New York, NY, USA, Article 258, 12 pages. https://doi.org/10.1145/3272127.3275075
  • Robert Ochshorn and Max Hawkins. 2016. Gentle: A Forced Aligner. https://lowerquality.com/gentle/. Accessed 2018-09-25.
  • Kyle Olszewski, Zimo Li, Chao Yang, Yi Zhou, Ronald Yu, Zeng Huang, Sitao Xiang, Shunsuke Saito, Pushmeet Kohli, and Hao Li. 2017. Realistic Dynamic Facial Textures from a Single Image using GANs. In International Conference on Computer Vision (ICCV). 5439–5448. https://doi.org/10.1109/ICCV.2017.580
  • Amy Pavel, Dan B Goldman, Björn Hartmann, and Maneesh Agrawala. 2016. VidCrit: Video-based Asynchronous Video Review. In Proc. of UIST. ACM, 517–528.
  • Amy Pavel, Colorado Reed, Björn Hartmann, and Maneesh Agrawala. 2014. Video Digests: A Browsable, Skimmable Format for Informational Lecture Videos. In Proc. of UIST. 573–582.
  • Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In International Conference on Learning Representations (ICLR).
  • Elad Richardson, Matan Sela, and Ron Kimmel. 2016. 3D Face Reconstruction by Learning from Synthetic Data. In International Conference on 3D Vision (3DV). 460– 469. https://doi.org/10.1109/3DV.2016.56
  • Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. 2017. Learning Detailed Face Reconstruction from a Single Image. In Conference on Computer Vision and Pattern Recognition (CVPR). 5553–5562. https://doi.org/10.1109/CVPR.2017.589
  • Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). 234–241.
  • Joseph Roth, Yiying Tong Tong, and Xiaoming Liu. 2017. Adaptive 3D Face Reconstruction from Unconstrained Photo Collections. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 11 (November 2017), 2127–2141. https://doi.org/10.1109/TPAMI.2016.2636829
  • Steve Rubin, Floraine Berthouzoz, Gautham J Mysore, Wilmot Li, and Maneesh Agrawala. 2013. Content-based tools for editing audio stories. In Proceedings of the 26th annual ACM symposium on User interface software and technology. 113–122.
  • Matan Sela, Elad Richardson, and Ron Kimmel. 2017. Unrestricted Facial Geometry Reconstruction Using Image-to-Image Translation. In International Conference on Computer Vision (ICCV). 1585–1594. https://doi.org/10.1109/ICCV.2017.175
  • Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. In ICASSP. IEEE, 4779–4783.
  • Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. 2014. Automatic Acquisition of High-fidelity Facial Performances Using Monocular Videos. ACM Transactions on Graphics (SIGGRAPH Asia) 33, 6 (November 2014), 222:1–13. https://doi.org/10.1145/2661229.2661290
  • Hijung Valentina Shin, Wilmot Li, and Frédo Durand. 2016. Dynamic Authoring of Audio with Linked Scripts. In Proc. of UIST. 509–516.
  • Qianru Sun, Ayush Tewari, Weipeng Xu, Mario Fritz, Christian Theobalt, and Bernt Schiele. 2018. A Hybrid Model for Identity Obfuscation by Face Replacement. In European Conference on Computer Vision (ECCV).
  • Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing Obama: Learning Lip Sync from Audio. ACM Trans. Graph. 36, 4, Article 95 (July 2017), 13 pages. https://doi.org/10.1145/3072959.3073640
  • Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. 2017. A Deep Learning Approach for Generalized Speech Animation. ACM Trans. Graph. 36, 4, Article 93 (July 2017), 11 pages. https://doi.org/10.1145/3072959.3073699
  • Ayush Tewari, Michael Zollhöfer, Florian Bernard, Pablo Garrido, Hyeongwoo Kim, Patrick Perez, and Christian Theobalt. 2018a. High-Fidelity Monocular Face Reconstruction based on an Unsupervised Model-based Face Autoencoder. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), 1–1. https://doi.org/10.1109/TPAMI.2018.2876842
  • Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Pérez, and Christian Theobalt. 2018b. Self-supervised Multi-level Face Model Learning for Monocular Reconstruction at over 250 Hz. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Pérez, and Christian Theobalt. 2017. MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In ICCV. 3735–3744. https://doi.org/10.1109/ICCV.2017.401
  • Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. In Conference on Computer Vision and Pattern Recognition (CVPR). 2387–2395. https://doi.org/10.1109/CVPR.2016.262
  • Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gerard Medioni. 2017. Regressing Robust and Discriminative 3D Morphable Models with a very Deep Neural Network. In Conference on Computer Vision and Pattern Recognition (CVPR). 1493–1502. https://doi.org/10.1109/CVPR.2017.163
  • Anh Truong, Floraine Berthouzoz, Wilmot Li, and Maneesh Agrawala. 2016. Quickcut: An interactive tool for editing narrated video. In Proc. of UIST. 497–507.
  • Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. In SSW. 125.
  • Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popović. 2005. Face Transfer with Multilinear Models. ACM Transactions on Graphics (SIGGRAPH) 24, 3 (July 2005), 426–433. https://doi.org/10.1145/1073204.1073209
  • Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018a. Video-to-Video Synthesis. In Advances in Neural Information Processing Systems (NeurIPS).
  • Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018b. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In CVPR.
  • O. Wiles, A.S. Koepke, and A. Zisserman. 2018. X2Face: A network for controlling face generation by using images, audio, and pose codes. In European Conference on Computer Vision.
  • Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. The Journal of the Acoustical Society of America 123, 5 (2008), 3878–3878. https://doi.org/10.1121/1.2935783
  • Heiga Zen, Keiichi Tokuda, and Alan W Black. 2009. Statistical parametric speech synthesis. Speech Communication 51, 11 (2009), 1039–1064.
  • Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. 2018. Visemenet: Audio-driven Animator-centric Speech Animation. ACM Trans. Graph. 37, 4, Article 161 (July 2018), 161:1–161:10 pages.
  • M. Zollhöfer, J. Thies, P. Garrido, D. Bradley, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt. 2018. State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications. Computer Graphics Forum (Eurographics State of the Art Reports 2018) 37, 2 (2018).
Appendix A: Phoneme & Viseme Content
  • Our matching algorithm (Section 3.3) is designed to find the longest match between subsequences of phonemes/visemes in the edit and the input video. Suppose our input video consists of all the sentences in the TIMIT corpus [Garofolo et al. 1993], a set that has been designed to be phonetically rich by acoustic-phonetic researchers. Figure 17 plots the probability of finding an exact match anywhere in TIMIT to a phoneme/viseme subsequence of length K ∈ [1, 10]. Exact matches of more than 4-6 visemes or 3-5 phonemes are rare. This result suggests that even with phonetically rich input video we cannot expect to find edits consisting of long sequences of phonemes/visemes (e.g. multi-word insertions) in the input video, and that our approach of combining shorter subsequences with parameter blending is necessary (a toy illustration follows).
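The hypothetical Python sketch below illustrates the kind of computation behind Figure 17: map ARPABET phonemes to viseme groups and check whether every length-K window of an edit occurs verbatim in a corpus. The tiny viseme table is a toy subset (the bilabial and labiodental groupings are standard), not the full Annosoft mapping of Table 1, and the corpus is a placeholder rather than TIMIT.

    # Toy subset of a phoneme-to-viseme grouping (ARPABET codes); the real
    # table (Table 1) uses the Annosoft lipsync grouping and covers all phonemes.
    PHONEME_TO_VISEME = {
        "P": "BMP", "B": "BMP", "M": "BMP",        # bilabial closure
        "F": "FV",  "V": "FV",                     # labiodental
        "AA": "open", "AE": "open", "AH": "open",  # open-mouth vowels (toy grouping)
    }

    def to_visemes(phonemes):
        return [PHONEME_TO_VISEME.get(p, p) for p in phonemes]

    def has_exact_match(corpus, query, k):
        """True if every length-k window of `query` occurs contiguously in `corpus`."""
        corpus_ngrams = {tuple(corpus[i:i + k]) for i in range(len(corpus) - k + 1)}
        return all(tuple(query[i:i + k]) in corpus_ngrams
                   for i in range(len(query) - k + 1))

    corpus = to_visemes(["M", "AA", "P", "F", "AE", "B", "V", "AH"])  # placeholder corpus
    query  = to_visemes(["B", "AA", "M"])                             # placeholder edit
    for k in range(1, len(query) + 1):
        print(k, has_exact_match(corpus, query, k))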