VideoFlow: A Flow-Based Generative Model for Video

Manoj Kumar
Mohammad Babaeizadeh

arXiv: Computer Vision and Pattern Recognition, 2019.


Abstract:

Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. In particular, learning predictive models of videos offers an especially appealing mechanism to enable a rich understanding of the physical world: videos of real-world interactions...

Introduction
  • Exponential progress in the capabilities of computational hardware, paired with a relentless effort towards greater insights and better methods, has pushed the field of machine learning from relative obscurity into the mainstream.
  • Videos of real-world interactions are plentiful and readily available, and a large generative model can be trained on large unlabeled datasets containing many video sequences, thereby learning about a wide range of real-world phenomena.
  • Such a model could be useful for learning representations for further downstream tasks (Mathieu et al, 2016), or could even be used directly in applications where predicting the future enables effective decision making and control, such as robotics (Finn et al, 2016).
  • A number of recent works have studied probabilistic models that can represent uncertain futures; however, such models are either extremely expensive computationally or do not directly optimize the likelihood of the data.
Highlights
  • Exponential progress in the capabilities of computational hardware, paired with a relentless effort towards greater insights and better methods, has pushed the field of machine learning from relative obscurity into the mainstream
  • The application of machine learning technology has been largely constrained to situations where large amounts of supervision are available, such as in image classification or machine translation, or where highly accurate simulations of the environment are available to the learning agent, such as in game-playing agents.
  • Our empirical results show that VideoFlow achieves results that are competitive with the state-of-the-art in stochastic video prediction on the action-free BAIR dataset, with quantitative results that rival the best variational autoencoder (VAE)-based models.
  • Fréchet Video Distance (FVD): We evaluate VideoFlow using the recently proposed Fréchet Video Distance (FVD) metric (Unterthiner et al., 2018), an adaptation of the Fréchet Inception Distance (FID) metric (Heusel et al., 2017) to video generation. Unterthiner et al. (2018) report results for models trained on a total of 16 frames with 2 conditioning frames, while we train our VideoFlow model on 13 frames (see the FVD sketch after this list).
  • Even in settings that are disadvantageous to VideoFlow, where we compute the FVD on a total of 16 frames even though the model was trained on just 13, VideoFlow performs comparably to SAVP.
  • Our empirical results show that VideoFlow achieves results that are competitive with state-of-the-art VAE-based models in stochastic video prediction.
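
FVD compares generated and real videos through the Fréchet (2-Wasserstein) distance between Gaussians fitted to video embeddings; in Unterthiner et al. (2018) those embeddings come from a pretrained I3D action-recognition network. Below is a minimal sketch of the underlying distance, assuming real_feats and fake_feats are (num_videos, dim) arrays of such embeddings; the array names and the embedding step are illustrative placeholders, not the authors' code.

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(real_feats, fake_feats):
        # Fit a Gaussian (mean, covariance) to each set of embeddings.
        mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
        cov_r = np.cov(real_feats, rowvar=False)
        cov_f = np.cov(fake_feats, rowvar=False)
        # ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2});
        # sqrtm can return a tiny imaginary part from numerical error.
        covmean = sqrtm(cov_r @ cov_f).real
        diff = mu_r - mu_f
        return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))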
Methods
  • All the generated videos and qualitative results can be viewed at this website. In the generated videos, a border of blue represents the conditioning frame, while a border of red represents the generated frames.

    5.1 VIDEO MODELLING WITH THE STOCHASTIC MOVEMENT DATASET

    The authors use VideoFlow to model the Stochastic Movement Dataset used in Babaeizadeh et al. (2017).
  • Since the shape moves with uniform speed, the authors should be able to model the position of the shape at step t + 1 using only its position at step t.
  • Using this insight, the authors extract random temporal patches of 2 frames from each 3-frame video (see the sketch after this list).
  • On generating videos conditioned on the first frame, the authors observe that the model consistently predicts the future trajectory of the shape to be one of the eight random directions.
  • The authors inform the rater that a "real" trajectory is one in which the shape is consistent in color and congruent across frames.
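
A minimal sketch of the 2-frame patch extraction described above, assuming videos is a NumPy array of shape (num_videos, 3, height, width, channels) holding the 3-frame clips; the function and argument names are illustrative, not the authors' pipeline.

    import numpy as np

    def sample_frame_pairs(videos, rng=None):
        # From each 3-frame clip, pick a random pair of consecutive
        # frames; uniform motion makes frame t sufficient to model t+1.
        rng = rng or np.random.default_rng()
        num_videos, num_frames = videos.shape[0], videos.shape[1]
        starts = rng.integers(0, num_frames - 1, size=num_videos)  # 0 or 1
        return np.stack([videos[i, s:s + 2] for i, s in enumerate(starts)])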
Conclusion
  • The authors describe a practically applicable architecture for flow-based video prediction models, inspired by the Glow model for image generation (Kingma & Dhariwal, 2018), which they call VideoFlow.
  • The authors introduce a latent dynamical system model that predicts future values of the flow model's latent state, replacing the standard unconditional prior distribution (see the sketch after this list).
  • The authors' empirical results show that VideoFlow achieves results that are competitive with state-of-the-art VAE-based models in stochastic video prediction.
  • The authors plan to incorporate memory in VideoFlow to model arbitrarily long-range dependencies and to apply the model to challenging downstream tasks.
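
A minimal sketch of the kind of autoregressive latent prior the conclusion describes: instead of scoring the flow's latent state z_t under a fixed N(0, I) prior, a learned map predicts the mean and log-scale of z_t from z_{t-1}. Here predict_mu_log_sigma is a placeholder for the paper's network, and the paper's actual parameterization is more elaborate.

    import numpy as np

    def gaussian_log_prob(z, mu, log_sigma):
        # log N(z; mu, exp(log_sigma)^2), summed over all latent dims.
        return np.sum(-0.5 * np.log(2.0 * np.pi) - log_sigma
                      - 0.5 * ((z - mu) / np.exp(log_sigma)) ** 2)

    def latent_prior_log_prob(z_seq, predict_mu_log_sigma):
        # z_seq: latent states z_1..z_T produced by the flow encoder.
        # The first state keeps a standard-normal prior; each later
        # state is scored under a Gaussian predicted from its predecessor.
        total = gaussian_log_prob(z_seq[0], np.zeros_like(z_seq[0]),
                                  np.zeros_like(z_seq[0]))
        for z_prev, z_t in zip(z_seq[:-1], z_seq[1:]):
            mu, log_sigma = predict_mu_log_sigma(z_prev)
            total += gaussian_log_prob(z_t, mu, log_sigma)
        return total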
Tables
  • Table 1: We compare the realism of generated trajectories against SAVP-VAE and SV2P using a real-vs-fake two-alternative forced choice (2AFC) study on Amazon Mechanical Turk.
  • Table 2: Left: We report the average bits-per-pixel across 10 target frames, with 3 conditioning frames, for the BAIR action-free dataset (see the sketch after this list).
  • Table 3: Fréchet Video Distance: We report the mean and standard deviation across 5 runs for 3 different frame settings. Results are not directly comparable across models due to differences in the total number of frames seen during training and in the number of conditioning frames.
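
For reference, the bits-per-pixel figure in Table 2 is a negative log-likelihood normalized per dimension and converted from nats to bits. A minimal sketch, assuming nll_nats is the model's total NLL (in nats) over the target frames and that each of the height x width x channels values counts as one dimension, the usual bits-per-dim convention; the 64x64 RGB defaults match the BAIR frames.

    import math

    def bits_per_pixel(nll_nats, num_frames=10, height=64, width=64, channels=3):
        # Divide by ln(2) to convert nats to bits, then by the number
        # of dimensions to get the per-pixel (per-dim) figure.
        num_dims = num_frames * height * width * channels
        return nll_nats / (num_dims * math.log(2.0))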
References
  • Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
  • Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179, 2015.
  • Byron Boots, Arunkumar Byravan, and Dieter Fox. Learning predictive models of a depth camera & manipulator from raw execution traces. In International Conference on Robotics and Automation (ICRA), 2014.
  • Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Neural Information Processing Systems (NIPS), 2016.
  • Gustavo Deco and Wilfried Brauer. Higher order statistical decorrelation without information loss. In Advances in Neural Information Processing Systems, pp. 247–254, 1995.
  • Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.
  • Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.
  • Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
  • Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pp. 658–666, 2016.
  • Frederik Ebert, Chelsea Finn, Alex X. Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268, 2017.
  • Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In International Conference on Robotics and Automation (ICRA), 2017.
  • Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, 2016.
  • Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2211–2221, 2017.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  • Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
  • Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
  • Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In International Conference on Machine Learning (ICML), 2017.
  • Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2013.
  • Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236–10245, 2018.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106–1114, 2012.
  • Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
  • Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Flow-grounded spatial-temporal video prediction from still images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 600–615, 2018.
  • Ziwei Liu, Raymond Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In International Conference on Computer Vision (ICCV), 2017.
  • William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. In International Conference on Learning Representations (ICLR), 2017.
  • Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In International Conference on Learning Representations (ICLR), 2016.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
  • Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A. Hasegawa-Johnson, Roy H. Campbell, and Thomas S. Huang. Fast generation for convolutional autoregressive models. arXiv preprint arXiv:1704.06001, 2017.
  • Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: A baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
  • Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Dan Belov, and Nando de Freitas. Parallel multiscale autoregressive density estimation. arXiv preprint arXiv:1703.03664, 2017.
  • Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1530–1538, 2015.
  • Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1278–1286, 2014.
  • David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
  • Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 2015.
  • Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges, 2018.
  • Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016a.
  • Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016b.
  • Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328, 2016c.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
  • Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. Tensor2Tensor for neural machine translation. arXiv preprint arXiv:1803.07416, 2018.
  • Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017a.
  • Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to generate long-term future via hierarchical prediction. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3560–3569. JMLR.org, 2017b.
  • Carl Vondrick and Antonio Torralba. Generating the future with adversarial transformers. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023, 2015.
  • Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In International Conference on Computer Vision (ICCV), 2015.
  • Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 2004.
  • Jianwen Xie, Song-Chun Zhu, and Ying Nian Wu. Synthesizing dynamic patterns by spatial-temporal generative ConvNet. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • Jianwen Xie, Ruiqi Gao, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu. Learning dynamic generator model by alternating back-propagation through time. Proceedings of the AAAI Conference on Artificial Intelligence, 33:5498–5507, 2019. doi: 10.1609/aaai.v33i01.33015498.
  • Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor and generator networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):27–45, 2020. doi: 10.1109/TPAMI.2018.2879081.
  • Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, 2015.
  • Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, 2016.
  • Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint, 2018.