VideoFlow: A Flow-Based Generative Model for Video
arXiv: Computer Vision and Pattern Recognition, 2019.
Abstract:
Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. In particular, learning predictive models of videos offers an especially appealing mechanism to enable a rich understanding of the physical world: videos of real-world interactions…
Introduction
- Exponential progress in the capabilities of computational hardware, paired with a relentless effort towards greater insights and better methods, has pushed the field of machine learning from relative obscurity into the mainstream.
- Videos of real-world interactions are plentiful and readily available, and a large generative model can be trained on large unlabeled datasets containing many video sequences, thereby learning about a wide range of real-world phenomena.
- Such a model could be useful for learning representations for further downstream tasks (Mathieu et al, 2016), or could even be used directly in applications where predicting the future enables effective decision making and control, such as robotics (Finn et al, 2016).
- A number of recent works have studied probabilistic models that can represent uncertain futures; however, such models are either extremely expensive computationally or do not directly optimize the likelihood of the data. Flow-based models, by contrast, optimize an exact likelihood (see the sketch below).
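For context, the exact-likelihood objective that flow-based models such as VideoFlow optimize is the standard change-of-variables formula (a generic normalizing-flow statement, not the paper's exact multi-scale factorization):

```latex
% Change-of-variables objective for a normalizing flow:
% an invertible f_theta maps a video x to latents z with a simple prior p(z).
\log p_\theta(\mathbf{x})
  = \log p(\mathbf{z})
  + \log \left| \det \frac{\partial f_\theta(\mathbf{x})}{\partial \mathbf{x}} \right|,
\qquad \mathbf{z} = f_\theta(\mathbf{x})
```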
Highlights
- Exponential progress in the capabilities of computational hardware, paired with a relentless effort towards greater insights and better methods, has pushed the field of machine learning from relative obscurity into the mainstream
- The application of machine learning technology has been largely constrained to situations where large amounts of supervision are available, such as image classification or machine translation, or where highly accurate simulations of the environment are available to the learning agent, such as in game-playing agents.
- Our empirical results show that VideoFlow achieves results that are competitive with the state-of-the-art in stochastic video prediction on the action-free BAIR dataset, with quantitative results that rival the best variational autoencoder (VAE) based models.
- Fréchet Video Distance (FVD): We evaluate VideoFlow using the recently proposed Fréchet Video Distance (FVD) metric (Unterthiner et al, 2018), an adaptation of the Fréchet Inception Distance (FID) metric (Heusel et al, 2017) to video generation; the underlying distance is sketched after this list. Unterthiner et al (2018) report results for models trained on a total of 16 frames with 2 conditioning frames, while we train our VideoFlow model on a total of 13 frames with 3 conditioning frames.
- Even in settings that are disadvantageous to VideoFlow, where we compute the FVD on a total of 16 frames despite training on just 13, VideoFlow performs comparably to SAVP.
- Our empirical results show that VideoFlow achieves results that are competitive with the state-of-the-art VAE models in stochastic video prediction
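The Fréchet distance underlying both FID and FVD compares Gaussians fitted to feature embeddings of real and generated samples; for FVD the embeddings come from a pretrained video network. Below is a minimal illustrative sketch, where `real_feats` and `gen_feats` are assumed (N, D) embedding matrices and `frechet_distance` is a hypothetical helper, not code from the paper:

```python
# Minimal illustrative sketch of the Frechet distance used by FID/FVD
# (Heusel et al, 2017; Unterthiner et al, 2018). For FVD, `real_feats`
# and `gen_feats` would be embeddings of real and generated videos from
# a pretrained video network; here they are assumed (N, D) arrays.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats, gen_feats):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product; numerical error can
    # introduce a small imaginary component, so keep only the real part.
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```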
Methods
- All the generated videos and qualitative results can be viewed at this website. In the generated videos, a border of blue represents the conditioning frame, while a border of red represents the generated frames.
5.1 VIDEO MODELLING WITH THE STOCHASTIC MOVEMENT DATASET
The authors use VideoFlow to model the Stochastic Movement Dataset used in (Babaeizadeh et al, 2017).
- Since the shape moves with uniform speed, the authors should be able to model the position of the shape at the (t+1)-th step using only its position at the t-th step.
- Using this insight, the authors extract random temporal patches of 2 frames from each video of 3 frames (see the sketch after this list).
- On generating videos conditioned on the first frame, the authors observe that the model consistently predicts the shape's future trajectory to lie along one of the eight random directions.
- The authors inform the rater that a "real" trajectory is one in which the shape is consistent in color and congruent.
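As a rough illustration of the 2-frame patch extraction described above, the following hypothetical sketch samples a random window of two consecutive frames from each 3-frame video; the function name and array layout are assumptions, not the authors' code:

```python
# Hypothetical sketch of the 2-frame temporal patch extraction described
# above. `videos` is assumed to have shape (num_videos, num_frames, H, W, C)
# with num_frames = 3.
import numpy as np

def random_temporal_patches(videos, patch_len=2, rng=None):
    """Sample one random window of `patch_len` consecutive frames per video."""
    rng = rng or np.random.default_rng()
    num_videos, num_frames = videos.shape[0], videos.shape[1]
    starts = rng.integers(0, num_frames - patch_len + 1, size=num_videos)
    # Stack the sampled windows back into a single batch:
    # shape (num_videos, patch_len, H, W, C).
    return np.stack([v[s:s + patch_len] for v, s in zip(videos, starts)])

# Example: 100 videos of 3 frames at 64x64x3 -> patches of 2 frames.
patches = random_temporal_patches(np.zeros((100, 3, 64, 64, 3)), patch_len=2)
assert patches.shape == (100, 2, 64, 64, 3)
```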
Conclusion
- The authors describe a practically applicable architecture for flow-based video prediction models, which they call VideoFlow, inspired by the Glow model for image generation (Kingma & Dhariwal, 2018).
- The authors introduce a latent dynamical system model that predicts future values of the flow model's latent state, replacing the standard unconditional prior distribution (sketched in the equation after this list).
- The authors' empirical results show that VideoFlow achieves results that are competitive with the state-of-the-art VAE models in stochastic video prediction.
- The authors plan to incorporate memory in VideoFlow to model arbitrarily long-range dependencies and to apply the model to challenging downstream tasks.
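In equation form, the latent dynamical model replaces the unconditional prior over per-frame latents with an autoregressive one. A plausible Gaussian parameterization consistent with the description above (the exact networks for the mean and scale are specified in the paper) is:

```latex
% Autoregressive latent prior over per-frame latents z_1..z_T, replacing
% the unconditional prior p(z); the Gaussian form below is a sketch.
p_\theta(\mathbf{z}_{1:T}) = \prod_{t=1}^{T} p_\theta(\mathbf{z}_t \mid \mathbf{z}_{<t}),
\qquad
p_\theta(\mathbf{z}_t \mid \mathbf{z}_{<t})
  = \mathcal{N}\!\big(\mathbf{z}_t;\, \mu_\theta(\mathbf{z}_{<t}),\, \sigma_\theta(\mathbf{z}_{<t})^2\big)
```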
Tables
- Table1: We compare the realism of the generated trajectories using a real-vs-fake two-alternative forced choice (2AFC) test on Amazon Mechanical Turk, against SAVP-VAE and SV2P.
- Table2: Left: We report the average bits-per-pixel across 10 target frames with 3 conditioning frames for the BAIR action-free dataset (the bits-per-pixel definition is sketched after this list).
- Table3: Fréchet Video Distance. We report the mean and standard deviation across 5 runs for 3 different frame settings. Results are not directly comparable across models due to differences in the total number of frames seen during training and in the number of conditioning frames.
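For reference, bits-per-pixel is the negative log-likelihood in base 2 normalized by the number of predicted dimensions; a standard definition consistent with Table 2 (with $T$ target frames of size $H \times W \times C$) is:

```latex
% Standard bits-per-pixel definition: base-2 negative log-likelihood
% normalized by the number of predicted dimensions.
\text{bits-per-pixel}
  = \frac{-\log_2 p_\theta(\mathbf{x}_{1:T})}{T \cdot H \cdot W \cdot C}
```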
Related work
- Early work on prediction of future video frames focused on deterministic predictive models (Ranzato et al, 2014; Srivastava et al, 2015; Vondrick et al, 2015; Xingjian et al, 2015; Boots et al, 2014). Much of this research on deterministic models focused on architectural changes, such as predicting high-level structure (Villegas et al, 2017b), energy-based models (Xie et al, 2017), generative cooperative nets (Xie et al, 2020), alternating back-propagation through time (Xie et al, 2019), incorporating pixel transformations (Finn et al, 2016; De Brabandere et al, 2016; Liu et al, 2017) and predictive coding architectures (Lotter et al, 2017), as well as different generation objectives (Mathieu et al, 2016; Vondrick & Torralba, 2017; Walker et al, 2015) and disentangled representations (Villegas et al, 2017a; Denton & Birodkar, 2017). With models that can successfully handle many deterministic environments, the next key challenge is to address stochastic environments by building models that can effectively reason over uncertain futures. Real-world videos are always somewhat stochastic, either due to events that are inherently random, or due to events caused by unobserved or partially observable factors, such as off-screen events, humans and animals with unknown intentions, and objects with unknown physical properties. In such cases, since deterministic models can only generate one future, they either disregard potential futures or produce blurry predictions that are superpositions or averages of possible futures.
A variety of methods have sought to overcome this challenge by incorporating stochasticity, via three types of approaches: models based on variational auto-encoders (VAEs) (Kingma & Welling, 2013; Rezende et al, 2014), generative adversarial networks (Goodfellow et al, 2014), and autoregressive models (Hochreiter & Schmidhuber, 1997; Graves, 2013; van den Oord et al, 2016b;c; Van Den Oord et al, 2016).
Reference
- Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
- Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179, 2015.
- Byron Boots, Arunkumar Byravan, and Dieter Fox. Learning predictive models of a depth camera & manipulator from raw execution traces. In International Conference on Robotics and Automation (ICRA), 2014.
- Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Neural Information Processing Systems (NIPS), 2016.
- Gustavo Deco and Wilfried Brauer. Higher order statistical decorrelation without information loss. Advances in Neural Information Processing Systems, pp. 247–254, 1995.
- Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.
- Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.
- Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
- Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
- Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pp. 658–666, 2016.
- Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268, 2017.
- Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In International Conference on Robotics and Automation (ICRA), 2017.
- Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, 2016.
- Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2211–2221, 2017.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
- Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637, 2017.
- Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
- Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
- Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. International Conference on Machine Learning (ICML), 2017.
- Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations, 2013.
- Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236–10245, 2018.
- Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106–1114, 2012.
- Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
- Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Flow-grounded spatial-temporal video prediction from still images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 600–615, 2018.
- Ziwei Liu, Raymond Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. International Conference on Computer Vision (ICCV), 2017.
- William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. International Conference on Learning Representations (ICLR), 2017.
- Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (ICLR), 2016.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Ryan Prenger, Rafael Valle, and Bryan Catanzaro. Waveglow: A flow-based generative network for speech synthesis. CoRR, abs/1811.00002, 2018. URL http://arxiv.org/abs/1811.00002.
- Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A Hasegawa-Johnson, Roy H Campbell, and Thomas S Huang. Fast generation for convolutional autoregressive models. arXiv preprint arXiv:1704.06001, 2017.
- Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
- Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Dan Belov, and Nando de Freitas. Parallel multiscale autoregressive density estimation. arXiv preprint arXiv:1703.03664, 2017.
- Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1530–1538, 2015.
- Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1278–1286, 2014.
- David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 2015.
- Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges, 2018.
- Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016a.
- Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016b.
- Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328, 2016c.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
- Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. Tensor2tensor for neural machine translation. arXiv preprint arXiv:1803.07416, 2018.
- Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017a.
- Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to generate long-term future via hierarchical prediction. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3560–3569. JMLR.org, 2017b.
- Carl Vondrick and Antonio Torralba. Generating the future with adversarial transformers. In Computer Vision and Pattern Recognition (CVPR), 2017.
- Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023, 2015.
- Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In International Conference on Computer Vision (ICCV), 2015.
- Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 2004.
- Jianwen Xie, Song-Chun Zhu, and Ying Nian Wu. Synthesizing dynamic patterns by spatial-temporal generative convnet. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- Jianwen Xie, Ruiqi Gao, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu. Learning dynamic generator model by alternating back-propagation through time. Proceedings of the AAAI Conference on Artificial Intelligence, 33:5498–5507, Jul 2019. ISSN 2159-5399. doi: 10.1609/aaai.v33i01.33015498. URL http://dx.doi.org/10.1609/aaai.v33i01.33015498.
- Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor and generator networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):27–45, Jan 2020. ISSN 1939-3539. doi: 10.1109/tpami.2018.2879081. URL http://dx.doi.org/10.1109/TPAMI.2018.2879081.
- SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, 2015.
- Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, 2016.
- Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint, 2018.