Stochastic Variational Video Prediction

International Conference on Learning Representations (ICLR), 2018.


Abstract:

Predicting the future in real-world settings, particularly from raw sensory observations such as images, is exceptionally challenging. Real-world events can be stochastic and unpredictable, and the high dimensionality and complexity of natural images require the predictive model to build an intricate understanding of the natural world. […]

Introduction
  • Understanding the interaction dynamics of objects and predicting what happens next is a key human capability, one that people rely on heavily to make decisions in everyday life (Bubic et al., 2010).
  • A model that can accurately predict future observations of complex sensory modalities such as vision must internally represent the complex dynamics of real-world objects and people, and is therefore more likely to acquire a representation that can be used for a variety of visual perception tasks, such as object tracking and action recognition (Srivastava et al., 2015; Lotter et al., 2017; Denton & Birodkar, 2017).
  • Such models can also be directly useful in themselves, for example, to allow an autonomous agent or robot to decide how to interact with the world to bring about a desired outcome (Oh et al., 2015; Finn & Levine, 2017).
  • Deterministic models trained with a mean squared error loss generate the expected value over all possibilities for each pixel independently, which is inherently blurry (Mathieu et al., 2016), as the sketch below illustrates.
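
To see why a mean squared error objective yields blurry outputs, note that the minimizer of E[(x − x̂)²] is the mean E[x], so when two futures are equally likely, the optimal deterministic prediction is their average. A minimal numpy illustration (not from the paper; the bimodal single-pixel setup is an assumed toy example):

```python
import numpy as np

# Two equally likely "futures" for a single pixel: it turns on (1.0) or
# stays off (0.0). A deterministic model trained with mean squared error
# must output one value x_hat that minimizes E[(x - x_hat)^2].
futures = np.array([0.0, 1.0])

candidates = np.linspace(0.0, 1.0, 101)
mse = np.array([np.mean((futures - c) ** 2) for c in candidates])

best = candidates[np.argmin(mse)]
print(best)  # 0.5 -- the average of the two futures, i.e. a "blurry" pixel
```

A stochastic model with a latent variable can instead commit to one of the two futures per sample, which is why latent-variable prediction avoids this averaging effect.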
Highlights
  • Since the aim of our method is to recover latent variables that correspond to events which might explain the variability in the videos, we found it crucial to condition the inference network on future frames (see the sketch after this list).
  • We proposed stochastic variational video prediction (SV2P), an approach for multi-step video prediction based on variational inference.
  • Our primary contributions include an effective stochastic prediction method with latent variables, a network architecture that succeeds on natural videos, and a training procedure that provides for stable optimization.
  • We evaluated our proposed method on three real-world datasets in action-conditioned and action-free settings, as well as one toy dataset carefully designed to highlight the importance of stochasticity in video prediction.
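
As a concrete (and heavily simplified) picture of this setup, the toy numpy sketch below wires together the pieces: an inference network q(z | x_{0:T}) that is conditioned on all frames including the future ones, a reparameterized sample of z, and a reconstruction-plus-KL training objective. All shapes, the stand-in "networks", and their toy statistics are illustrative assumptions, not the paper's convolutional architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

T, H, W = 10, 8, 8   # toy video: T frames of H x W pixels (assumed sizes)
Z = 4                # latent dimension (assumed)
C = 2                # number of context frames, as in the datasets above
video = rng.random((T, H, W))

def inference_net(frames):
    """q(z | x_{0:T}): conditioned on ALL frames, including future ones,
    so z can capture the events that explain their variability."""
    feats = frames.reshape(-1)            # stand-in for a conv encoder
    mu = np.tanh(feats[:Z])               # toy posterior mean
    log_var = -np.abs(feats[Z:2 * Z])     # toy posterior log-variance
    return mu, log_var

def generative_net(context, z):
    """p(x_t | x_{0:t-1}, z): stand-in for the frame predictor."""
    return np.clip(context[-1] + 0.1 * z.mean(), 0.0, 1.0)

mu, log_var = inference_net(video)
z = mu + np.exp(0.5 * log_var) * rng.standard_normal(Z)  # reparameterization

# Negative ELBO = reconstruction error + KL(q(z|x_{0:T}) || N(0, I));
# gradient descent on this would train both networks jointly.
recon = sum(np.sum((video[t] - generative_net(video[:t], z)) ** 2)
            for t in range(C, T))
kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
loss = recon + kl
print(f"toy negative ELBO: {loss:.2f}")
```

At test time the future frames are not available, so z would be sampled from the prior rather than from the inference network.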
Methods
  • To evaluate SV2P, the authors test it on three real-world video datasets, comparing it to the CDNA model (Finn et al., 2016) as a deterministic baseline, as well as a baseline that outputs the last seen frame as the prediction.
  • An interesting property of this dataset is that the arm movements are quite unpredictable in the absence of actions (in contrast to the robot pushing dataset (Finn et al., 2016), in which the arm moves toward the center of the bin).
  • For this dataset, the authors train the models to predict ten frames given the first two, in both action-conditioned and action-free settings.
Conclusion
  • The authors proposed stochastic variational video prediction (SV2P), an approach for multi-step video prediction based on variational inference.
  • The authors evaluated the proposed method on three real-world datasets in action-conditioned and action-free settings, as well as one toy dataset carefully designed to highlight the importance of stochasticity in video prediction.
  • Both qualitative and quantitative results indicate higher-quality predictions compared to other deterministic and stochastic baselines.
  • As future work, the authors suggest inferring the latent variables at each time step; this would allow time-variant latent distributions, which is more aligned with generative neural models for time series (Johnson et al., 2016; Gao et al., 2016; Krishnan et al., 2017).
Summary
  • Objectives:

    Since the goal is to perform conditional video prediction, the predictions are conditioned on a set of c context frames x_0, ..., x_{c-1}, and the model samples from p(x_{c:T} | x_{0:c-1}), where x_i denotes the i-th frame of the video (Figure 2).
  • The authors aim to understand whether the range of possible futures captured by the stochastic model includes the true future.
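
In the notation above, training maximizes a variational lower bound of the standard form below. This is a generic conditional-VAE ELBO consistent with the setup described here; the paper's exact conditioning and time-varying details may differ:

```latex
\log p(x_{c:T} \mid x_{0:c-1})
  \;\ge\; \mathbb{E}_{q_\phi(z \mid x_{0:T})}\!\Big[ \textstyle\sum_{t=c}^{T} \log p_\theta\big(x_t \mid x_{0:t-1}, z\big) \Big]
  \;-\; D_{\mathrm{KL}}\!\big( q_\phi(z \mid x_{0:T}) \,\big\|\, p(z) \big)
```

At test time, future frames are unavailable, so z is sampled from the assumed prior p(z) instead of from the inference network q_\phi.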
Tables
  • Table 1: Hyper-parameters used for experiments
Related work
  • A number of prior works have addressed video frame prediction while assuming deterministic environments (Ranzato et al., 2014; Srivastava et al., 2015; Vondrick et al., 2015; Xingjian et al., 2015; Boots et al., 2014; Lotter et al., 2017). In this work, we build on the deterministic video prediction model proposed by Finn et al. (2016), which generates future frames by predicting the motion flow of dynamically masked objects extracted from the previous frames. Similar transformation-based models were also proposed by De Brabandere et al. (2016) and Liu et al. (2017). Prior work has also considered alternative objectives for deterministic video prediction models to mitigate the blurriness of the predicted frames and produce sharper predictions (Mathieu et al., 2016; Vondrick & Torralba, 2017). Despite the adversarial objective, Mathieu et al. (2016) found that injecting noise did not lead to stochastic predictions, even for predicting a single frame. Oh et al. (2015) and Chiappa et al. (2017) make sharp video predictions by assuming deterministic outcomes in video games given the actions of the agents; however, this assumption does not hold in real-world settings, which almost always have stochastic dynamics.
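
To make the transformation-based idea concrete, the sketch below applies a set of predicted convolution kernels to the previous frame and composites the results with per-pixel masks, in the spirit of CDNA-style predictors. All sizes, the random "network outputs", and the compositing details are illustrative assumptions, not the actual model:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
H, W, K = 16, 16, 4          # frame size and number of kernels (assumed)

prev_frame = rng.random((H, W))

# Stand-ins for network outputs: K small transformation kernels and K masks.
kernels = rng.random((K, 5, 5))
kernels /= kernels.sum(axis=(1, 2), keepdims=True)   # normalized, flow-like
masks = rng.random((K, H, W))
masks /= masks.sum(axis=0, keepdims=True)            # softmax-like over objects

# Each kernel transforms the whole previous frame; masks composite the results
# so different regions (objects) can move differently.
transformed = np.stack([convolve2d(prev_frame, k, mode="same") for k in kernels])
next_frame = (masks * transformed).sum(axis=0)
print(next_frame.shape)  # (16, 16)
```

Because the model moves pixels from the previous frame rather than synthesizing them from scratch, it tends to preserve texture and produce sharper frames than direct pixel regression.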
Funding
  • This material is based upon work supported by the National Science Foundation under award no. 1725729, and was partially carried out while one of the authors was interning at Google Brain.
Study subjects and analysis
real-world video datasets: 3
Our SV2P implementation will be open sourced upon publication. To evaluate SV2P, we test it on three real-world video datasets, comparing it to the CDNA model (Finn et al., 2016) as a deterministic baseline, as well as a baseline that outputs the last seen frame as the prediction. We also compare SV2P with an auto-regressive stochastic model, video pixel networks (VPN) (Kalchbrenner et al., 2017).

samples: 100
We sample 100 videos and show the result of the sample with the highest PSNR. For a fair comparison to VPN, we use the same best-out-of-100-samples protocol for our stochastic baseline. Since even the fast implementation of VPN is quite slow, we limit the comparison with VPN to only the last dataset, with 256 test samples.
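
The best-of-n protocol is easy to state precisely; below is a minimal sketch, assuming frames scaled to [0, 1] (the `sample_fn` wrapper and the dummy predictor are hypothetical stand-ins for the model's sampler):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two images/videos in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def best_of_n_psnr(sample_fn, target, n=100):
    """Evaluate a stochastic predictor by its best sample out of n,
    mirroring the best-of-100 protocol described above."""
    return max(psnr(sample_fn(), target) for _ in range(n))

# Usage with a dummy predictor (sample_fn would wrap the trained model):
rng = np.random.default_rng(0)
target = rng.random((10, 8, 8))                  # toy 10-frame video
sample_fn = lambda: np.clip(
    target + 0.1 * rng.standard_normal(target.shape), 0.0, 1.0)
print(f"best-of-100 PSNR: {best_of_n_psnr(sample_fn, target):.2f} dB")
```

This protocol rewards a stochastic model whose set of sampled futures contains one close to the true future, which is exactly the property the paper sets out to evaluate.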

samples: 100
The videos of these experiments can be found at the project website (https://goo.gl/iywUHc). For stochastic methods, we show the best (highest PSNR) and worst (lowest PSNR) predictions out of 100 samples (as discussed in Section 5.2), as well as two random predicted videos from our model. Figure 8 illustrates two examples from the BAIR robot pushing dataset in the action-free setting.

random samples: 100
Prediction results on the action-free Human3.6M dataset: SV2P predicts a different outcome for each sample of the latent variable. In the left example, the model predicts both walking and stopping, which results in different predicted future frames; similarly, the right example demonstrates various outcomes, including spinning. Comparing video pixel networks (VPN) (Kalchbrenner et al., 2017; Reed et al., 2017) with SV2P on the robotic pushing dataset, we use the same best PSNR out of 100 random samples for both methods. Besides stochastic movements of the pushed objects, another source of stochasticity is the starting lag in the movements of the robotic arm. SV2P generates sharper images compared to Finn et al. (2016) (notice the pushed objects in the zoomed images), with less noise compared to Reed et al. (2017) (note the accumulated noise in later frames).

Reference
  • Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.
  • Byron Boots, Arunkumar Byravan, and Dieter Fox. Learning predictive models of a depth camera & manipulator from raw execution traces. In International Conference on Robotics and Automation (ICRA), 2014.
  • Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), Berlin, Germany, pp. 10–21, 2016. URL http://aclweb.org/anthology/K/K16/K16-1002.pdf.
  • Andreja Bubic, D. Yves von Cramon, and Ricarda I. Schubotz. Prediction, cognition and the brain. Frontiers in Human Neuroscience, 4, 2010.
  • Baoyang Chen, Wenmin Wang, Jinzhuo Wang, Xiongtao Chen, and Weimian Li. Video imagination from a single image with transformation generation. arXiv preprint arXiv:1706.04124, 2017.