VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation

Manoj Kumar
Mohammad Babaeizadeh
Durk Kingma

ICLR, 2020.


Abstract

Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally, as in the case of pixel-level autoregressive models, or do not directly optimize the likelihood of the data. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modeling of video.
Highlights
  • Exponential progress in the capabilities of computational hardware, paired with a relentless effort towards greater insights and better methods, has pushed the field of machine learning from relative obscurity into the mainstream
  • We study the problem of stochastic prediction, focusing on the case of conditional video prediction: synthesizing raw RGB video frames conditioned on a short context
  • We further describe a practically applicable architecture for flow-based video prediction models, inspired by the Glow model for image generation (Kingma & Dhariwal, 2018), which we call VideoFlow
  • We applied low-temperature sampling to the latent Gaussian priors of SV2P and SAVP-VAE and empirically found that it hurts performance. We report these results in Figure 12. For SAVP-VAE, we notice that the hyperparameters that perform best on these metrics are the ones whose generated videos have disappearing arms
  • Fréchet Video Distance (FVD): We evaluate VideoFlow using the recently proposed Fréchet Video Distance (FVD) metric (Unterthiner et al., 2018), an adaptation of the Fréchet Inception Distance (FID) metric (Heusel et al., 2017) for video generation. Unterthiner et al. (2018) report results with models trained on a total of 16 frames with 2 conditioning frames, while we train our VideoFlow model on a total of 13 frames with 3 conditioning frames, so the numbers are not directly comparable; a sketch of the underlying distance computation follows this list
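The FVD highlight above rests on a Fréchet distance between feature statistics. As a minimal sketch (not the paper's released code): fit a Gaussian to feature vectors extracted from real and generated videos with a pretrained network (an I3D network in the case of FVD), then compare the two Gaussians in closed form. The arrays real_feats and fake_feats of shape (num_videos, feature_dim) are hypothetical inputs.

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(real_feats, fake_feats):
        # d^2 = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))
        mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
        cov_r = np.cov(real_feats, rowvar=False)
        cov_f = np.cov(fake_feats, rowvar=False)
        covmean = sqrtm(cov_r @ cov_f)
        if np.iscomplexobj(covmean):
            covmean = covmean.real  # discard numerical imaginary noise
        return float(np.sum((mu_r - mu_f) ** 2)
                     + np.trace(cov_r + cov_f - 2 * covmean))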
Summary
  • Exponential progress in the capabilities of computational hardware, paired with a relentless effort towards greater insights and better methods, has pushed the field of machine learning from relative obscurity into the mainstream.
  • Flow-based generative models (Dinh et al., 2014; 2016) have a unique set of advantages: exact latent-variable inference, exact log-likelihood evaluation, and parallel sampling; a toy coupling-layer sketch after this list illustrates the exact-likelihood property.
  • We use the multi-scale architecture described above to infer the set of corresponding latent variables for each individual frame of the video: {z^(l)}_{l=1}^{L} = f_θ(x); see Figure 1 for an illustration.
  • Note that in our architecture we have chosen to let the prior pθ(z), as described in Eq. (5), model temporal dependencies in the data, while constraining the flow gθ to act on separate frames of the video; a sketch of rolling out such a conditional prior appears after this list.
  • On generating videos conditioned on the first frame, we observe that the model consistently predicts the future trajectory of the shape to be one of the eight random directions.
  • For a given set of conditioning frames in the BAIR action-free test-set, we generate 100 videos from each of the stochastic models.
  • We can remove pixel-level noise in our VideoFlow model, resulting in higher-quality videos at the cost of diversity, by sampling videos at a lower temperature, analogous to the procedure in (Kingma & Dhariwal, 2018).
  • For a network trained with additive coupling layers, we can sample the t-th frame x_t from P with a temperature T by scaling the standard deviation of the latent Gaussian distribution by a factor of T; see the sampling sketch after this list.
  • BAIR robot pushing dataset: We encode the first input frame and the last target frame into the latent space using our trained VideoFlow encoder and perform interpolations; a minimal interpolation sketch follows this list.
  • We use our trained VideoFlow model, conditioned on 3 frames as explained in Section 5.2, to detect how plausible it is for a temporally inconsistent frame to occur in the immediate future; a likelihood-scoring sketch follows this list.
  • We open-source various components of our trained VideoFlow model as reusable TFHub modules, to evaluate log-likelihood, generate frames, and compute latent codes
  • We describe a practically applicable architecture for flow-based video prediction models, inspired by the Glow model for image generation (Kingma & Dhariwal, 2018), which we call VideoFlow.
  • We introduce a latent dynamical system model that predicts future values of the flow model's latent state, replacing the standard unconditional prior distribution.
  • Our model directly optimizes the log-likelihood, making it easy to evaluate, and achieves faster synthesis than pixel-level autoregressive video models, which makes it suitable for practical use.
  • We plan to incorporate memory in VideoFlow to model arbitrarily long-range dependencies and apply the model to challenging downstream tasks.
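To make the exact-likelihood property in the bullets above concrete, here is a toy affine coupling layer in NumPy: the transform is invertible in closed form and its log-determinant is cheap to accumulate, which is what lets flows optimize log p(x) directly. The conditioner net is a hypothetical stand-in for the deep networks in the actual Glow-style VideoFlow blocks, and the standard-normal prior stands in for VideoFlow's learned temporal prior.

    import numpy as np

    def net(x_a):
        # Hypothetical conditioner; in a trained model this is a deep CNN.
        return np.tanh(x_a), np.tanh(x_a)  # (log_scale, shift)

    def coupling_forward(x):
        x_a, x_b = np.split(x, 2, axis=-1)     # partition the dimensions
        log_scale, shift = net(x_a)
        z_b = x_b * np.exp(log_scale) + shift  # transform one half
        logdet = log_scale.sum(axis=-1)        # cheap log-determinant
        return np.concatenate([x_a, z_b], axis=-1), logdet

    def coupling_inverse(z):
        z_a, z_b = np.split(z, 2, axis=-1)     # z_a passes through unchanged
        log_scale, shift = net(z_a)
        x_b = (z_b - shift) * np.exp(-log_scale)  # exact closed-form inverse
        return np.concatenate([z_a, x_b], axis=-1)

    def log_likelihood(x):
        # log p(x) = log p(z) + log |det df/dx|
        z, logdet = coupling_forward(x)
        log_prior = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum(axis=-1)
        return log_prior + logdet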
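The latent dynamical prior mentioned above replaces the fixed N(0, I) prior with a Gaussian whose parameters are predicted from past latents, p(z_t | z_<t). A minimal rollout sketch, where prior_net is a hypothetical stand-in for the learned temporal network:

    import numpy as np

    def rollout_prior(z_context, prior_net, num_future, rng=None):
        rng = rng or np.random.default_rng()
        latents = list(z_context)           # latents of conditioning frames
        for _ in range(num_future):
            mu, sigma = prior_net(latents)  # condition on all past latents
            latents.append(mu + sigma * rng.standard_normal(mu.shape))
        return latents  # decode each z_t with the inverse flow to get frames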
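The temperature trick from the bullets above is essentially a one-liner: scale the standard deviation of the latent Gaussian by T < 1 before sampling, then decode through the inverse flow. Here mu, sigma, and inverse_flow are hypothetical handles to the trained prior and decoder, not names from the released code.

    import numpy as np

    def sample_frame(mu, sigma, inverse_flow, temperature=0.5, rng=None):
        rng = rng or np.random.default_rng()
        z = mu + temperature * sigma * rng.standard_normal(mu.shape)
        return inverse_flow(z)  # decoding is exact and parallel across pixels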
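Latent interpolation as in the BAIR bullet above reduces to linear blending of latent codes, assuming hypothetical encode/decode handles for f_θ and its inverse:

    import numpy as np

    def interpolate(x_start, x_end, encode, decode, steps=8):
        z0, z1 = encode(x_start), encode(x_end)
        alphas = np.linspace(0.0, 1.0, steps)
        return [decode((1 - a) * z0 + a * z1) for a in alphas]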
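Finally, the plausibility check in the bullets above can be sketched as likelihood scoring: convert the model's exact log-likelihood of a candidate next frame into bits-per-pixel and flag outliers. log_prob is a hypothetical callable returning log p(x_t | context) in nats, and the threshold would be tuned on held-out data.

    import numpy as np

    def bits_per_pixel(log_prob_nats, num_pixels):
        return -log_prob_nats / (num_pixels * np.log(2.0))

    def flag_inconsistent(candidates, context, log_prob, num_pixels, threshold):
        scores = [bits_per_pixel(log_prob(x, context), num_pixels)
                  for x in candidates]
        return [s > threshold for s in scores]  # high bpp = unlikely future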
Tables
  • Table 1: We compare the realism of the generated trajectories against SAVP-VAE and SV2P using a real-vs-fake two-alternative forced choice (2AFC) test on Amazon Mechanical Turk
  • Table 2: Left: We report the average bits-per-pixel across 10 target frames with 3 conditioning frames for the BAIR action-free dataset
  • Table 3: Fréchet Video Distance. We report the mean and standard deviation across 5 runs for 3 different frame settings. Results are not directly comparable across models due to differences in the total number of frames seen during training and the number of conditioning frames
Related work
  • Early work on prediction of future video frames focused on deterministic predictive models (Ranzato et al., 2014; Srivastava et al., 2015; Vondrick et al., 2015; Xingjian et al., 2015; Boots et al., 2014). Much of this research on deterministic models focused on architectural changes, such as predicting high-level structure (Villegas et al., 2017b), energy-based models (Xie et al., 2017), generative cooperative nets (Xie et al., 2020), alternating back-propagation through time (ABPTT) (Xie et al., 2019), incorporating pixel transformations (Finn et al., 2016; De Brabandere et al., 2016; Liu et al., 2017) and predictive coding architectures (Lotter et al., 2017), as well as different generation objectives (Mathieu et al., 2016; Vondrick & Torralba, 2017; Walker et al., 2015) and disentangling representations (Villegas et al., 2017a; Denton & Birodkar, 2017). With models that can successfully model many deterministic environments, the next key challenge is to address stochastic environments by building models that can effectively reason over uncertain futures. Real-world videos are always somewhat stochastic, either due to events that are inherently random, or events that are caused by unobserved or partially observable factors, such as off-screen events, humans and animals with unknown intentions, and objects with unknown physical properties. In such cases, since deterministic models can only generate one future, these models either disregard potential futures or produce blurry predictions that are superpositions or averages of possible futures.