# Denoising Diffusion Implicit Models

International Conference on Learning Representations, 2020.

Abstract:

Denoising diffusion probabilistic models (DDPMs) have achieved high-quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilist...

Introduction

- Deep generative models have demonstrated the ability to produce high quality samples in many domains (Karras et al, 2020; van den Oord et al, 2016a).
- Recent works on iterative generative models (Bengio et al, 2014), such as denoising diffusion probabilistic models (DDPM, Ho et al (2020)) and noise conditional score networks (NCSN, Song & Ermon (2019)), have demonstrated the ability to produce samples comparable to those of GANs, without having to perform adversarial training.
- Samples are produced by a Markov chain which, starting from white noise, progressively denoises it into an image.
- This generative Markov chain process is either based on Langevin dynamics (Song & Ermon, 2019) or obtained by reversing a forward diffusion process that progressively turns an image into noise (Sohl-Dickstein et al, 2015).
- The parameters θ are learned to fit the data distribution q(x_0) by maximizing a variational lower bound over θ.
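
For orientation, the variational lower bound mentioned in the last bullet can be written in the standard latent-variable form below. This is a sketch under the usual assumptions for this model family: x_{1:T} are latent variables, p_θ(x_{0:T}) is the generative chain, and q(x_{1:T} | x_0) is the fixed inference process. The notation is assumed here and is not taken from the summary text.

```latex
\mathbb{E}_{q(x_0)}\big[\log p_\theta(x_0)\big]
\;\ge\;
\mathbb{E}_{q(x_0, x_1, \ldots, x_T)}\big[\log p_\theta(x_{0:T}) - \log q(x_{1:T} \mid x_0)\big],
\qquad
p_\theta(x_0) = \int p_\theta(x_{0:T}) \, \mathrm{d}x_{1:T}
```

Training maximizes the right-hand side with respect to θ, which is tractable because q is fixed and the expectation can be estimated by sampling.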

Highlights

- Deep generative models have demonstrated the ability to produce high-quality samples in many domains (Karras et al, 2020; van den Oord et al, 2016a).
- It takes around 20 hours to sample 50k images of size 32 × 32 from a DDPM, but less than a minute to do so from a GAN on an Nvidia 2080 Ti GPU. This becomes more problematic for larger images, as sampling 50k images of size 256 × 256 could take nearly 1000 hours on the same GPU. To close this efficiency gap between DDPMs and GANs, we present denoising diffusion implicit models (DDIMs).
- We show that only slight changes to the updates in Eq (12) are needed to obtain the new, faster generative processes, which apply to DDPM, DDIM, and all generative processes considered in Eq (10).
- We have presented DDIMs – an implicit generative model trained with denoising auto-encoding / score matching objectives – from a purely variational perspective.
- Since the sampling procedure of DDIMs is similar to that of a neural ODE, it would be interesting to see if methods that decrease the discretization error in ODEs, including multistep methods such as Adams-Bashforth (Butcher & Goodwin, 2008), could be helpful for further improving sample quality in fewer steps (Queiruga et al, 2020).
- When the length of the sampling trajectory is much smaller than T, we may achieve significant increases in computational efficiency due to the iterative nature of the sampling process.
- It is relevant to investigate whether DDIMs exhibit other properties of existing implicit models (Bau et al, 2019).
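
The "slight changes to the updates" mentioned above can be illustrated with a minimal NumPy sketch of one generalized sampling update. It assumes the commonly cited form of the DDIM update, in which `alpha_t` and `alpha_prev` are cumulative noise-schedule products and `sigma_t` controls stochasticity (`sigma_t = 0` gives the deterministic DDIM case). The function name `ddim_step` and its arguments are hypothetical, and `eps_pred` stands in for the output of a trained noise-prediction network.

```python
import numpy as np


def ddim_step(x_t, eps_pred, alpha_t, alpha_prev, sigma_t, rng):
    """One generalized update x_t -> x_{t-1} (illustrative sketch).

    alpha_t, alpha_prev: cumulative schedule values at the current and
    previous timestep (assumption: DDIM-style notation, alpha in (0, 1]).
    sigma_t: noise scale; 0 makes the step deterministic.
    """
    # Clean-image estimate implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
    # Deterministic component pointing back toward x_t.
    direction = np.sqrt(1.0 - alpha_prev - sigma_t**2) * eps_pred
    # Fresh Gaussian noise, scaled by sigma_t.
    noise = sigma_t * rng.standard_normal(x_t.shape)
    return np.sqrt(alpha_prev) * x0_pred + direction + noise
```

With a perfect noise estimate and `sigma_t = 0`, the step maps a noisy sample at level `alpha_t` exactly onto the same sample at level `alpha_prev`, which is a convenient sanity check for an implementation.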

Methods

- The authors show that DDIMs outperform DDPMs in terms of image generation when fewer iterations are considered, giving speed ups of 10× to 100× over the original DDPM generation process.
- The authors use the same trained model with T = 1000 and the objective being Lγ from Eq (5) with γ = 1; as the authors argued in Section 3, no changes are needed with regards to the training procedure.
- The authors consider different sub-sequences τ of [1, ..., T].
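
As a rough illustration of what choosing a sub-sequence τ can look like, the sketch below builds S increasing timesteps out of T. The two spacing rules (evenly spaced and quadratically spaced) and the helper name `make_subsequence` are assumptions made for this example; the summary above does not state how τ is chosen.

```python
import numpy as np


def make_subsequence(T, S, kind="linear"):
    """Return an increasing array of at most S timesteps in [1, T].

    kind="linear":    evenly spaced steps.
    kind="quadratic": steps concentrated near t = 1.
    Both rules are illustrative assumptions.
    """
    i = np.arange(1, S + 1)
    if kind == "linear":
        tau = (i * T) // S
    elif kind == "quadratic":
        tau = (i * i * T) // (S * S)
    else:
        raise ValueError(f"unknown kind: {kind}")
    # Clip to the valid range and drop duplicates produced by rounding,
    # so the result may contain fewer than S entries.
    return np.unique(np.clip(tau, 1, T))


tau_linear = make_subsequence(1000, 10)
tau_quadratic = make_subsequence(1000, 10, kind="quadratic")
```

Sampling then visits only the timesteps in τ (in reverse order), so the number of network evaluations is dim(τ) rather than T.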

Results

- When the length of the sampling trajectory is much smaller than T, the authors may achieve significant increases in computational efficiency due to the iterative nature of the sampling process.
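
A back-of-the-envelope calculation shows where the efficiency gain comes from. The sketch assumes that sampling cost scales linearly with the number of steps and that the roughly 20 hours quoted for 50k CIFAR10-sized images corresponds to T = 1000 steps. Both are simplifying assumptions, and `hours_for_steps` is a hypothetical helper.

```python
def hours_for_steps(base_hours, base_steps, steps):
    """Scale a measured sampling time linearly with the number of steps."""
    return base_hours * steps / base_steps


# Projected time for 50k images at shorter trajectories (assumed baseline:
# about 20 hours at 1000 steps on one GPU).
for steps in (1000, 100, 50, 10):
    print(steps, "steps ->", hours_for_steps(20.0, 1000, steps), "hours")
```

Under these assumptions, 100 steps corresponds to a 10× speed-up and 10 steps to a 100× speed-up, which matches the range quoted in the Methods section.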

Conclusion

- The authors have presented DDIMs – an implicit generative model trained with denoising auto-encoding / score matching objectives – from a purely variational perspective.
- Since the sampling procedure of DDIMs is similar to that of a neural ODE, it would be interesting to see if methods that decrease the discretization error in ODEs, including multistep methods such as Adams-Bashforth (Butcher & Goodwin, 2008), could be helpful for further improving sample quality in fewer steps (Queiruga et al, 2020).
- It is relevant to investigate whether DDIMs exhibit other properties of existing implicit models (Bau et al, 2019).
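
To make the multistep idea concrete, here is a toy sketch of the two-step Adams-Bashforth method next to plain Euler integration on a scalar ODE. It is not the DDIM sampler and not a method evaluated in the paper; the function names and the test equation dy/dt = −y are illustrative assumptions. The point is that the multistep rule reuses the previous derivative evaluation, so it reduces discretization error without extra function (or network) calls per step.

```python
import math


def euler(f, y0, t0, t1, n):
    """Integrate dy/dt = f(t, y) from t0 to t1 with n explicit Euler steps."""
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(t, y)
        t += h
    return y


def adams_bashforth2(f, y0, t0, t1, n):
    """Same problem with the two-step Adams-Bashforth rule."""
    h = (t1 - t0) / n
    t, y = t0, y0
    f_prev = f(t, y)
    # Bootstrap: the first step has no history, so use one Euler step.
    y = y + h * f_prev
    t += h
    for _ in range(n - 1):
        f_curr = f(t, y)
        # Extrapolate using the current and previous derivative.
        y = y + h * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr
        t += h
    return y


decay = lambda t, y: -y
exact = math.exp(-1.0)
err_euler = abs(euler(decay, 1.0, 0.0, 1.0, 20) - exact)
err_ab2 = abs(adams_bashforth2(decay, 1.0, 0.0, 1.0, 20) - exact)
```

On this toy problem `err_ab2` is smaller than `err_euler` for the same number of steps; whether the same benefit carries over to diffusion sampling is the open question raised above.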

- Table 1: CIFAR10 and CelebA image generation measured in FID. η = 1.0 and σ̂ are cases of DDPM (although Ho et al (2020) only considered T = 1000 steps, and S < T can be seen as simulating DDPMs trained with S steps), and η = 0.0 indicates DDIM.
- Table 2: Reconstruction error with DDIM on CIFAR-10 test set, rounded to 10^−4.
- Table 3: LSUN Bedroom and Church image generation results, measured in FID. For 1000 steps DDPM, the FIDs are 6.36 for Bedroom and 7.89 for Church.
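
The η and σ̂ settings in the Table 1 caption can be related to a per-step noise scale. The sketch below assumes the interpolation usually given for DDIM, where η scales the standard deviation between the deterministic case (η = 0) and a DDPM-like case (η = 1), and where σ̂ is the larger-variance DDPM choice. The function names are hypothetical and the `alpha` arguments are cumulative schedule values with `alpha_t < alpha_prev`.

```python
import math


def sigma_eta(alpha_t, alpha_prev, eta):
    """Noise scale for one step, interpolated by eta (assumed DDIM form)."""
    return (
        eta
        * math.sqrt((1.0 - alpha_prev) / (1.0 - alpha_t))
        * math.sqrt(1.0 - alpha_t / alpha_prev)
    )


def sigma_hat(alpha_t, alpha_prev):
    """Larger-variance DDPM-style noise scale (assumed form)."""
    return math.sqrt(1.0 - alpha_t / alpha_prev)


# Example with made-up schedule values for two neighbouring timesteps.
s_ddim = sigma_eta(0.5, 0.8, eta=0.0)   # deterministic
s_ddpm = sigma_eta(0.5, 0.8, eta=1.0)   # stochastic
s_hat = sigma_hat(0.5, 0.8)
```

With these definitions the η = 1 scale never exceeds σ̂, because the extra factor sqrt((1 − α_prev) / (1 − α_t)) is at most 1 when `alpha_t < alpha_prev`.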

Related work

- Our work is based on a large family of existing methods on learning generative models as transition operators of Markov chains (Sohl-Dickstein et al, 2015; Bengio et al, 2014; Salimans et al, 2014; Song et al, 2017; Goyal et al, 2017; Levy et al, 2017). Among them, denoising diffusion probabilistic models (DDPMs, Ho et al (2020)) and noise conditional score networks (NCSN, Song & Ermon (2019; 2020)) have recently achieved high sample quality comparable to GANs (Brock et al, 2018; Karras et al, 2018). DDPMs optimize a variational lower bound to the log-likelihood, whereas NCSNs optimize the score matching objective (Hyvarinen, 2005) over a nonparametric Parzen density estimator of the data (Vincent, 2011; Raphan & Simoncelli, 2011).

Despite their different motivations, DDPMs and NCSNs are closely related. Both use a denoising autoencoder objective for many noise levels, and both use a procedure similar to Langevin dynamics to produce samples (Neal et al, 2011). Since Langevin dynamics is a discretization of a gradient flow (Jordan et al, 1998), both DDPM and NCSN require many steps to achieve good sample quality. This aligns with the observation that DDPM and existing NCSN methods have trouble generating high-quality samples in a few iterations.

Study subjects and analysis

CIFAR10 and CelebA samples with dim(τ) = 10 and dim(τ) = 100. Hours to sample 50k images with one Nvidia 2080 Ti GPU and samples at different steps. Samples from DDIM with the same random x_T and different number of steps.

Reference

- Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, January 2017.
- David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Seeing what a gan cannot generate. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4502–4511, 2019.
- Yoshua Bengio, Eric Laufer, Guillaume Alain, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. In International Conference on Machine Learning, pp. 226–234, January 2014.
- Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
- Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, September 2018.
- John Charles Butcher and Nicolette Goodwin. Numerical methods for ordinary differential equations, volume 2. Wiley Online Library, 2008.
- Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, September 2020.
- Ricky T Q Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, June 2018.
- Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, May 2016.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
- Anirudh Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. In Advances in Neural Information Processing Systems, pp. 4392–4402, 2017.
- Will Grathwohl, Ricky T Q Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, October 2018.
- Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5769–5779, 2017.
- Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two Time-Scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500, June 2017.
- Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, June 2020.
- Aapo Hyvarinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.
- Alexia Jolicoeur-Martineau, Remi Piche-Taillefer, Remi Tachet des Combes, and Ioannis Mitliagkas. Adversarial score matching and improved sampling for image generation. September 2020.
- Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
- Tero Karras, Samuli Laine, and Timo Aila. A Style-Based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, December 2018.
- Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119, 2020.
- Diederik P Kingma and Max Welling. Auto-Encoding variational bayes. arXiv preprint arXiv:1312.6114v10, December 2013.
- Daniel Levy, Matthew D Hoffman, and Jascha Sohl-Dickstein. Generalizing hamiltonian monte carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.
- Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, October 2016.
- Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
- Alejandro F Queiruga, N Benjamin Erichson, Dane Taylor, and Michael W Mahoney. Continuous-in-depth neural networks. arXiv preprint arXiv:2008.02389, 2020.
- Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision. Neural computation, 23(2):374–420, February 2011. ISSN 0899-7667, 1530-888X.
- Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, May 2015.
- Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
- Tim Salimans, Diederik P Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. arXiv preprint arXiv:1410.6460, October 2014.
- Ken Shoemake. Animating rotation with quaternion curves. In Proceedings of the 12th annual conference on Computer graphics and interactive techniques, pp. 245–254, 1985.
- Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, March 2015.
- Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-nice-mc: Adversarial training for mcmc. arXiv preprint arXiv:1706.07561, June 2017.
- Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600, July 2019.
- Yang Song and Stefano Ermon. Improved techniques for training Score-Based generative models. arXiv preprint arXiv:2006.09011, June 2020.
- Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, September 2016a.
- Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, January 2016b.
- Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, May 2016.
- Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study. In Advances in Neural Information Processing Systems, pp. 10792–10801, 2018.
- Under review as a conference paper at ICLR 2021.
- From Bishop (2006) (2.115), we have that q(x_{t−1} | x_0) is Gaussian and
- We note that in Ho et al. (2020), a diffusion hyperparameter β_t is first introduced, followed by the derived variables α_t := 1 − β_t and ᾱ_t := ∏_{s=1}^{t} α_s.
- In this paper, we have used the notation α_t to represent the variable ᾱ_t in Ho et al. (2020) for three reasons. First, it makes it clearer that we only need to choose one set of hyperparameters, reducing possible cross-references of the derived variables. Second, it allows us to introduce the generalization as well as the acceleration case more easily, because the inference process is no longer motivated by a diffusion. Third, there exists an isomorphism between α_{1:T} and 1, ..., T, which is not the case for β_t.
- In this section we use teal to color notations used in Ho et al. (2020).
- In this section, we use β_t and α_t to be more consistent with the derivation in Ho et al. (2020), where α_t
- Ho et al. (2020) considered a specific type of p_θ^{(t)}(x_{t−1} | x_t): p_θ^{(t)}(x_{t−1} | x_t) = N(μ_θ(x_t, t), σ_t I).
- Ho et al. (2020) chose the parametrization μ_θ(x_t, t)
- We consider 4 image datasets with various resolutions: CIFAR10 (32 × 32, unconditional), CelebA (64 × 64), LSUN Bedroom (256 × 256) and LSUN Church (256 × 256). For all datasets, we set the hyperparameters α according to the heuristic in Ho et al. (2020) to make the results directly comparable. We use the same model for each dataset, and only compare the performance of different generative processes. For CIFAR10, Bedroom and Church, we obtain the pretrained checkpoints from the original DDPM implementation; for CelebA, we trained our own model using the denoising objective L_1.
- Our architecture for ε_θ^{(t)}(x_t) follows that in Ho et al. (2020), which is a U-Net (Ronneberger et al., 2015) based on a Wide ResNet (Zagoruyko & Komodakis, 2016). We use the pretrained models from Ho et al. (2020) for CIFAR10, Bedroom and Church, and train our own model for CelebA 64 × 64 (since a pretrained model is not provided). Our CelebA model has five feature map resolutions from 64 × 64 to 4 × 4, and we use the original CelebA dataset (not CelebA-HQ) with the pre-processing technique from the StyleGAN (Karras et al., 2018) repository.
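
The notation remarks above boil down to a cumulative product: the α_t used in this paper is the running product of (1 − β_s) over the per-step β values of Ho et al. (2020). The sketch below shows that conversion. The linear β schedule from 1e-4 to 0.02 over T = 1000 steps is an assumption made for illustration (it is the commonly used DDPM default, not something stated in this text), and `cumulative_alphas` is a hypothetical helper name.

```python
import numpy as np


def cumulative_alphas(betas):
    """Map per-step betas to the cumulative alphas used in DDIM notation."""
    betas = np.asarray(betas, dtype=np.float64)
    # alpha_t (DDIM) = prod_{s <= t} (1 - beta_s), i.e. alpha-bar in DDPM.
    return np.cumprod(1.0 - betas)


# Assumed schedule: linearly increasing betas over T = 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alphas = cumulative_alphas(betas)
```

The resulting sequence decreases monotonically from just below 1 toward 0, which is the one-to-one correspondence between α_{1:T} and the timesteps 1, ..., T that the third reason above relies on.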
