# Autoregressive Score Matching

NeurIPS 2020.

Abstract:

Autoregressive models use the chain rule to define a joint probability distribution as a product of conditionals. These conditionals need to be normalized, which constrains the functional families that can be used. To increase flexibility, we propose autoregressive conditional score models (AR-CSM) where we parameterize the joint dis...

Introduction

- Autoregressive models play a crucial role in modeling high-dimensional probability distributions.
- Score matching (SM) [9] trains energy-based models (EBMs) by minimizing the Fisher divergence between the model and data distributions.
- It compares distributions in terms of their log-likelihood gradients and completely circumvents the intractable partition function.
- Let x = (x_1, ..., x_D) ∈ ℝ^D
- Autoregressive energy machines [15] first learn a set of one-dimensional conditional energies E_θ(x_d | x_{<d})
- Approximating the partition function introduces bias into optimization and requires extra computation and memory usage, lowering the training efficiency
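
To make the point about circumventing the partition function concrete, here is a minimal sketch of the score matching objective of [9] for a one-dimensional Gaussian energy model. The toy data, the parameter names `mean` and `var`, and the helper `sm_loss` are assumptions made for illustration; they do not come from the paper.

```python
import random


def sm_loss(xs, mean, var):
    """Score matching loss for an unnormalized 1-D Gaussian energy model.

    Model score: s(x) = d/dx log q(x) = -(x - mean) / var, so ds/dx = -1 / var.
    Loss: average over the data of 0.5 * s(x)**2 + ds/dx.
    The normalizing constant of q never appears.
    """
    total = 0.0
    for x in xs:
        s = -(x - mean) / var
        total += 0.5 * s * s - 1.0 / var
    return total / len(xs)


# Toy data (assumption): samples from a Gaussian with mean 2.0 and std 1.5.
rng = random.Random(0)
data = [rng.gauss(2.0, 1.5) for _ in range(2000)]

# For this model the loss is minimized in closed form by the sample mean
# and the (biased) sample variance.
m_hat = sum(data) / len(data)
v_hat = sum((x - m_hat) ** 2 for x in data) / len(data)

best = sm_loss(data, m_hat, v_hat)
shifted_mean = sm_loss(data, m_hat + 0.5, v_hat)
doubled_var = sm_loss(data, m_hat, 2.0 * v_hat)
print(best, shifted_mean, doubled_var)
```

The loss is smallest at the sample mean and variance, and moving either parameter away increases it, even though the model was never normalized.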

Highlights

- Autoregressive models play a crucial role in modeling high-dimensional probability distributions
- We show that autoregressive conditional score models (AR-CSM) can be used for single-step denoising [22, 28] and report the denoising results for MNIST, with noise level σ = 0.6 in the rescaled space, in Figure 5
- We propose a divergence between distributions, named Composite Score Matching (CSM), which depends only on the derivatives of univariate log-conditionals of the model
- Based on CSM divergence, we introduce a family of models dubbed AR-CSM, which allows us to expand the capacity of existing autoregressive likelihood-based models by removing the normalizing constraints of conditional distributions
- Despite the empirical success of AR-CSM, sampling from the model is relatively slow since each variable has to be sampled sequentially according to some order
- It would be interesting to investigate methods that accelerate the sampling procedure in AR-CSMs, or consider more efficient variable orders that could be learned from data
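
The idea that CSM depends only on derivatives of univariate log-conditionals can be sketched on a two-dimensional toy problem. The objective below, a sum over dimensions of 0.5 · s_d² + ∂s_d/∂x_d averaged over the data, is our reading of the CSM training objective and should be treated as an assumption; the linear-Gaussian conditional scores and the parameters `m1`, `v1`, `a`, `v2` are hypothetical and chosen only so the derivatives are available in closed form.

```python
import random


def csm_loss(data, m1, v1, a, v2):
    """CSM-style objective for a hypothetical 2-D autoregressive score model.

    s_1(x1)     = -(x1 - m1) / v1        # d/dx1 log q(x1)
    s_2(x1, x2) = -(x2 - a * x1) / v2    # d/dx2 log q(x2 | x1)
    Objective: mean over data of sum_d [0.5 * s_d**2 + d s_d / d x_d].
    Each term only needs a derivative with respect to a single variable.
    """
    total = 0.0
    for x1, x2 in data:
        s1 = -(x1 - m1) / v1
        s2 = -(x2 - a * x1) / v2
        total += (0.5 * s1 * s1 - 1.0 / v1) + (0.5 * s2 * s2 - 1.0 / v2)
    return total / len(data)


# Toy data (assumption): x1 ~ N(0, 1), x2 = 0.8 * x1 + noise with std 0.5.
rng = random.Random(0)
data = []
for _ in range(5000):
    x1 = rng.gauss(0.0, 1.0)
    x2 = 0.8 * x1 + rng.gauss(0.0, 0.5)
    data.append((x1, x2))

loss_true = csm_loss(data, m1=0.0, v1=1.0, a=0.8, v2=0.25)
loss_no_dependence = csm_loss(data, m1=0.0, v1=1.0, a=0.0, v2=0.25)
print(loss_true, loss_no_dependence)
```

The parameters that match the data-generating conditionals give a clearly lower loss than a model that ignores the dependence of x2 on x1, and no conditional was ever normalized.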

Results

- The authors compare the samples from AR-CSM with the ones from MADE and PixelCNN++ with similar autoregressive architectures but trained via maximum likelihood estimation.
- The authors observe that the MADE model trained by CSM is able to generate sharper and higher quality samples than its maximum-likelihood counterpart using Gaussian densities.
- The authors show that AR-CSM can be used for single-step denoising [22, 28] and report the denoising results for MNIST, with noise level σ = 0.6 in the rescaled space in Figure 5
- These results qualitatively demonstrate the effectiveness of AR-CSM for image denoising, showing that the models are sufficiently expressive to capture complex distributions and solve difficult tasks.
- Samples for the other methods can be found in Appendix E
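
Single-step denoising in the sense of [22, 28] is commonly written as x̃ + σ² · ∇ log q_σ(x̃), where q_σ is the density of the noisy data. The sketch below applies that rule to a one-dimensional Gaussian, where the noisy score is known exactly; the prior parameters `mu` and `tau2` and the helper names are assumptions for illustration, not the MNIST setup reported above.

```python
def denoise_one_step(x_noisy, score_fn, sigma):
    """Single-step denoising: x_noisy + sigma**2 * score of the noisy density."""
    return x_noisy + sigma ** 2 * score_fn(x_noisy)


# Toy setup (assumption): clean data ~ N(mu, tau2), Gaussian noise with std sigma.
mu, tau2, sigma = 0.0, 1.0, 0.6


def noisy_score(x):
    # The noisy marginal is N(mu, tau2 + sigma**2), so its score is linear in x.
    return -(x - mu) / (tau2 + sigma ** 2)


x_noisy = 1.36
x_hat = denoise_one_step(x_noisy, noisy_score, sigma)

# For Gaussians the denoised value should equal the posterior mean of the
# clean sample given the noisy one.
posterior_mean = (tau2 * x_noisy + sigma ** 2 * mu) / (tau2 + sigma ** 2)
print(x_hat, posterior_mean)
```

In a learned model the analytic `noisy_score` would be replaced by the trained score network evaluated on the noisy input.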

Conclusion

- The authors propose a divergence between distributions, named Composite Score Matching (CSM), which depends only on the derivatives of univariate log-conditionals of the model.
- Despite the empirical success of AR-CSM, sampling from the model is relatively slow since each variable has to be sampled sequentially according to some order.
- It would be interesting to investigate methods that accelerate the sampling procedure in AR-CSMs, or consider more efficient variable orders that could be learned from data.
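
The sequential cost mentioned above can be seen in a small sketch that samples one variable at a time, running a one-dimensional Langevin chain on each conditional score before moving on to the next dimension. The choice of unadjusted Langevin dynamics, the step size, the number of steps, and the toy conditional scores are all assumptions for illustration rather than the authors' sampler settings.

```python
import math
import random


def langevin_1d(score_fn, x0, step, n_steps, rng):
    """Unadjusted Langevin dynamics in one dimension."""
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * step * score_fn(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x


def sample_autoregressive(cond_scores, step, n_steps, rng):
    """Sample x_1, ..., x_D in order.

    cond_scores[d](prefix, x) returns the score of x given the already
    sampled prefix. Dimension d cannot start until all earlier dimensions
    are finished, which is what makes sampling sequential.
    """
    xs = []
    for cond_score in cond_scores:
        prefix = tuple(xs)
        xs.append(langevin_1d(lambda x: cond_score(prefix, x), 0.0, step, n_steps, rng))
    return xs


# Hypothetical conditional scores: x1 ~ N(1, 1), x2 | x1 ~ N(0.8 * x1, 0.25).
cond_scores = [
    lambda prefix, x: -(x - 1.0) / 1.0,
    lambda prefix, x: -(x - 0.8 * prefix[0]) / 0.25,
]

rng = random.Random(0)
samples = [sample_autoregressive(cond_scores, 0.05, 200, rng) for _ in range(300)]
mean_x1 = sum(s[0] for s in samples) / len(samples)
mean_x2 = sum(s[1] for s in samples) / len(samples)
print(mean_x1, mean_x2)
```

The total work is D chains run back to back, so the cost grows linearly with the number of variables and cannot be parallelized across dimensions within one sample.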


- Table 1: AUROC scores for models trained on hθ(x)
- Table 2: VAE results on MNIST and CelebA

Related work

- Likelihood-based deep generative models (e.g., flow models, autoregressive models) have been widely used for modeling high dimensional data distributions. Although such models have achieved promising results, they tend to have extra constraints which could limit the model performance. For instance, flow [2, 10] and autoregressive [27, 17] models require normalized densities, while variational auto-encoders (VAE) [11] need to use surrogate losses.

Unnormalized statistical models allow one to use more flexible networks, but require new training strategies. Several approaches have been proposed to train unnormalized statistical models, each with its own limitations. Ref. [3] uses Langevin dynamics together with a sample replay buffer to train an energy-based model, which requires additional passes through a deep neural network for sampling during training. Ref. [31] proposes a variational framework that trains energy-based models by minimizing general f-divergences, which also requires expensive Langevin dynamics to obtain samples during training. Ref. [15] approximates the unnormalized density using importance sampling, which introduces bias during optimization and requires extra computation during training. Other approaches model the log-likelihood gradients (scores) of the distributions. For instance, score matching (SM) [9] trains an unnormalized model by minimizing the Fisher divergence, which introduces a new term that is expensive to compute for high-dimensional data. Denoising score matching [28] is a variant of score matching that is fast to train. However, its performance can be very sensitive to the choice of perturbing noise distribution, and heuristics have to be used to select the noise level in practice. Sliced score matching [25] approximates SM by projecting the scores onto random vectors. Although it can be trained on high-dimensional data much more efficiently than SM, it trades computational cost against the variance introduced by approximating the SM objective. By contrast, CSM is a deterministic objective function that is efficient and stable to optimize.
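
The variance trade-off of random projections can be illustrated with the trace term alone. Plain score matching needs the trace of the Jacobian of the score, and a projection-based method replaces it with vᵀJv for random v. The sketch below is an illustration of that general idea, not the implementation of [25]; the matrix `J` is a made-up stand-in for a score Jacobian, and sign vectors are used so that the average can be checked exactly by enumeration.

```python
import itertools


def quad_form(J, v):
    """Compute v^T J v for a square matrix J given as nested lists."""
    n = len(v)
    return sum(v[i] * J[i][j] * v[j] for i in range(n) for j in range(n))


# Hypothetical 3x3 Jacobian of a score function at one data point.
J = [
    [2.0, 0.5, -1.0],
    [0.3, -1.0, 0.7],
    [0.0, 0.2, 4.0],
]

exact_trace = sum(J[i][i] for i in range(3))

# Enumerate every sign vector in {-1, +1}^3. Each projection is one noisy
# estimate of the trace; their average recovers it exactly.
projections = [quad_form(J, v) for v in itertools.product((-1.0, 1.0), repeat=3)]
mean_projection = sum(projections) / len(projections)
spread = max(projections) - min(projections)
print(exact_trace, mean_projection, spread)
```

A single projection is cheap but can land far from the true trace, which is the variance that a deterministic objective avoids.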

Funding

- This research was supported by TRI, Amazon AWS, NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), and FLI.

Study subjects and analysis

commonly used image datasets: 3

Our results show that CSM is more stable to optimize and more scalable to high-dimensional data than previous score matching methods. We then perform density estimation on 2-d synthetic datasets (see Appendix B) and three commonly used image datasets: MNIST, CIFAR-10 [12] and CelebA [13]. We further show that our method can also be applied to image denoising and anomaly detection, illustrating its broad applicability.

image datasets: 3

In this section, we show that our method is also capable of modeling natural images. We focus on three image datasets, namely MNIST, CIFAR-10, and CelebA. Setup: We select two existing autoregressive models, MADE [4] and PixelCNN++ [21], as the autoregressive architectures for AR-CSM.

Reference

- A. P. Dawid and M. Musio. Theory and applications of proper scoring rules. Metron, 72(2):169–183, 2014.
- L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
- Y. Du and I. Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
- M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- U. Grenander and M. I. Miller. Representations of knowledge in complex systems. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):549–581, 1994.
- M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in neural information processing systems, pages 6626–6637, 2017.
- F. Huszár. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.
- A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
- D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
- D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
- J. Martens, I. Sutskever, and K. Swersky. Estimating the Hessian by back-propagating curvature. arXiv preprint arXiv:1206.6464, 2012.
- C. Nash and C. Durkan. Autoregressive energy machines. arXiv preprint arXiv:1904.05626, 2019.
- R. M. Neal. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
- A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
- G. Parisi. Correlation functions and computer simulations. Nuclear Physics B, 180(3):378–384, 1981.
- G. O. Roberts, R. L. Tweedie, et al. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
- T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
- S. Saremi, A. Mehrjou, B. Schölkopf, and A. Hyvärinen. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.
- J. Shi, S. Sun, and J. Zhu. A spectral approach to gradient estimation for implicit distributions. arXiv preprint arXiv:1806.02925, 2018.
- Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11895–11907, 2019.
- Y. Song, S. Garg, J. Shi, and S. Ermon. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, page 204, 2019.
- C. M. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135–1151, 1981.
- A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in neural information processing systems, pages 4790–4798, 2016.
- P. Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
- O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688, 2011.
- L. Yu, Y. Song, J. Song, and S. Ermon. Training deep energy-based models with f-divergence minimization. arXiv preprint arXiv:2003.03463, 2020.
