# SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models

ICLR, 2020.

Abstract:

Standard variational lower bounds used to train latent variable models produce biased estimates of most quantities of interest. We introduce an unbiased estimator of the log marginal likelihood and its gradients for latent variable models based on randomized truncation of infinite series. If parameterized by an encoder-decoder architectur...

Introduction

- Latent variable models are powerful tools for constructing highly expressive data distributions and for understanding how high-dimensional observations might possess a simpler representation.
- More recently, there has been a surge of interest in probabilistic latent variable models that incorporate flexible nonlinear likelihoods via deep neural networks (Kingma & Welling, 2014).
- These models can blend the advantages of highly structured probabilistic priors with the empirical successes of deep learning (Johnson et al., 2016; Luo et al., 2018).
- These explicit latent variable models can often yield relatively interpretable representations, in which simple interpolation in the latent space can lead to semantically meaningful changes in high-dimensional observations (e.g., Higgins et al., 2017).

Highlights

- Latent variable models are often framed as probabilistic graphical models, allowing these relationships to be expressed in terms of conditional independence
- Although there is no theoretical guarantee that this estimator has finite variance, we find that it can work well in practice. We show that this unbiased estimator can train latent variable models to achieve higher test log-likelihood than lower bound estimators at the same expected compute cost
- We introduced SUMO, a new unbiased estimator of the log probability for latent variable models, and demonstrated tasks for which this estimator performs better than standard lower bounds
- It may be fruitful to investigate the use of convex combination of consistent estimators within the SUMO approach, as any convex combination is unbiased, or to apply variance reduction methods to increase stability of training with SUMO

Conclusion

- The authors introduced SUMO, a new unbiased estimator of the log probability for latent variable models, and demonstrated tasks for which this estimator performs better than standard lower bounds.
- The authors investigated applications involving entropy maximization where a lower bound performs poorly, but the unbiased estimator can train properly with a relatively small amount of compute.
- The authors plan to investigate new families of gradient-based optimizers which can handle heavy-tailed stochastic gradients.
- It may be fruitful to investigate the use of convex combinations of consistent estimators within the SUMO approach, as any convex combination is unbiased, or to apply variance reduction methods to increase the stability of training with SUMO.
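
As a rough illustration of the estimator summarized in this section, the sketch below applies randomized truncation to the telescoping series of IWAE bounds on a toy Gaussian model whose log marginal is known in closed form. The toy model, the use of the prior as proposal, the tail distribution P(K ≥ k) = 1/k, the names `log_weight`, `true_log_marginal`, and `sumo_estimate`, and the `max_terms` cap are all assumptions made for illustration, not details taken from this summary.

```python
import numpy as np

LOG_2PI = np.log(2.0 * np.pi)


def log_weight(x, z):
    # Toy model (assumed): z ~ N(0, 1), x | z ~ N(z, 1), proposal q(z | x) = prior.
    # With the prior as proposal, the importance weight is simply p(x | z).
    return -0.5 * LOG_2PI - 0.5 * (x - z) ** 2


def true_log_marginal(x):
    # The marginal of the toy model is N(x; 0, 2).
    return -0.5 * np.log(4.0 * np.pi) - x ** 2 / 4.0


def sumo_estimate(x, rng, max_terms=1_000_000):
    # Draw K >= 1 with P(K >= k) = 1/k (assumed tail distribution).
    u = rng.uniform()
    K = int(min(np.floor(1.0 / (1.0 - u)), max_terms))
    z = rng.standard_normal(K + 1)
    # A running log-sum-exp yields IWAE_1, ..., IWAE_{K+1} from the same samples.
    cum_lse = np.logaddexp.accumulate(log_weight(x, z))
    iwae = cum_lse - np.log(np.arange(1, K + 2))
    deltas = np.diff(iwae)          # Delta_k = IWAE_{k+1} - IWAE_k, k = 1..K
    inv_tail = np.arange(1, K + 1)  # 1 / P(K >= k) = k
    return float(iwae[0] + np.sum(inv_tail * deltas))
```

Averaging many draws of `sumo_estimate(x, rng)` should approach `true_log_marginal(x)`, whereas averaging `log_weight` over single samples (the one-sample lower bound) stays strictly below it. The `max_terms` cap only bounds memory; if it ever triggers, the expectation becomes that of an IWAE bound with `max_terms + 1` samples rather than the exact log marginal.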

- Table 1: Test negative log-likelihood of the trained model, estimated using IWAE (k=5000). For SUMO, k refers to the expected number of computed terms.
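
For reference, the IWAE(k) quantity named in the caption is the k-sample importance-weighted bound of Burda et al. (2016). A sketch in standard notation follows; the symbols $q_\phi$, $z_i$, and $w_i$ are notation assumed here rather than taken from this summary.

```latex
\mathrm{IWAE}_k(x) = \log \frac{1}{k} \sum_{i=1}^{k} w_i,
\qquad w_i = \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)},
\qquad z_i \overset{\text{iid}}{\sim} q_\phi(z \mid x)

\mathbb{E}\left[\mathrm{IWAE}_k(x)\right]
  \le \mathbb{E}\left[\mathrm{IWAE}_{k+1}(x)\right]
  \le \log p_\theta(x)
```

Because the bound tightens as k grows, a large k such as 5000 gives a close lower bound on the log-likelihood, so a negative log-likelihood reported this way is, in expectation, a slight overestimate of the true value.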

Related work

- There is a long history in Bayesian statistics of marginal likelihood estimation in the service of model selection. The harmonic mean estimator (Newton & Raftery, 1994), for example, has a long (and notorious) history as a consistent estimator of the marginal likelihood that may have infinite variance (Murray & Salakhutdinov, 2009) and exhibits simulation pseudo-bias (Lenk, 2009). The Chib estimator (Chib, 1995), the Laplace approximation, and nested sampling (Skilling, 2006) are alternative proposals that can often have better properties (Murray & Salakhutdinov, 2009). Annealed importance sampling (Neal, 2001) probably represents the gold standard for marginal likelihood estimation. These, however, turn into consistent estimators at best when estimating the log marginal probability (Rainforth et al., 2018a). Bias removal schemes such as jackknife variational inference (Nowozin, 2018) have been proposed to debias log-evidence estimation, IWAE in particular. Hierarchical IWAE (Huang et al., 2019) uses a joint proposal to induce negative correlation among samples and connects the convergence of the variance of the estimator to the convergence of the lower bound.
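
The reason unbiased marginal likelihood estimators become at best consistent for the log marginal probability follows from Jensen's inequality. A short sketch, where $\hat{p}(x)$ denotes any unbiased estimator of $p(x)$ (notation assumed here):

```latex
\mathbb{E}\left[\log \hat{p}(x)\right] \le \log \mathbb{E}\left[\hat{p}(x)\right] = \log p(x)

\mathbb{E}\left[\log \hat{p}(x)\right] \approx \log p(x)
  - \frac{\operatorname{Var}\left[\hat{p}(x)\right]}{2\, p(x)^2}
```

The second line is the usual second-order delta-method approximation. It suggests that the bias vanishes only as the variance of $\hat{p}(x)$ shrinks, for example at rate $1/k$ when $\hat{p}(x)$ averages $k$ i.i.d. importance weights.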

Funding

- This work was partially funded by NSF IIS-1421780
- Y.L. and J.Z. were supported by the NSF China Project (No. 61620106010), Beijing NSF Project (No. L172037), the JP Morgan Faculty Research Program and the NVIDIA NVAIL Program with GPU/DGX Acceleration

Reference

- James Arvo and David Kirk. Particle transport and image synthesis. ACM SIGGRAPH Computer Graphics, 24(4):63–66, 1990.
- Robert Bamler, Cheng Zhang, Manfred Opper, and Stephan Mandt. Perturbative black box variational inference. In Advances in Neural Information Processing Systems. 2017.
- Alex Beatson and Ryan P. Adams. Efficient optimization of loops and limits with randomized telescoping sums. In International Conference on Machine Learning, 2019.
- Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.
- David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
- Endre Boros and Peter L Hammer. Pseudo-Boolean optimization. Discrete Applied Mathematics, 123(1-3):155–225, 2002.
- Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.
- Ricky TQ Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, 2019.
- Siddhartha Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321, 1995.
- Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
- Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In International Conference on Learning Representations, 2017.
- Justin Domke and Daniel R Sheldon. Importance weighting and variational inference. In Advances in Neural Information Processing Systems, pp. 4470–4479, 2018.
- Paul Fearnhead, Omiros Papaspiliopoulos, and Gareth O Roberts. Particle filters for partially observed diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):755–777, 2008.
- George E Forsythe and Richard A Leibler. Matrix inversion by a Monte Carlo method. Mathematics of Computation, 4(31):127–129, 1950.
- Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. In International Conference on Machine Learning, pp. 1846–1855, 2018.
- Insu Han, Haim Avron, and Jinwoo Shin. Stochastic Chebyshev gradient descent for spectral optimization. In Advances in Neural Information Processing Systems, pp. 7386–7396, 2018.
- Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
- Emiel Hoogeboom, Jorn WT Peters, Rianne van den Berg, and Max Welling. Integer discrete flows and lossless compression. arXiv preprint arXiv:1905.07376, 2019.
- Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, 2018.
- Chin-Wei Huang, Kris Sankaran, Eeshan Dhekane, Alexandre Lacoste, and Aaron Courville. Hierarchical importance weighted autoencoders. In International Conference on Machine Learning, pp. 2869–2878, 2019.
- Pierre E Jacob and Alexandre H Thiery. On nonnegative unbiased estimators. The Annals of Statistics, 43(2):769–784, 2015.
- Pierre E Jacob, John O’Leary, and Yves F Atchadé. Unbiased Markov chain Monte Carlo with couplings. arXiv preprint arXiv:1708.03625, 2017.
- Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pp. 2946–2954, 2016.
- Herman Kahn. Use of different Monte Carlo sampling techniques. Santa Monica, CA: RAND Corporation, 1955. URL https://www.rand.org/pubs/papers/P766.html.
- Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
- Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.
- Julius Kuti. Stochastic method for the numerical study of lattice fermions. Physical Review Letters, 49(3):183, 1982.
- Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
- Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Peter Lenk. Simulation pseudo-bias correction to the harmonic mean estimator of integrated likelihoods. Journal of Computational and Graphical Statistics, 18(4):941–960, 2009.
- Yucen Luo, Tian Tian, Jiaxin Shi, Jun Zhu, and Bo Zhang. Semi-crowdsourced clustering with deep generative models. In Advances in Neural Information Processing Systems, pp. 3212–3222, 2018.
- Anne-Marie Lyne, Mark Girolami, Yves Atchadé, Heiko Strathmann, Daniel Simpson, et al. On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science, 30(4):443–467, 2015.
- Don McLeish. A general method for debiasing a Monte Carlo estimator. Monte Carlo Methods and Applications, 17(4):301–315, 2011.
- Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann Publishers Inc., 2001.
- Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
- Iain Murray and Ruslan Salakhutdinov. Evaluating probabilities under high-dimensional latent variable models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (eds.), Advances in Neural Information Processing Systems 21, pp. 1137–1144. 2009.
- Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
- Radford M Neal et al. Slice sampling. The Annals of Statistics, 31(3):705–767, 2003.
- Michael A Newton and Adrian E Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society: Series B (Methodological), 56(1):3–26, 1994.
- Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, pp. 1723–1731, 2016.
- Sebastian Nowozin. Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference. In International Conference on Learning Representations, 2018.
- Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. Parallel Wavenet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, 2018.
- Tom Rainforth, Robert Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood. On nesting Monte Carlo estimators. In International Conference on Machine Learning, 2018a.
- Tom Rainforth, Adam R Kosiorek, Tuan Anh Le, Chris J Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. In International Conference on Machine Learning, 2018b.
- Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
- Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.
- Chang-han Rhee and Peter W Glynn. A new approach to unbiased estimation for SDEs. In Proceedings of the Winter Simulation Conference, pp. 17. Winter Simulation Conference, 2012.
- Chang-han Rhee and Peter W Glynn. Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026–1043, 2015.
- Francisco JR Ruiz, Michalis K Titsias, and David M Blei. Overdispersed black-box variational inference. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016.
- Tomasz Rychlik. Unbiased nonparametric estimation of the derivative of the mean. Statistics & probability letters, 10(4):329–333, 1990.
- Tomasz Rychlik. A class of unbiased kernel estimates of a probability density function. Applicationes Mathematicae, 22(4):485–497, 1995.
- John Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–859, 2006.
- Jerome Spanier and Ely M Gelbard. Monte Carlo Principles and Neutron Transport Problems. Addison-Wesley Publishing Company, 1969.
- which means the series converges absolutely. This is a sufficient condition for finite expectation of the Russian roulette estimator (Chen et al. (2019); Lemma 3). Applying equation 7 to the series:
- We follow the analysis of JVI (Nowozin, 2018), which applied the delta method for moments to show the asymptotic results on the bias and variance of
- For clarity, let $C_k = Y_k - \mu$ be the zero-mean random variable. Nowozin (2018) gives the relations
- for some constant $c$. Then we have $\mathbb{E}[\nabla_\theta \mathrm{SUMO}(x)] = \nabla_\theta \mathbb{E}[\mathrm{SUMO}(x)] = \nabla_\theta \log p_\theta(x)$ directly by the dominated convergence theorem, as long as SUMO is everywhere differentiable, which is satisfied by all of our experiments. If ReLU neural networks are to be used, one may be able to show the same property using Theorem 5 of Bińkowski et al. (2018), assuming finite higher moments and Lipschitz continuity.
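
To make the Russian roulette estimator mentioned in these excerpts concrete, here is a minimal sketch on an absolutely convergent toy series, $\sum_{k \ge 0} 0.5^k = 2$. The geometric stopping rule, the `continue_prob` parameter, and the function names are hypothetical choices for illustration; they are not the configuration used in the paper.

```python
import numpy as np


def geometric_term(k):
    # Terms of the toy series sum_{k>=0} 0.5**k, whose exact value is 2.
    return 0.5 ** k


def russian_roulette_sum(term, rng, continue_prob=0.8):
    # Draw a random truncation point K >= 0 with P(K >= k) = continue_prob**k.
    K = rng.geometric(1.0 - continue_prob) - 1
    ks = np.arange(K + 1)
    tail_prob = continue_prob ** ks
    # Reweight each kept term by 1 / P(K >= k) so the expectation equals the full sum.
    return float(np.sum(term(ks) / tail_prob))


rng = np.random.default_rng(0)
samples = [russian_roulette_sum(geometric_term, rng) for _ in range(20000)]
print("mean of estimates:", np.mean(samples), "exact sum: 2.0")
```

Each draw evaluates only finitely many terms, yet the reweighting keeps the expectation equal to the infinite sum. In this toy case the reweighted terms decay like $0.625^k$, so every estimate is bounded and the variance is finite; with slower-decaying terms or a lighter-tailed stopping rule that need not hold.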
