SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models

ICLR, 2020.

Abstract:

Standard variational lower bounds used to train latent variable models produce biased estimates of most quantities of interest. We introduce an unbiased estimator of the log marginal likelihood and its gradients for latent variable models based on randomized truncation of infinite series. If parameterized by an encoder-decoder architecture...
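
In symbols, the randomized truncation described in the abstract amounts to a Russian-roulette reweighting of the telescoping series of importance-weighted bounds. The following is a sketch in generic notation; the particular distribution p(K) over the truncation index, and how the inference network is adapted, are left unspecified here:

    \[
    \mathrm{IWAE}_k(x) = \log\Big(\tfrac{1}{k}\sum_{i=1}^{k} \tfrac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\Big),
    \qquad z_i \sim q_\phi(z \mid x),
    \qquad \mathbb{E}\big[\mathrm{IWAE}_k(x)\big] \xrightarrow[k \to \infty]{} \log p_\theta(x),
    \]
    \[
    \mathrm{SUMO}(x) = \mathrm{IWAE}_1(x) + \sum_{k=1}^{K} \frac{\mathrm{IWAE}_{k+1}(x) - \mathrm{IWAE}_k(x)}{\Pr(K \ge k)},
    \qquad K \sim p(K),
    \]
    \[
    \mathbb{E}\big[\mathrm{SUMO}(x)\big]
    = \mathbb{E}\big[\mathrm{IWAE}_1(x)\big] + \sum_{k=1}^{\infty} \mathbb{E}\big[\mathrm{IWAE}_{k+1}(x) - \mathrm{IWAE}_k(x)\big]
    = \log p_\theta(x),
    \]
    provided the series converges suitably (absolute convergence suffices).

Because every term of the infinite series is kept with the correct probability and reweighted accordingly, the expectation over both the samples and K recovers log pθ(x) exactly, at the cost of a random (and potentially heavy-tailed) amount of computation per evaluation.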

Introduction
  • Latent variable models are powerful tools for constructing highly expressive data distributions and for understanding how high-dimensional observations might possess a simpler representation.
  • More recently there has been a surge of interest in probabilistic latent variable models that incorporate flexible nonlinear likelihoods via deep neural networks (Kingma & Welling, 2014).
  • These models can blend the advantages of highly structured probabilistic priors with the empirical successes of deep learning (Johnson et al., 2016; Luo et al., 2018).
  • These explicit latent variable models can often yield relatively interpretable representations, in which simple interpolation in the latent space can lead to semantically-meaningful changes in high-dimensional observations (e.g., Higgins et al. (2017)).
Highlights
  • Latent variable models are powerful tools for constructing highly expressive data distributions and for understanding how high-dimensional observations might possess a simpler representation
  • Latent variable models are often framed as probabilistic graphical models, allowing these relationships to be expressed in terms of conditional independence
  • More recently there has been a surge of interest in probabilistic latent variable models that incorporate flexible nonlinear likelihoods via deep neural networks (Kingma & Welling, 2014). These models can blend the advantages of highly structured probabilistic priors with the empirical successes of deep learning (Johnson et al., 2016; Luo et al., 2018). These explicit latent variable models can often yield relatively interpretable representations, in which simple interpolation in the latent space can lead to semantically-meaningful changes in high-dimensional observations (e.g., Higgins et al. (2017)).
  • Although there is no theoretical guarantee that this estimator has finite variance, we find that it can work well in practice. We show that this unbiased estimator can train latent variable models to achieve higher test log-likelihood than lower-bound estimators at the same expected compute cost (a toy numerical sketch follows this list).
  • We introduced SUMO, a new unbiased estimator of the log probability for latent variable models, and demonstrated tasks for which this estimator performs better than standard lower bounds
  • It may be fruitful to investigate the use of convex combinations of estimators within the SUMO approach, since any convex combination of unbiased estimators is itself unbiased, or to apply variance-reduction methods to stabilize training with SUMO.
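
The following toy sketch illustrates the unbiasedness claim numerically; it is not the paper's implementation. The Gaussian model, the deliberately crude prior proposal, the P(K >= k) = 1/k truncation distribution and its cap, and the 5-sample IWAE baseline are all illustrative assumptions, chosen only so that log p(x) is known in closed form.

    # Toy check of a SUMO-style randomized-truncation estimator (illustrative assumptions throughout).
    import numpy as np

    rng = np.random.default_rng(0)

    # Latent variable model with known marginal: z ~ N(0,1), x|z ~ N(z,1)  =>  x ~ N(0,2).
    def log_joint(x, z):
        return -0.5 * (z ** 2 + (x - z) ** 2) - np.log(2 * np.pi)

    def log_q(z):
        # Deliberately crude proposal q(z|x) = N(0,1) (the prior), so IWAE_k is visibly biased.
        return -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)

    def iwae_running(x, n):
        # IWAE_k = log((1/k) * sum_{i<=k} w_i) for k = 1..n, sharing one stream of samples.
        z = rng.standard_normal(n)
        logw = log_joint(x, z) - log_q(z)
        return np.logaddexp.accumulate(logw) - np.log(np.arange(1, n + 1))

    def sumo(x, cap=10_000):
        # One draw of a randomized-truncation (Russian roulette) estimator of log p(x).
        # K = floor(1 / U) gives P(K >= k) = 1/k; the cap only avoids huge arrays and adds
        # a negligible truncation bias for this demo.
        K = min(int(1.0 / (1.0 - rng.random())), cap)
        iwae = iwae_running(x, K + 1)                # IWAE_1 .. IWAE_{K+1}
        deltas = np.diff(iwae)                       # Delta_k = IWAE_{k+1} - IWAE_k, k = 1..K
        inv_tail = np.arange(1, K + 1)               # 1 / P(K >= k) = k
        return iwae[0] + np.dot(deltas, inv_tail)

    x = 2.0
    true_logp = -0.25 * x ** 2 - 0.5 * np.log(4 * np.pi)   # log N(x; 0, 2)
    # The SUMO average is unbiased but can be noisy (its variance is not guaranteed finite);
    # the IWAE_5 average is a biased lower bound.
    sumo_avg = np.mean([sumo(x) for _ in range(100_000)])
    iwae5_avg = np.mean([iwae_running(x, 5)[-1] for _ in range(100_000)])
    print(f"true log p(x) = {true_logp:.3f}, SUMO avg = {sumo_avg:.3f}, IWAE_5 avg = {iwae5_avg:.3f}")

On a run like this, the SUMO-style average should hover near the true log p(x) up to heavy-tailed Monte Carlo noise, while the IWAE_5 average sits systematically below it; matching the comparison by expected compute, as in Table 1 below, would amount to choosing the truncation distribution so that the expected number of computed terms equals k.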
Conclusion
  • The authors introduced SUMO, a new unbiased estimator of the log probability for latent variable models, and demonstrated tasks for which this estimator performs better than standard lower bounds.
  • The authors investigated applications involving entropy maximization where a lower bound performs poorly, but the unbiased estimator trains properly with relatively little compute.
  • The authors plan to investigate new families of gradient-based optimizers which can handle heavy-tailed stochastic gradients.
  • It may be fruitful to investigate the use of convex combinations of estimators within the SUMO approach, since any convex combination of unbiased estimators is itself unbiased, or to apply variance-reduction methods to stabilize training with SUMO.
Tables
  • Table 1: Test negative log-likelihood of the trained model, estimated using IWAE (k = 5000). For SUMO, k refers to the expected number of computed terms.
Related work
  • There is a long history in Bayesian statistics of marginal likelihood estimation in the service of model selection. The harmonic mean estimator (Newton & Raftery, 1994), for example, has a long (and notorious) history as a consistent estimator of the marginal likelihood that may have infinite variance (Murray & Salakhutdinov, 2009) and exhibits simulation pseudo-bias (Lenk, 2009). The Chib estimator (Chib, 1995), the Laplace approximation, and nested sampling (Skilling, 2006) are alternative proposals that can often have better properties (Murray & Salakhutdinov, 2009). Annealed importance sampling (Neal, 2001) probably represents the gold standard for marginal likelihood estimation. All of these, however, are at best consistent estimators when used to estimate the log marginal probability (Rainforth et al., 2018a). Bias removal schemes such as jackknife variational inference (Nowozin, 2018) have been proposed to debias log-evidence estimation, IWAE in particular. Hierarchical IWAE (Huang et al., 2019) uses a joint proposal to induce negative correlation among samples and connects the convergence of the variance of the estimator with the convergence of the lower bound.
Funding
  • This work was partially funded by NSF IIS-1421780
  • Y.L. and J.Z. were supported by the NSF China Project (No. 61620106010), Beijing NSF Project (No. L172037), the JP Morgan Faculty Research Program and the NVIDIA NVAIL Program with GPU/DGX Acceleration.
Reference
  • James Arvo and David Kirk. Particle transport and image synthesis. ACM SIGGRAPH Computer Graphics, 24(4):63–66, 1990.
  • Robert Bamler, Cheng Zhang, Manfred Opper, and Stephan Mandt. Perturbative black box variational inference. In Advances in Neural Information Processing Systems, 2017.
  • Alex Beatson and Ryan P. Adams. Efficient optimization of loops and limits with randomized telescoping sums. In International Conference on Machine Learning, 2019.
  • Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.
  • David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
  • Endre Boros and Peter L Hammer. Pseudo-Boolean optimization. Discrete Applied Mathematics, 123(1-3):155–225, 2002.
  • Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.
  • Ricky T. Q. Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, 2019.
  • Siddhartha Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321, 1995.
  • Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.
  • Justin Domke and Daniel R Sheldon. Importance weighting and variational inference. In Advances in Neural Information Processing Systems, pp. 4470–4479, 2018.
  • Paul Fearnhead, Omiros Papaspiliopoulos, and Gareth O Roberts. Particle filters for partially observed diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):755–777, 2008.
  • George E Forsythe and Richard A Leibler. Matrix inversion by a Monte Carlo method. Mathematics of Computation, 4(31):127–129, 1950.
  • Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. In International Conference on Machine Learning, pp. 1846–1855, 2018.
  • Insu Han, Haim Avron, and Jinwoo Shin. Stochastic Chebyshev gradient descent for spectral optimization. In Advances in Neural Information Processing Systems, pp. 7386–7396, 2018.
  • Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
  • Emiel Hoogeboom, Jorn W. T. Peters, Rianne van den Berg, and Max Welling. Integer discrete flows and lossless compression. arXiv preprint arXiv:1905.07376, 2019.
  • Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, 2018.
  • Chin-Wei Huang, Kris Sankaran, Eeshan Dhekane, Alexandre Lacoste, and Aaron Courville. Hierarchical importance weighted autoencoders. In International Conference on Machine Learning, pp. 2869–2878, 2019.
  • Pierre E Jacob and Alexandre H Thiery. On nonnegative unbiased estimators. The Annals of Statistics, 43(2):769–784, 2015.
  • Pierre E Jacob, John O'Leary, and Yves F Atchadé. Unbiased Markov chain Monte Carlo with couplings. arXiv preprint arXiv:1708.03625, 2017.
  • Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pp. 2946–2954, 2016.
  • Herman Kahn. Use of different Monte Carlo sampling techniques. Santa Monica, CA: RAND Corporation, 1955. URL https://www.rand.org/pubs/papers/P766.html.
  • Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.
  • Julius Kuti. Stochastic method for the numerical study of lattice fermions. Physical Review Letters, 49(3):183, 1982.
  • Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Peter Lenk. Simulation pseudo-bias correction to the harmonic mean estimator of integrated likelihoods. Journal of Computational and Graphical Statistics, 18(4):941–960, 2009.
  • Yucen Luo, Tian Tian, Jiaxin Shi, Jun Zhu, and Bo Zhang. Semi-crowdsourced clustering with deep generative models. In Advances in Neural Information Processing Systems, pp. 3212–3222, 2018.
  • Anne-Marie Lyne, Mark Girolami, Yves Atchadé, Heiko Strathmann, Daniel Simpson, et al. On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science, 30(4):443–467, 2015.
  • Don McLeish. A general method for debiasing a Monte Carlo estimator. Monte Carlo Methods and Applications, 17(4):301–315, 2011.
  • Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann Publishers Inc., 2001.
  • Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
  • Iain Murray and Ruslan Salakhutdinov. Evaluating probabilities under high-dimensional latent variable models. In Advances in Neural Information Processing Systems 21, pp. 1137–1144, 2009.
  • Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
  • Radford M Neal. Slice sampling. The Annals of Statistics, 31(3):705–767, 2003.
  • Michael A Newton and Adrian E Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society: Series B (Methodological), 56(1):3–26, 1994.
  • Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems, pp. 1723–1731, 2016.
  • Sebastian Nowozin. Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference. In International Conference on Learning Representations, 2018.
  • Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, 2018.
  • Tom Rainforth, Robert Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood. On nesting Monte Carlo estimators. In International Conference on Machine Learning, 2018a.
  • Tom Rainforth, Adam R Kosiorek, Tuan Anh Le, Chris J Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. In International Conference on Machine Learning, 2018b.
  • Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
  • Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.
  • Chang-han Rhee and Peter W Glynn. A new approach to unbiased estimation for SDEs. In Proceedings of the Winter Simulation Conference, 2012.
  • Chang-han Rhee and Peter W Glynn. Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026–1043, 2015.
  • Francisco J. R. Ruiz, Michalis K Titsias, and David M Blei. Overdispersed black-box variational inference. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016.
  • Tomasz Rychlik. Unbiased nonparametric estimation of the derivative of the mean. Statistics & Probability Letters, 10(4):329–333, 1990.
  • Tomasz Rychlik. A class of unbiased kernel estimates of a probability density function. Applicationes Mathematicae, 22(4):485–497, 1995.
  • John Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–859, 2006.
  • Jerome Spanier and Ely M Gelbard. Monte Carlo Principles and Neutron Transport Problems. Addison-Wesley Publishing Company, 1969.
Appendix notes
  • The series converges absolutely, which is a sufficient condition for the Russian roulette estimator to have finite expectation (Chen et al. (2019), Lemma 3).
  • The analysis follows JVI (Nowozin, 2018), applying the delta method for moments to obtain the asymptotic bias and variance of the estimator; writing C_k = Y_k − μ for the zero-mean random variable, Nowozin (2018) gives the required moment relations (a generic sketch of the resulting expansion follows this list).
  • E[∇θ SUMO(x)] = ∇θ E[SUMO(x)] = ∇θ log pθ(x) follows from the dominated convergence theorem, as long as SUMO is everywhere differentiable, which is satisfied in all of the experiments. If ReLU networks are used, the same property may be shown using Theorem 5 of Bińkowski et al. (2018), assuming finite higher moments and Lipschitz continuity.
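
For orientation, the delta-method expansion referenced above gives the familiar leading-order behaviour of the k-sample importance-weighted bound (a generic sketch, not the paper's full derivation):

    \[
    \mathrm{IWAE}_k(x) = \log\Big(\tfrac{1}{k}\sum_{i=1}^{k} w_i\Big),
    \qquad w_i = \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}, \quad z_i \sim q_\phi(z \mid x),
    \]
    \[
    \mathbb{E}\big[\mathrm{IWAE}_k(x)\big]
    = \log p_\theta(x) - \frac{\operatorname{Var}[w]}{2k\,\mathbb{E}[w]^2} + O(k^{-2}),
    \]
    so the bias of the importance-weighted bound vanishes at rate O(1/k), which is what keeps the telescoping differences used by the randomized-truncation estimator summable.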