Markovian Score Climbing: Variational Inference with KL(p||q)

NeurIPS 2020.


Abstract:

Modern variational inference (VI) uses stochastic gradients to avoid intractable expectations, enabling large-scale probabilistic inference in complex models. VI posits a family of approximating distributions $q$ and then finds the member of that family that is closest to the exact posterior $p$. Traditionally, VI algorithms minimize the exclusive Kullback-Leibler (KL) divergence KL(q ‖ p).

Introduction
  • Variational inference (VI) is an optimization-based approach for approximate posterior inference.
  • It posits a family of approximating distributions q and finds the member of that family that is closest to the exact posterior p.
  • In Bayesian inference the main concern is computing the posterior distribution p(z | x), the conditional distribution of the latent variables given the observed data.
  • For most models of interest, exactly computing the posterior is intractable, and the authors must approximate it.
  • VI minimizes a metric or divergence so that the variational approximation is close to the posterior, i.e. so that q(z ; λ) ≈ p(z | x).
Highlights
  • Variational inference (VI) is an optimization-based approach for approximate posterior inference
  • We develop Markovian score climbing (MSC), a simple algorithm for reliably minimizing the inclusive KL divergence KL(p ‖ q).
  • When using gradient descent to optimize the inclusive KL, we must compute an expectation of the score function s(z ; λ) (Eq. 5) with respect to the true posterior. To avoid this intractable expectation, we propose stochastic gradients estimated using samples generated from a Markov chain Monte Carlo (MCMC) algorithm with the posterior as its stationary distribution.
  • We propose to use Markovian score climbing based on the Fisher identity of the gradient: g_ML(θ) = ∇_θ log p(x ; θ) = ∫ p(z | x ; θ) ∇_θ log p(z, x ; θ) dz.
  • In this paper we have argued, and illustrated numerically, that such underestimation of uncertainty can still be an issue, if the optimization is based on biased gradient estimates, as is the case for previously proposed VI algorithms
  • We introduced Markovian score climbing, a new way to reliably learn a variational approximation that minimizes the inclusive KL divergence.
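The recipe in the highlights — one MCMC step with the posterior as stationary distribution, then a score-function update of λ — can be sketched in a few lines. This is a toy illustration only: a 1-D Gaussian target and a random-walk Metropolis kernel stand in for the paper's conditional importance sampling kernel, and all constants are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "posterior": N(2, 1.5^2); log density up to an additive constant.
def log_p(z):
    return -0.5 * ((z - 2.0) / 1.5) ** 2

mu, log_sigma = 0.0, 0.0   # variational parameters lambda = (mu, log sigma)
z = 0.0                    # current state of the Markov chain

for k in range(1, 50001):
    # One step of a Markov kernel leaving p invariant
    # (random-walk Metropolis here; the paper uses CIS/CSMC kernels).
    z_prop = z + rng.normal(0.0, 1.0)
    if np.log(rng.uniform()) < log_p(z_prop) - log_p(z):
        z = z_prop
    # Score of q(z; lambda) = N(z; mu, sigma^2) evaluated at the chain state.
    sigma = np.exp(log_sigma)
    g_mu = (z - mu) / sigma**2
    g_log_sigma = ((z - mu) / sigma) ** 2 - 1.0
    # Robbins-Monro step: climb the score to minimize KL(p || q).
    eps = 1.0 / (100.0 + k)
    mu += eps * g_mu
    log_sigma += eps * g_log_sigma

# Since the family contains the target, q should recover its moments:
print(round(mu, 2), round(np.exp(log_sigma), 2))  # near 2.0 and 1.5
```

Because the target here is itself Gaussian, the inclusive-KL optimum matches the target's mean and standard deviation exactly, which makes convergence easy to check by eye.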
Methods
  • Method: MSC with a Gaussian approximating distribution q(z ; λ) = N(z ; μ, Σ), where Σ is a diagonal covariance matrix.
  • [Figure omitted: panels compare variational parameters learnt by MSC and IS, e.g. Heart μ3, μ4 and Ionos μ17, μ27.]
  • EP requires more model-specific derivations and can be difficult to implement when the moment matching subproblem can not be solved in closed form.
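For the diagonal-Gaussian family used above, the score function that drives the gradient updates has a closed form. A minimal sketch (the helper name `gaussian_score` and the `log_sigma` parameterization are choices made here, not taken from the paper):

```python
import numpy as np

def gaussian_score(z, mu, log_sigma):
    """Score of q(z; lambda) = N(z; mu, diag(sigma^2)) with respect to
    lambda = (mu, log_sigma), evaluated at the point z."""
    sigma = np.exp(log_sigma)
    g_mu = (z - mu) / sigma**2                   # d/d mu of log q
    g_log_sigma = ((z - mu) / sigma) ** 2 - 1.0  # d/d log_sigma of log q
    return g_mu, g_log_sigma

g_mu, g_ls = gaussian_score(np.array([1.0, 0.0]),
                            np.array([0.0, 0.0]),
                            np.array([0.0, 0.0]))
print(g_mu, g_ls)  # [1. 0.] [0. -1.]
```

Parameterizing by log σ keeps the scale positive without constraints, which is why the second component of the score is ((z − μ)/σ)² − 1 rather than a raw derivative in σ.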
Results
  • Additional results, Bayesian probit regression: the authors compare the posterior uncertainty learnt using MSC and IS.
  • Figure 4 shows the difference in log-standard deviation between the posterior approximations learnt using MSC and IS, i.e. log σ_MSC − log σ_IS.
  • The authors can see that for two of the datasets, Heart and Ionos, MSC on average learns a posterior approximation with higher uncertainty.
  • To be able to exactly compute the perturbed posterior, the authors keep the number of data points small (n = 10).
  • The authors show the true and perturbed posteriors for two randomly generated datasets with m = 2, 5, 9.
  • Bias in χ-divergence variational inference (CHIVI): Figure 6 illustrates the systematic error introduced in the optimal parameters of CHIVI when using biased gradients.
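The systematic error attributed to IS-based gradients stems from self-normalized importance sampling (SNIS) being biased at any finite sample size. A toy check of this effect (densities and sample sizes chosen for illustration, unrelated to the paper's experiments): with target p = N(1, 1) and a mismatched proposal q = N(0, 1), the SNIS estimate of E_p[z] = 1 is pulled toward the proposal for small n.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_w(z):
    # Unnormalized log importance weight log p(z) - log q(z)
    # for target p = N(1, 1) and proposal q = N(0, 1).
    return -0.5 * (z - 1.0) ** 2 + 0.5 * z ** 2

def snis_estimate(n):
    """Self-normalized IS estimate of E_p[z] from n draws of q."""
    z = rng.normal(0.0, 1.0, size=n)
    w = np.exp(log_w(z))
    return np.sum(w * z) / np.sum(w)

# Average many replications: the small-n estimator sits systematically
# below the true value E_p[z] = 1, and the bias shrinks as n grows.
ests = {n: np.mean([snis_estimate(n) for _ in range(2000)])
        for n in (10, 100, 10000)}
print({n: round(v, 2) for n, v in ests.items()})
```

Averaging over replications isolates the bias from Monte Carlo noise; a gradient estimator built from such SNIS averages inherits exactly this finite-sample bias, which an MCMC-based estimator avoids asymptotically.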
Conclusion
  • The properties of the approximation q to the posterior p depend on the choice of divergence that is minimized.
  • The most common choice is the exclusive KL divergence KL(q ‖ p), which is computationally convenient but known to suffer from underestimation of the posterior uncertainty.
  • The authors introduced Markovian score climbing, a new way to reliably learn a variational approximation that minimizes the inclusive KL divergence.
  • This results in a method that melds VI and MCMC.
  • The authors have illustrated its convergence properties on a simple toy example, and studied its performance on Bayesian probit regression for classification as well as a stochastic volatility model for financial data.
Tables
  • Table 1: Test error for Bayesian probit regression; lower is better. Estimated using EP (Minka, 2001), IS (cf. Bornschein and Bengio (2015)), and MSC (this paper) for 3 UCI datasets. Predictive performance is comparable between the methods.
Related work
  • Much recent effort in VI has focused on optimizing cost functions other than the exclusive KL divergence. Li and Turner (2016) and Dieng et al. (2017) study Rényi divergences and χ divergences, respectively. The most similar to our work are the methods by Bornschein and Bengio (2015), Gu et al. (2015), and Finke and Thiery (2019), which use IS or SMC to optimize the inclusive KL divergence. The RWS algorithm by Bornschein and Bengio (2015) uses IS both to optimize model parameters and the variational approximation. Neural adaptive SMC by Gu et al. (2015) jointly learns an approximation to the posterior and optimizes the marginal likelihood of time series with gradients estimated by SMC. Finke and Thiery (2019) draw connections between importance weighted autoencoders (Burda et al., 2016), adaptive IS, and methods like RWS. These three works all rely on IS or SMC to estimate expectations with respect to the posterior. This introduces a systematic bias in the gradients that leads to a solution which is not a local optimum of the inclusive KL divergence.
Study subjects and analysis
  • UCI datasets: 3.
  • Figure caption: MSC converges to the true solution, while the biased IS approach does not. Example of learnt variational parameters for IS- and MSC-based gradients of the inclusive KL, together with the true parameters, using a Gaussian approximation to a skew normal posterior distribution. Iterations in log-scale.

Reference
  • C. Andrieu and M. Vihola. Markovian stochastic approximation with expanding projections. Bernoulli, 20(2), 2014.
  • C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
  • E. Angelino, M. J. Johnson, R. P. Adams, et al. Patterns of scalable Bayesian inference. Foundations and Trends in Machine Learning, 9(2-3):119–247, 2016.
  • R. Bardenet, A. Doucet, and C. Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18(47):1–43, 2017.
  • A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations, volume 22. Springer Science & Business Media, 1990.
  • D. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • J. Bornschein and Y. Bengio. Reweighted wake-sleep. In International Conference on Learning Representations, 2015.
  • Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.
  • S. Chib, Y. Omori, and M. Asai. Multivariate Stochastic Volatility, pages 365–400. Springer Berlin Heidelberg, 2009.
  • A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. Blei. Variational inference via chi upper bound minimization. In Advances in Neural Information Processing Systems 30, pages 2732–2741. Curran Associates, Inc., 2017.
  • J. Domke and D. R. Sheldon. Divide and couple: Using Monte Carlo variational objectives for posterior approximation. In Advances in Neural Information Processing Systems, pages 338–347, 2019.
  • D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • A. Finke and A. H. Thiery. On importance-weighted autoencoders. arXiv:1907.10477, 2019.
  • Z. Ghahramani and M. J. Beal. Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems, pages 507–513, 2001.
  • M. G. Gu and F. H. Kong. A stochastic approximation algorithm with Markov chain Monte-Carlo method for incomplete data estimation problems. Proceedings of the National Academy of Sciences, 95(13):7270–7274, 1998.
  • S. S. Gu, Z. Ghahramani, and R. E. Turner. Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems 28, pages 2629–2637. Curran Associates, Inc., 2015.
  • P. Guarniero, A. M. Johansen, and A. Lee. The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636–1647, 2017.
  • R. Habib and D. Barber. Auxiliary variational MCMC. In International Conference on Learning Representations, 2019.
  • J. Heng, A. N. Bishop, G. Deligiannidis, and A. Doucet. Controlled sequential Monte Carlo. arXiv:1708.08396, 2017.
  • M. Hoffman, P. Sountsov, J. V. Dillon, I. Langmore, D. Tran, and S. Vasudevan. NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv:1903.03704, 2019.
  • M. D. Hoffman. Learning deep latent Gaussian models with Markov chain Monte Carlo. In Proceedings of the 34th International Conference on Machine Learning, pages 1510–1519, 2017.
  • M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
  • D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
  • D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • E. Kuhn and M. Lavielle. Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM: Probability and Statistics, 8:115–131, 2004.
  • H. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.
  • D. Lawson, G. Tucker, C. A. Naesseth, C. Maddison, and Y. W. Teh. Twisted variational sequential Monte Carlo. Third Workshop on Bayesian Deep Learning (NeurIPS), 2018.
  • T. A. Le, M. Igl, T. Rainforth, T. Jin, and F. Wood. Auto-encoding sequential Monte Carlo. In International Conference on Learning Representations, 2018.
  • Y. Li and R. E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems 29, pages 1073–1081. Curran Associates, Inc., 2016.
  • F. Lindsten, M. I. Jordan, and T. B. Schön. Particle Gibbs with ancestor sampling. Journal of Machine Learning Research, 15(1):2145–2184, 2014.
  • F. Lindsten, J. Helske, and M. Vihola. Graphical model inference: Sequential Monte Carlo meets deterministic approximations. In Advances in Neural Information Processing Systems 31, pages 8201–8211. Curran Associates, Inc., 2018.
  • C. J. Maddison, D. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet, and Y. W. Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, 2017.
  • T. Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
  • T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
  • A. K. Moretti, Z. Wang, L. Wu, I. Drori, and I. Pe'er. Particle smoothing variational objectives. arXiv:1909.09734, 2019.
  • C. A. Naesseth, F. J. R. Ruiz, S. W. Linderman, and D. Blei. Reparameterization gradients through acceptance-rejection sampling algorithms. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
  • C. A. Naesseth, S. Linderman, R. Ranganath, and D. Blei. Variational sequential Monte Carlo. In International Conference on Artificial Intelligence and Statistics, volume 84, pages 968–977. PMLR, 2018.
  • C. A. Naesseth, F. Lindsten, and T. B. Schön. Elements of sequential Monte Carlo. Foundations and Trends in Machine Learning, 12(3):307–392, 2019.
  • A. B. Owen. Monte Carlo Theory, Methods and Examples. 2013.
  • J. W. Paisley, D. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012.
  • R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.
  • D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
  • H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, 2004.
  • F. J. R. Ruiz and M. K. Titsias. A contrastive divergence for combining variational inference and MCMC. In Proceedings of the 36th International Conference on Machine Learning, pages 5537–5545, 2019.
  • F. J. R. Ruiz, M. K. Titsias, and D. Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, 2016.
  • T. Salimans and D. A. Knowles. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.
  • T. Salimans, D. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226, 2015.
  • R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, A. T. Cemgil, and S. Chiappa, editors, Bayesian Time Series Models, chapter 5, pages 109–130. Cambridge University Press, 2011.
Appendix notes
  • Just like CIS is a straightforward modification of IS, CSMC is a straightforward modification of SMC. The authors use CSMC with ancestor sampling, as proposed by Lindsten et al. (2014), combined with twisted SMC (Guarniero et al., 2017; Heng et al., 2017; Naesseth et al., 2019). While SMC can be adapted to perform inference in almost any probabilistic model (Naesseth et al., 2019), the paper focuses on the state space model.
  • In the ancestor-sampling step, zt[k − 1] denotes the corresponding element of the conditional trajectory z1:T from the previous iteration (Lindsten et al., 2014).
  • The convergence result is an adaptation of Gu and Kong (1998, Theorem 1), based on Benveniste et al. (1990, Theorem 3.17). Following Gu and Kong (1998), it assumes that the Markov kernels Mλ are sufficiently regular on a compact subset Q of Λ, for a sufficiently large real number q > 1, and that the step-size sequence satisfies the standard stochastic approximation conditions.