# Markovian Score Climbing: Variational Inference with KL(p||q)

NeurIPS 2020.

Abstract:

Modern variational inference (VI) uses stochastic gradients to avoid intractable expectations, enabling large-scale probabilistic inference in complex models. VI posits a family of approximating distributions $q$ and then finds the member of that family that is closest to the exact posterior $p$. Traditionally, VI algorithms minimize the exclusive Kullback–Leibler divergence KL(q‖p)…

Introduction

- Variational inference (VI) is an optimization-based approach for approximate posterior inference.
- It posits a family of approximating distributions q and finds the member of that family that is closest to the exact posterior p.
- In Bayesian inference the main concern is computing the posterior distribution p(z | x), the conditional distribution of the latent variables given the observed data.
- For most models of interest, exactly computing the posterior is intractable, and the authors must approximate it.
- The goal is to minimize a divergence so that the variational approximation is close to the posterior, i.e. so that q(z ; λ) ≈ p(z | x).
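The inclusive KL objective and its gradient can be written out explicitly; the following restates, in standard notation, the divergence and the score-function expectation that the rest of this summary refers to:

```latex
\mathrm{KL}(p \,\|\, q)
  = \mathbb{E}_{p(z \mid x)}\!\left[\log \frac{p(z \mid x)}{q(z ; \lambda)}\right],
\qquad
\nabla_\lambda \, \mathrm{KL}(p \,\|\, q)
  = -\,\mathbb{E}_{p(z \mid x)}\!\left[s(z ; \lambda)\right],
\quad
s(z ; \lambda) := \nabla_\lambda \log q(z ; \lambda).
```

Only the second expectation is needed for optimization, and it is exactly the intractable expectation over the posterior that MSC estimates with MCMC samples.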

Highlights

- Variational inference (VI) is an optimization-based approach for approximate posterior inference
- We develop Markovian score climbing (MSC), a simple algorithm for reliably minimizing the inclusive KL
- When using gradient descent to optimize the inclusive KL, we must compute an expectation of the score function s(z ; λ) (eq. (5)) with respect to the true posterior. To avoid this intractable expectation, we propose stochastic gradients estimated using samples generated from a Markov chain Monte Carlo (MCMC) algorithm with the posterior as its stationary distribution.
- We propose to use Markovian score climbing based on the Fisher identity for the gradient: g_ML(θ) = ∇_θ log p(x ; θ) = ∫ ∇_θ log p(z, x ; θ) p(z | x ; θ) dz.
- In this paper we have argued, and illustrated numerically, that such underestimation of uncertainty can still be an issue, if the optimization is based on biased gradient estimates, as is the case for previously proposed VI algorithms
- We introduced Markovian score climbing, a new way to reliably learn a variational approximation that minimizes the inclusive KL
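As a hedged sketch (not the paper's exact experiments), the MSC update amounts to: take one MCMC step whose stationary distribution is the posterior, then take a stochastic ascent step along the score of q evaluated at the current chain state. The Gaussian target and random-walk Metropolis kernel below are illustrative stand-ins; the paper's kernel is conditional importance sampling.

```python
import numpy as np

# Illustrative MSC sketch: target "posterior" p(z) = N(2, 1.5^2) and a Gaussian
# variational family q(z; lam) = N(mu, sigma^2), with lam = (mu, log_sigma).
rng = np.random.default_rng(0)

def log_p(z):
    # Unnormalized log density of N(2, 1.5^2).
    return -0.5 * (z - 2.0) ** 2 / 1.5 ** 2

def score(z, mu, log_sigma):
    # Score s(z; lam) = grad_lam log q(z; lam).
    sigma = np.exp(log_sigma)
    return np.array([(z - mu) / sigma**2,              # d/d mu
                     (z - mu) ** 2 / sigma**2 - 1.0])  # d/d log_sigma

mu, log_sigma, z = 0.0, 0.0, 0.0
for k in range(1, 20001):
    # One random-walk Metropolis step; its stationary distribution is p.
    prop = z + rng.normal()
    if np.log(rng.uniform()) < log_p(prop) - log_p(z):
        z = prop
    # Stochastic ascent on E_p[log q], i.e. descent on KL(p || q).
    g = np.clip(score(z, mu, log_sigma), -10.0, 10.0)  # clip early large steps
    step = 0.5 / k**0.7
    mu += step * g[0]
    log_sigma += step * g[1]
```

With a Robbins–Monro step-size sequence, (mu, exp(log_sigma)) drifts toward the inclusive-KL optimum, here the true (2, 1.5).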

Methods

**Method MSC**

The variational family is a Gaussian, q(z ; λ) = N(z ; μ, Σ), where Σ is a diagonal covariance matrix. (The accompanying figure panels compare MSC and IS for parameters μ3 and μ4 on Heart, and μ17 and μ27 on Ionos.)
- EP requires more model-specific derivations and can be difficult to implement when the moment-matching subproblem cannot be solved in closed form.

Results

**Additional Results Bayesian Probit Regression**

- The authors compare the posterior uncertainty learnt using MSC and IS.
- Figure 4 shows the difference in log-standard deviation between the posterior approximations learnt using MSC and IS, i.e. log σ_MSC − log σ_IS.
- For two of the datasets, Heart and Ionos, MSC on average learns a posterior approximation with higher uncertainty.
- To be able to compute the perturbed posterior exactly, the authors keep the number of data points small (n = 10).
- The authors show the true and perturbed posteriors for two randomly generated datasets with m = 2, 5, 9.
- Bias in χ-divergence variational inference (CHIVI): Figure 6 illustrates the systematic error introduced in the optimal parameters of CHIVI when using biased gradients.
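For concreteness, here is a minimal sketch of the Bayesian probit regression model underlying these experiments, under standard assumptions (a N(0, I) prior on the weights); the tiny synthetic dataset mirrors the n = 10 setting but is otherwise invented for illustration:

```python
import math
import numpy as np

# Probit model: p(y_i = 1 | x_i, z) = Phi(x_i^T z), with prior z ~ N(0, I).
rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3))                  # n = 10 points, 3 features
z_true = np.array([1.0, -2.0, 0.5])           # hypothetical generating weights
y = (X @ z_true + rng.normal(size=10) > 0).astype(float)

def Phi(t):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def log_joint(z):
    # log p(z, y | X) = log prior + log likelihood (up to additive constants).
    log_prior = -0.5 * float(z @ z)
    p1 = np.array([Phi(t) for t in X @ z])
    p1 = np.clip(p1, 1e-12, 1.0 - 1e-12)      # guard against log(0)
    return log_prior + float(np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1)))
```

MSC (or any MCMC-based method) then targets the posterior p(z | y, X) ∝ exp(log_joint(z)), approximated with a diagonal-covariance Gaussian q.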

Conclusion

- The properties of the approximation q to the posterior p depend on the choice of divergence that is minimized.
- The most common choice is the exclusive KL divergence KL(q‖p), which is computationally convenient but known to underestimate the posterior uncertainty.
- The authors introduced Markovian score climbing, a new way to reliably learn a variational approximation that minimizes the inclusive KL
- This results in a method that melds VI and MCMC.
- The authors have illustrated its convergence properties on a simple toy example, and studied its performance on Bayesian probit regression for classification as well as a stochastic volatility model for financial data

- Table 1: Test error for Bayesian probit regression; lower is better. Estimated using EP (Minka, 2001), IS (cf. Bornschein and Bengio, 2015), and MSC (this paper) for 3 UCI datasets. Predictive performance is comparable between the methods.

Related work

- Much recent effort in VI has focused on optimizing cost functions other than the exclusive KL divergence. Li and Turner (2016) and Dieng et al. (2017) study Rényi divergences and χ divergences, respectively. Most similar to our work are the methods of Bornschein and Bengio (2015), Gu et al. (2015), and Finke and Thiery (2019), which use IS or SMC to optimize the inclusive KL divergence. The RWS algorithm of Bornschein and Bengio (2015) uses IS both to optimize model parameters and the variational approximation. Neural adaptive SMC (Gu et al., 2015) jointly learns an approximation to the posterior and optimizes the marginal likelihood of time series with gradients estimated by SMC. Finke and Thiery (2019) draw connections between importance weighted autoencoders (Burda et al., 2016), adaptive IS, and methods like RWS. All three rely on IS or SMC to estimate expectations with respect to the posterior. This introduces a systematic bias in the gradients, which leads to a solution that is not a local optimum of the inclusive KL divergence.
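The bias these IS-based methods inherit can be seen in a toy computation (an illustration under assumed toy distributions, not the paper's experiment): with a small number of samples, a self-normalized importance sampling estimate of a posterior expectation is systematically off, so gradients built from such estimates are biased too.

```python
import numpy as np

# Self-normalized IS estimate of E_p[z^2] = 1 for p = N(0, 1), using proposal
# q = N(0, 2^2) and only N = 5 samples per estimate. Averaging many repeats
# exposes a systematic (non-zero-mean) error.
rng = np.random.default_rng(1)
n_repeats, n_samples = 20000, 5
estimates = np.empty(n_repeats)
for r in range(n_repeats):
    z = rng.normal(0.0, 2.0, size=n_samples)
    logw = -0.5 * z**2 - (-0.5 * z**2 / 4.0)  # log p - log q, up to constants
    w = np.exp(logw - logw.max())
    w /= w.sum()                               # self-normalization => bias
    estimates[r] = np.sum(w * z**2)
bias = estimates.mean() - 1.0                  # systematically positive here
```

The bias shrinks as N grows (it is O(1/N)), but for any fixed N the optimization converges to a perturbed solution rather than the inclusive-KL optimum, which is the effect Figures 5 and 6 quantify.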

Study subjects and analysis

UCI datasets: 3

- MSC converges to the true solution, while the biased IS approach does not: the accompanying example shows learnt variational parameters for IS- and MSC-based gradients of the inclusive KL, together with the true parameters, for a Gaussian approximation to a skew normal posterior distribution (iterations in log scale).

Reference

- C. Andrieu and M. Vihola. Markovian stochastic approximation with expanding projections. Bernoulli, 20(2), Nov. 2014.
- C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3): 269–342, 2010.
- E. Angelino, M. J. Johnson, R. P. Adams, et al. Patterns of scalable Bayesian inference. Foundations and Trends in Machine Learning, 9(2-3):119–247, 2016.
- R. Bardenet, A. Doucet, and C. Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18(47):1–43, 2017.
- A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 1990.
- D. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
- J. Bornschein and Y. Bengio. Reweighted wake-sleep. In International Conference on Learning Representations, 2015.
- Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.
- S. Chib, Y. Omori, and M. Asai. Multivariate Stochastic Volatility, pages 365–400. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
- A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. Blei. Variational inference via chi upper bound minimization. In Advances in Neural Information Processing Systems 30, pages 2732–2741. Curran Associates, Inc., 2017.
- J. Domke and D. R. Sheldon. Divide and couple: Using monte carlo variational objectives for posterior approximation. In Advances in Neural Information Processing Systems, pages 338–347, 2019.
- D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- A. Finke and A. H. Thiery. On importance-weighted autoencoders. arXiv:1907.10477, 2019.
- Z. Ghahramani and M. J. Beal. Propagation algorithms for variational Bayesian learning. In Advances in neural information processing systems, pages 507–513, 2001.
- M. G. Gu and F. H. Kong. A stochastic approximation algorithm with Markov chain Monte-Carlo method for incomplete data estimation problems. Proceedings of the National Academy of Sciences, 95(13):7270–7274, 1998.
- S. S. Gu, Z. Ghahramani, and R. E. Turner. Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems 28, pages 2629–2637. Curran Associates, Inc., 2015.
- P. Guarniero, A. M. Johansen, and A. Lee. The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636–1647, 2017.
- R. Habib and D. Barber. Auxiliary variational MCMC. In International Conference on Learning Representations, 2019.
- J. Heng, A. N. Bishop, G. Deligiannidis, and A. Doucet. Controlled sequential Monte Carlo. arXiv:1708.08396, 2017.
- M. Hoffman, P. Sountsov, J. V. Dillon, I. Langmore, D. Tran, and S. Vasudevan. Neutra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv:1903.03704, 2019.
- M. D. Hoffman. Learning deep latent Gaussian models with Markov chain Monte Carlo. In Proceedings of the 34th International Conference on Machine Learning, pages 1510–1519, 2017.
- M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, Nov. 1999.
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
- E. Kuhn and M. Lavielle. Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM: Probability and Statistics, 8:115–131, 2004.
- H. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
- D. Lawson, G. Tucker, C. A. Naesseth, C. Maddison, and Y. W. Teh. Twisted variational sequential Monte Carlo. Third workshop on Bayesian Deep Learning (NeurIPS), 2018.
- T. A. Le, M. Igl, T. Rainforth, T. Jin, and F. Wood. Autoencoding sequential Monte Carlo. In International Conference on Learning Representations, 2018.
- Y. Li and R. E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems 29, pages 1073–1081. Curran Associates, Inc., 2016.
- F. Lindsten, M. I. Jordan, and T. B. Schön. Particle Gibbs with ancestor sampling. The Journal of Machine Learning Research, 15(1):2145–2184, 2014.
- F. Lindsten, J. Helske, and M. Vihola. Graphical model inference: Sequential Monte Carlo meets deterministic approximations. In Advances in Neural Information Processing Systems 31, pages 8201–8211. Curran Associates, Inc., 2018.
- C. J. Maddison, D. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet, and Y. W. Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, 2017.
- T. Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
- T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
- A. K. Moretti, Z. Wang, L. Wu, I. Drori, and I. Pe’er. Particle smoothing variational objectives. arXiv:1909.09734, 2019.
- C. A. Naesseth, F. J. R. Ruiz, S. W. Linderman, and D. Blei. Reparameterization gradients through acceptance-rejection sampling algorithms. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
- C. A. Naesseth, S. Linderman, R. Ranganath, and D. Blei. Variational sequential Monte Carlo. In International Conference on Artificial Intelligence and Statistics, volume 84, pages 968–977. PMLR, 2018.
- C. A. Naesseth, F. Lindsten, and T. B. Schön. Elements of sequential Monte Carlo. Foundations and Trends in Machine Learning, 12(3):307–392, 2019.
- A. B. Owen. Monte Carlo theory, methods and examples. 2013.
- J. W. Paisley, D. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012.
- R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.
- D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
- H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- C. Robert and G. Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2004.
- F. J. R. Ruiz and M. K. Titsias. A contrastive divergence for combining variational inference and MCMC. In Proceedings of the 36th International Conference on Machine Learning, pages 5537–5545, 2019.
- F. J. R. Ruiz, M. K. Titsias, and D. Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, 2016.
- T. Salimans and D. A. Knowles. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.
- T. Salimans, D. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226, 2015.
- R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, A. T. Cemgil, and S. Chiappa, editors, Bayesian time series models, chapter 5, pages 109–130. Cambridge University Press, 2011.
- Just like CIS is a straightforward modification of IS, so is CSMC a straightforward modification of SMC. We make use of CSMC with ancestor sampling as proposed by Lindsten et al. (2014) combined with twisted SMC (Guarniero et al., 2017; Heng et al., 2017; Naesseth et al., 2019). While SMC can be adapted to perform inference for almost any probabilistic model (Naesseth et al., 2019), we here focus on the state space model
- In the ancestor sampling step, the ancestor index of the conditional particle at time t is resampled with probability proportional to the previous-time importance weights times the transition density evaluated at z_t[k − 1], where z_t[k − 1] is the corresponding element of the conditional trajectory z_{1:T} from the previous iteration (Lindsten et al., 2014).
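Hedging on the details, the CIS kernel mentioned above can be sketched as follows: retain the conditional sample from the previous iteration, draw N − 1 fresh proposals from q, self-normalize the importance weights, and resample one particle. This construction leaves the target invariant; the Gaussian target, proposal, and N below are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_p(z):
    # Unnormalized log target, here N(1, 1).
    return -0.5 * (z - 1.0) ** 2

def log_q(z):
    # Log proposal density, N(0, 2^2), up to a shared additive constant.
    return -0.5 * z**2 / 4.0 - np.log(2.0)

def cis_kernel(z_prev, n=8):
    # Conditional importance sampling: keep z_prev as particle 0, draw
    # n - 1 fresh particles from q, resample one by normalized weights.
    z = np.empty(n)
    z[0] = z_prev
    z[1:] = rng.normal(0.0, 2.0, size=n - 1)
    logw = log_p(z) - log_q(z)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return z[rng.choice(n, p=w)]

# Iterating the kernel yields a Markov chain whose stationary distribution
# is the (normalized) target p.
z, chain = 0.0, []
for _ in range(20000):
    z = cis_kernel(z)
    chain.append(z)
```

CSMC with ancestor sampling plays the same role for state space models, replacing the single importance-sampling step by a full particle sweep conditioned on the retained trajectory.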
- The convergence result is an adaptation of Gu and Kong (1998, Theorem 1), building on Benveniste et al. (1990, Theorem 3.17).
- Write M_λ^k(z, dz′) = ∫ ⋯ ∫ M_λ(z, dz_1) M_λ(z_1, dz_2) ⋯ M_λ(z_{k−1}, dz′) for the k-step Markov kernel, and let |z| denote the length of the vector z. Let Q be any compact subset of Λ and q > 1 a sufficiently large real number such that the assumptions of Gu and Kong (1998) hold; in particular, (C1) the step-size sequence satisfies the usual stochastic-approximation conditions.
- With these assumptions, the result follows from Gu and Kong (1998, Theorem 1), translating between their notation and ours.
