VarGrad: A Low-Variance Gradient Estimator for Variational Inference
NeurIPS 2020
We analyse the properties of an unbiased gradient estimator of the ELBO for variational inference, based on the score function method with leave-one-out control variates. We show that this gradient estimator can be obtained using a new loss, defined as the variance of the log-ratio between the exact posterior and the variational approximation.
- Estimating the gradient of the expectation of a function is a problem with applications in many areas of machine learning, ranging from variational inference to reinforcement learning [Mohamed et al, 2019].
- As the expectation in Eq 1 is typically intractable, variational inference uses stochastic optimisation to maximise the ELBO
- It forms unbiased Monte Carlo estimators of the gradient ∇φELBO(φ)
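The score-function idea behind these unbiased Monte Carlo gradient estimators can be sketched in a few lines. The Bernoulli variational family and the toy objective f below are illustrative assumptions, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(f, phi, S=1000):
    """Score-function (REINFORCE) estimate of d/dphi E_{q_phi}[f(z)]
    for q_phi = Bernoulli(sigmoid(phi))."""
    p = 1.0 / (1.0 + np.exp(-phi))           # sigmoid(phi)
    z = (rng.random(S) < p).astype(float)    # S Monte Carlo samples from q_phi
    # Score: d/dphi log q_phi(z) = z - sigmoid(phi) for a Bernoulli with logit phi
    score = z - p
    return np.mean(f(z) * score)

# Toy check: for f(z) = 5*z we have E[f] = 5*sigmoid(phi), so the true
# gradient is 5 * sigmoid(phi) * (1 - sigmoid(phi)).
phi = 0.3
g = reinforce_grad(lambda z: 5.0 * z, phi, S=200_000)
p = 1.0 / (1.0 + np.exp(-phi))
true_g = 5.0 * p * (1.0 - p)
```

The estimator is unbiased for any number of samples S; the catch, which motivates the paper, is its variance.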
- We focus on variational inference (VI), where the goal is to approximate the posterior distribution p(z | x) of a model p(x, z), where x denotes the observations and z refers to the latent variables of the model [Jordan et al, 1999, Blei et al, 2017]
- Variational inference finds the parameters φ∗ that minimise the KL divergence, φ∗ = argminφ∈Φ KL (qφ(z) || p(z | x)). This optimisation problem is intractable because the KL itself depends on the intractable posterior. Variational inference sidesteps this problem by maximising instead the evidence lower bound (ELBO) defined in Eq 1, which is a lower bound on the marginal likelihood, since log p(x) = ELBO(φ) + KL (qφ(z) || p(z | x))
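The decomposition log p(x) = ELBO(φ) + KL(qφ(z) || p(z | x)) can be verified numerically on a toy model with a single binary latent variable; all the probabilities below are made up for illustration:

```python
import numpy as np

# Toy model with one binary latent z and a fixed observation x.
p_z = np.array([0.4, 0.6])            # prior p(z)
p_x_given_z = np.array([0.9, 0.2])    # likelihood p(x | z) for the observed x
q = np.array([0.7, 0.3])              # variational approximation q_phi(z)

p_xz = p_z * p_x_given_z              # joint p(x, z)
log_px = np.log(p_xz.sum())           # log marginal likelihood
post = p_xz / p_xz.sum()              # exact posterior p(z | x)

# ELBO(phi) = E_q[log p(x, z) - log q(z)]
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))
# KL(q || p(z | x))
kl = np.sum(q * (np.log(q) - np.log(post)))
```

Since KL is non-negative, the identity makes explicit that the ELBO lower-bounds log p(x), with equality exactly when q matches the posterior.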
- We review the score function method, a Monte Carlo estimator commonly used in variational inference
- In Section 4.2 we show that a simple relation between δCV and the ELBO is sufficient to guarantee that gVarGrad has lower variance than gReinforce when the number of Monte Carlo samples is large enough
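Concretely, gVarGrad coincides with Reinforce plus a leave-one-out control variate. The sketch below compares the empirical variance of the two estimators on a toy Bernoulli family; the objective f, with its large constant offset, is a hypothetical stand-in for the log-ratio term, not the paper's objective:

```python
import numpy as np

rng = np.random.default_rng(1)

def estimators(phi, S, f):
    """One draw of the plain Reinforce estimate and of the VarGrad-style
    leave-one-out estimate of d/dphi E_{q_phi}[f(z)],
    with q_phi = Bernoulli(sigmoid(phi))."""
    p = 1.0 / (1.0 + np.exp(-phi))
    z = (rng.random(S) < p).astype(float)
    score = z - p                     # d/dphi log q_phi(z)
    fz = f(z)
    g_reinforce = np.mean(fz * score)
    # Leave-one-out control variate: subtracting the batch mean of f and
    # rescaling by S/(S-1) is equivalent to using, for each sample s, the
    # mean of f over the other S-1 samples, so the estimator stays unbiased.
    g_vargrad = np.sum((fz - fz.mean()) * score) / (S - 1)
    return g_reinforce, g_vargrad

# Empirical variance over repeated draws. The constant offset in f inflates
# the Reinforce variance, while the control variate removes it.
f = lambda z: 100.0 + 5.0 * z
draws = np.array([estimators(0.3, S=4, f=f) for _ in range(5000)])
var_reinforce, var_vargrad = draws.var(axis=0)
```

Both estimators target the same gradient; only their variances differ, which is the sense in which the paper's Proposition 3 compares them.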
- We have shown theoretically that, under certain conditions, the VarGrad control variate coefficients are close to the optimal ones
- In order to verify the properties of VarGrad empirically, the authors test it on two popular models: a Bayesian logistic regression model on a synthetic dataset and a discrete variational autoencoder (DVAE) [Salakhutdinov and Murray, 2008, Kingma and Welling, 2014] on a fixed binarisation of Omniglot [Lake et al, 2015].
- In Section 4 the authors showed analytically that VarGrad is close to the optimal control variate, and in particular that the ratio δiCV / Eqφ[aVarGrad] can be small over the whole optimisation procedure.
- This behaviour is expected to be even more pronounced with growing dimensionality of the latent space.
- The authors study the properties of gVarGrad in comparison to other estimators based on the score function method.
- In Section 4.1, the authors analyse the difference δCV between the control variate coefficient of VarGrad and the optimal one.
- The former can be approximated cheaply and unbiasedly, while a standard Monte Carlo estimator of the latter is biased and often exhibits high variance.
- The coefficients of the optimal control variate a∗ are given by a proportionality relation
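For a score-function estimator with a scalar control variate, g = (f(z) − a) · ∇φ log qφ(z), the variance-minimising coefficient has the closed form a∗ = E[f·s²] / E[s²] (using E[s] = 0 for the score s). The sketch below checks a Monte Carlo estimate of this formula against its analytic value on a toy Bernoulli family; the numbers are illustrative, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(2)

# Variational family: q_phi = Bernoulli(sigmoid(phi)); score s = z - sigmoid(phi).
phi = 0.3
p = 1.0 / (1.0 + np.exp(-phi))

def f(z):
    return 100.0 + 5.0 * z            # toy objective (illustrative only)

z = (rng.random(1_000_000) < p).astype(float)
s = z - p
# Monte Carlo estimate of a* = E[f s^2] / E[s^2]
a_opt = np.mean(f(z) * s**2) / np.mean(s**2)

# Analytic value for this toy family:
# E[f s^2] = p(1-p) * (105 - 5p) and E[s^2] = p(1-p), so a* = 105 - 5p.
a_true = 105.0 - 5.0 * p
```

Note that this plug-in estimate reuses the same samples in numerator and denominator, which is why, as remarked above, standard Monte Carlo estimators of the optimal coefficient are biased and can be high-variance in practice.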
- The authors have analysed the VarGrad estimator, an estimator of the gradient of the KL that is based on Reinforce with leave-one-out control variates, which was first introduced by Salimans and Knowles [2014] and Kool et al. [2019].
- The authors have established the connection between VarGrad and a novel divergence, which the authors call the log-variance loss.
- The authors have established the conditions that guarantee that VarGrad exhibits lower variance than Reinforce.
- The authors leave it for future work to explore the direct optimisation of the log-variance loss for alternative choices of the reference distribution r(z)
- In the last few years, many gradient estimators of the ELBO have been proposed; see Mohamed et al. [2019] for a comprehensive review. Among those, the score function estimators [Williams, 1992, Carbonetto et al, 2009, Paisley et al, 2012, Ranganath et al, 2014] and the reparameterisation estimators [Kingma and Welling, 2014, Rezende et al, 2014, Titsias and Lázaro-Gredilla, 2014], as well as combinations of both [Ruiz et al, 2016, Naesseth et al, 2017], are arguably the most widely used. NVIL [Mnih and Gregor, 2014] and MuProp [Gu et al, 2016] are unbiased gradient estimators for training stochastic neural networks.
Other gradient estimators are specific to discrete-valued latent variables. The concrete relaxation [Maddison et al, 2017, Jang et al, 2017] describes a way to form a biased estimator of the gradient, which REBAR [Tucker et al, 2017] and RELAX [Grathwohl et al, 2018] use as a control variate to obtain an unbiased estimator. Other recent estimators have been proposed by Lee et al. [2018], Peters and Welling [2018], Shayer et al. [2018], Cong et al. [2019], Yin and Zhou [2019], Yin et al. [2019], and Dong et al. [2020]. In Section 6, we compare VarGrad with some of these estimators, showing that it exhibits a favourable performance versus computational complexity trade-off.
- B. are funded by the Lloyds Register Foundation programme on Data Centric Engineering through the London Air Quality project at The Alan Turing Institute for Data Science and AI
- This work was supported under EPSRC grant EP/N510129/1 as well as by Deutsche Forschungsgemeinschaft (DFG) through the grant CRC 1114 ‘Scaling Cascades in Complex Systems’ (projects A02 and A05, project number 235221301)
Study subjects and analysis
See Appendix A.4. If the correction δCV is negligible in the sense of Proposition 2, then the assumption in Eq 19 is satisfied and Proposition 3 guarantees that VarGrad has lower variance than Reinforce when S is large enough. We arrive at the following corollary, which also considers the dimensionality of the latent variables
Figure: the ratio δiCV / E[aVarGrad] associated with the biases of two models with different latent dimensions, trained on Omniglot using VarGrad; the estimates are obtained with 2,000 Monte Carlo samples.
In fact, we observe a small difference between the variance of VarGrad and the variance of an oracle estimator based on Reinforce with access to the optimal control variate coefficient a∗. Figure 3 also shows the variance of the sampled estimator, which is based on Reinforce with an estimate of the optimal control variate; this confirms the difficulty of estimating it in practice. (A similar trend can be observed for the DVAE in the results in Appendix B, where VarGrad is compared to a wider list of estimators from the DVAE literature.) All methods use S = 4 Monte Carlo samples, and the control variate coefficient is estimated with either 2 extra samples (sampled estimator) or 1,000 samples (oracle estimator). Finally, Figure 4 compares VarGrad with other estimators by training a DVAE on Omniglot.
- D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
- J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
- P. Carbonetto, M. King, and F. Hamze. A stochastic approximation method for inference in probabilistic graphical models. In Advances in Neural Information Processing Systems, 2009.
- Y. Cong, M. Zhao, K. Bai, and L. Carin. GO gradient for expectation-based objectives. In International Conference on Learning Representations, 2019.
- T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
- A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. M. Blei. Variational inference via χ-upper bound minimization. In Advances in Neural Information Processing Systems, 2017.
- Z. Dong, A. Mnih, and G. Tucker. DisARM: An antithetic gradient estimator for binary latent variables. In Advances in Neural Information Processing Systems, 2020.
- S. Ghosal, J. K. Ghosh, and A. W. van der Vaart. Convergence rates of posterior distributions. Annals of Statistics, 28(2):500–531, 2000.
- W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.
- S. Gu, S. Levine, I. Sutskever, and A. Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. In International Conference on Machine Learning, 2016.
- T. Hennigan, T. Cai, T. Norman, and I. Babuschkin. Haiku: Sonnet for JAX, 2020. URL http://github.com/deepmind/dm-haiku.
- J. M. Hernández-Lobato, Y. Li, M. Rowland, D. Hernández-Lobato, T. Bui, and R. E. Turner. Black-box α-divergence minimization. In International Conference on Machine Learning, 2016.
- E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.
- M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, Nov. 1999.
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
- D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
- W. Kool, H. van Hoof, and M. Welling. Buy 4 REINFORCE samples, get a baseline for free! In ICLR Workshop on Deep Reinforcement Learning Meets Structured Prediction, 2019.
- W. Kool, H. van Hoof, and M. Welling. Estimating gradients for discrete random variables by sampling without replacement. In International Conference on Learning Representations, 2020.
- B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
- W. Lee, H. Yu, and H. Yang. Reparameterization gradient for non-differentiable models. In Advances in Neural Information Processing Systems, 2018.
- Y. Li and R. E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, 2016.
- C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
- A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.
- A. Mnih and D. J. Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, 2016.
- S. Mohamed, M. Rosca, M. Figurnov, and A. Mnih. Monte Carlo gradient estimation in machine learning. arXiv preprint arXiv:1906.10652, 2019.
- C. Naesseth, F. J. R. Ruiz, S. Linderman, and D. M. Blei. Reparameterization gradients through acceptance-rejection methods. In Artificial Intelligence and Statistics, 2017.
- C. A. Naesseth, F. Lindsten, and D. M. Blei. Markovian score climbing: Variational inference with KL(p||q). arXiv preprint arXiv:2003.10374, 2020.
- N. Nüsken and L. Richter. Solving high-dimensional Hamilton-Jacobi-Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. arXiv preprint arXiv:2005.05409, 2020.
- J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012.
- J. W. T. Peters and M. Welling. Probabilistic binary neural networks. arXiv preprint arXiv:1809.03368, 2018.
- R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.
- R. Ranganath, J. Altosaar, D. Tran, and D. M. Blei. Operator variational inference. In Advances in Neural Information Processing Systems, 2016.
- R.-D. Reiss. Approximate Distributions of Order Statistics: With Applications to Nonparametric Statistics. Springer Science & Business Media, 2012.
- D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
- H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
- F. J. R. Ruiz and M. K. Titsias. A contrastive divergence for combining variational inference and MCMC. In International Conference on Machine Learning, 2019.
- F. J. R. Ruiz, M. K. Titsias, and D. M. Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, 2016.
- R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pages 872–879, 2008.
- T. Salimans and D. A. Knowles. On using control variates with stochastic approximation for variational bayes and its connection to stochastic linear regression. arXiv preprint arXiv:1401.1022, 2014.
- O. Shayer, D. Levi, and E. Fetaya. Learning discrete weights using the local reparameterization trick. In International Conference on Learning Representations, 2018.
- M. K. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.
- G. Tucker, A. Mnih, C. J. Maddison, and J. Sohl-Dickstein. REBAR: low-variance, unbiased gradient estimates for discrete latent variable models. In International Conference on Learning Representations, 2017.
- D. Wang, H. Liu, and Q. Liu. Variational inference with tail-adaptive f -divergence. In Advances in Neural Information Processing Systems, 2018.
- R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
- M. Yin and M. Zhou. ARM: Augment-REINFORCE-merge gradient for stochastic binary networks. In International Conference on Learning Representations, 2019.
- M. Yin, Y. Yue, and M. Zhou. ARSM: Augment-REINFORCE-swap-merge estimator for gradient backpropagation through categorical variables. In International Conference on Machine Learning, 2019.