VarGrad: A Low-Variance Gradient Estimator for Variational Inference

NeurIPS 2020


Abstract

We analyse the properties of an unbiased gradient estimator of the ELBO for variational inference, based on the score function method with leave-one-out control variates. We show that this gradient estimator can be obtained using a new loss, defined as the variance of the log-ratio between the exact posterior and the variational approximation ...
Introduction
  • Estimating the gradient of the expectation of a function is a problem with applications in many areas of machine learning, ranging from variational inference to reinforcement learning [Mohamed et al., 2019].
  • Variational inference finds the parameters $\varphi^*$ that minimise the KL divergence, $\varphi^* = \operatorname{argmin}_{\varphi \in \Phi} \mathrm{KL}\big(q_\varphi(z) \,\|\, p(z \mid x)\big)$.
  • This optimisation problem is intractable because the KL itself depends on the intractable posterior.
  • As the expectation in Eq. 1 is typically intractable, variational inference uses stochastic optimisation to maximise the ELBO.
  • It forms unbiased Monte Carlo estimators of the gradient $\nabla_\varphi \mathrm{ELBO}(\varphi)$; a minimal sketch of such a score-function estimator is given after this list.
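To make the score-function route concrete, here is a minimal JAX sketch of a Reinforce-style estimator of $\nabla_\varphi \mathrm{ELBO}(\varphi)$ for a toy diagonal-Gaussian variational family. The functions log_q and log_joint and the toy target are assumptions made for illustration, not the authors' code or model.

```python
# Minimal sketch (toy setup, not the authors' code): score-function (Reinforce)
# estimator of the ELBO gradient for a diagonal-Gaussian variational family.
import jax
import jax.numpy as jnp

def log_q(phi, z):
    # log q_phi(z) for q_phi = N(mu, diag(exp(log_std)^2)), with phi = (mu, log_std).
    mu, log_std = phi
    return jnp.sum(-0.5 * ((z - mu) / jnp.exp(log_std)) ** 2
                   - log_std - 0.5 * jnp.log(2.0 * jnp.pi))

def log_joint(z):
    # Assumed toy target log p(x, z): a standard normal stands in for the model.
    return jnp.sum(-0.5 * z ** 2 - 0.5 * jnp.log(2.0 * jnp.pi))

def reinforce_grad(phi, key, num_samples=16):
    # Unbiased estimator of grad_phi ELBO(phi):
    #   (1/S) * sum_s grad_phi log q_phi(z_s) * (log p(x, z_s) - log q_phi(z_s)),
    # with z_s ~ q_phi treated as fixed samples (score-function method).
    mu, log_std = phi
    z = mu + jnp.exp(log_std) * jax.random.normal(key, (num_samples, mu.shape[0]))
    f = jax.vmap(lambda zs: log_joint(zs) - log_q(phi, zs))(z)      # shape (S,)
    scores = jax.vmap(lambda zs: jax.grad(log_q)(phi, zs))(z)       # pytree of (S, d)
    return jax.tree_util.tree_map(lambda s: jnp.mean(f[:, None] * s, axis=0), scores)

key = jax.random.PRNGKey(0)
phi = (jnp.zeros(2), jnp.zeros(2))   # mu and log_std for a 2-dimensional latent
grad_mu, grad_log_std = reinforce_grad(phi, key)
```

No reparameterisation gradient is taken here: the samples enter only through the score terms, and the estimator remains unbiased because the score function has zero mean under $q_\varphi$.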
Highlights
  • Estimating the gradient of the expectation of a function is a problem with applications in many areas of machine learning, ranging from variational inference to reinforcement learning [Mohamed et al., 2019].
  • We focus on variational inference (VI), where the goal is to approximate the posterior distribution p(z | x) of a model p(x, z), where x denotes the observations and z refers to the latent variables of the model [Jordan et al., 1999, Blei et al., 2017].
  • Variational inference finds the parameters $\varphi^*$ that minimise the KL divergence, $\varphi^* = \operatorname{argmin}_{\varphi \in \Phi} \mathrm{KL}\big(q_\varphi(z) \,\|\, p(z \mid x)\big)$. This optimisation problem is intractable because the KL itself depends on the intractable posterior. Variational inference sidesteps this problem by instead maximising the evidence lower bound (ELBO) defined in Eq. 1, which is a lower bound on the marginal likelihood, since $\log p(x) = \mathrm{ELBO}(\varphi) + \mathrm{KL}\big(q_\varphi(z) \,\|\, p(z \mid x)\big)$.
  • We review the score function method, a Monte Carlo estimator commonly used in variational inference.
  • In Section 4.2 we show that a simple relation between $\delta^{\mathrm{CV}}$ and the ELBO is sufficient to guarantee that the VarGrad estimator has lower variance than the Reinforce estimator when the number of Monte Carlo samples is large enough.
  • We have shown theoretically that, under certain conditions, the VarGrad control variate coefficients are close to the optimal ones; a minimal sketch of the VarGrad estimator is given after this list.
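The following sketch (same toy setup and the same caveats as the sketch above) illustrates the VarGrad idea as described on this page: differentiating the empirical variance of the log-ratio $w = \log q_\varphi(z) - \log p(x, z)$ over a batch of samples, which coincides with Reinforce plus a leave-one-out control variate, up to the precise normalisation used in the paper.

```python
# Minimal sketch (toy setup, not the authors' code): VarGrad as the gradient of
# the empirical variance of the log-ratio w = log q_phi(z) - log p(x, z).
import jax
import jax.numpy as jnp

def log_q(phi, z):
    mu, log_std = phi
    return jnp.sum(-0.5 * ((z - mu) / jnp.exp(log_std)) ** 2
                   - log_std - 0.5 * jnp.log(2.0 * jnp.pi))

def log_joint(z):
    return jnp.sum(-0.5 * z ** 2 - 0.5 * jnp.log(2.0 * jnp.pi))

def vargrad(phi, key, num_samples=16):
    # Draw z_1, ..., z_S from q_phi; the samples are treated as constants, so
    # gradients flow only through log q_phi (score-function setting).
    mu, log_std = phi
    z = jax.lax.stop_gradient(
        mu + jnp.exp(log_std) * jax.random.normal(key, (num_samples, mu.shape[0])))

    def half_sample_variance(phi_inner):
        w = jax.vmap(lambda zs: log_q(phi_inner, zs) - log_joint(zs))(z)
        return 0.5 * jnp.var(w, ddof=1)   # unbiased sample variance of the log-ratio

    # Differentiating the sample variance gives
    #   (1/(S-1)) * sum_s (w_s - mean(w)) * grad_phi log q_phi(z_s),
    # i.e. Reinforce with a leave-one-out control variate; it estimates the
    # gradient of KL(q_phi || p(z | x)), the negative of the ELBO gradient.
    return jax.grad(half_sample_variance)(phi)

key = jax.random.PRNGKey(0)
phi = (jnp.zeros(2), jnp.zeros(2))
g_mu, g_log_std = vargrad(phi, key)
```

A short calculation shows that this gradient equals $\frac{1}{S}\sum_s \nabla_\varphi \log q_\varphi(z_s)\,(w_s - \bar{w}_{-s})$, where $\bar{w}_{-s}$ is the leave-one-out mean over the other samples, which is the baseline of Kool et al. [2019] referred to in the conclusion.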
Methods
  • In order to verify the properties of VarGrad empirically, the authors test it on two popular models: a Bayesian logistic regression model on a synthetic dataset and a discrete variational autoencoder (DVAE) [Salakhutdinov and Murray, 2008, Kingma and Welling, 2014] on a fixed binarisation of Omniglot [Lake et al, 2015].
  • In Section 4 the authors analytically showed that VarGrad is close to the optimal control variate, and in particular that the ratio $\delta_i^{\mathrm{CV}} / \mathbb{E}_{q_\varphi}[a^{\mathrm{VarGrad}}]$ can be small over the whole optimisation procedure.
  • This behaviour is expected to be even more pronounced with growing dimensionality of the latent space.
Results
  • The authors study the properties of the VarGrad estimator in comparison to other estimators based on the score function method.
  • In Section 4.1, the authors analyse the difference $\delta^{\mathrm{CV}}$ between the control variate coefficient of VarGrad and the optimal one.
  • The former can be approximated cheaply and unbiasedly, while a standard Monte Carlo estimator of the latter is biased and often exhibits high variance.
  • The coefficients of the optimal control variate $a^*$ are given by a proportionality relation; a hedged statement of the standard form of this relation follows this list.
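For context, here is the textbook control-variate form of the optimal coefficient; it is stated under the usual assumptions (single-sample estimator, zero-mean score) and is only a hedged paraphrase of the paper's proportionality relation, not a quotation of it. For a per-coordinate estimator $g_i = \big(f(z) - a_i\big)\,\partial_{\varphi_i}\log q_\varphi(z)$ with $z \sim q_\varphi$, the variance-minimising choice is
$$a_i^{*} = \frac{\mathbb{E}_{q_\varphi}\!\left[f(z)\,\big(\partial_{\varphi_i}\log q_\varphi(z)\big)^{2}\right]}{\mathbb{E}_{q_\varphi}\!\left[\big(\partial_{\varphi_i}\log q_\varphi(z)\big)^{2}\right]},$$
a ratio of two expectations. Plugging plain Monte Carlo estimates into the numerator and denominator gives a ratio of noisy quantities, which is biased and can have high variance; this is the practical difficulty with estimating $a^*$ that the bullet above refers to.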
Conclusion
  • The authors have analysed the VarGrad estimator, an estimator of the gradient of the KL that is based on Reinforce with leave-one-out control variates, which was first introduced by Salimans and Knowles [2014] and Kool et al. [2019].
  • The authors have established the connection between VarGrad and a novel divergence, which they call the log-variance loss.
  • The authors have established the conditions that guarantee that VarGrad exhibits lower variance than Reinforce.
  • The authors leave it for future work to explore the direct optimisation of the log-variance loss for alternative choices of the reference distribution $r(z)$; a hedged statement of this loss is given after this list.
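For reference, a hedged statement of the log-variance loss, following the abstract's description as the variance of the log-ratio between the exact posterior and the variational approximation (the notation $\mathcal{L}_r$ and any constant factors are choices made here, not necessarily the paper's): for a reference distribution $r(z)$,
$$\mathcal{L}_r(\varphi) = \operatorname{Var}_{z \sim r}\!\left[\log \frac{q_\varphi(z)}{p(z \mid x)}\right] = \operatorname{Var}_{z \sim r}\!\left[\log \frac{q_\varphi(z)}{p(x, z)}\right],$$
where the second equality holds because $\log p(x)$ is a constant shift and variances are shift-invariant, so the intractable evidence drops out. VarGrad corresponds to differentiating a Monte Carlo estimate of this loss with samples drawn from $r = q_\varphi$ and held fixed during differentiation, as in the sketch after the Highlights.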
Funding
  • Some of the authors are funded by the Lloyds Register Foundation programme on Data Centric Engineering through the London Air Quality project at The Alan Turing Institute for Data Science and AI
  • This work was supported under EPSRC grant EP/N510129/1 as well as by Deutsche Forschungsgemeinschaft (DFG) through the grant CRC 1114 ‘Scaling Cascades in Complex Systems’ (projects A02 and A05, project number 235221301)
Study subjects and analysis
guarantees: 3
See Appendix A.4. If the correction $\delta^{\mathrm{CV}}$ is negligible in the sense of Proposition 2, then the assumption in Eq. 19 is satisfied and Proposition 3 guarantees that VarGrad has lower variance than Reinforce when $S$ is large enough. We arrive at the following corollary, which also considers the dimensionality of the latent variables.

Monte Carlo samples: 2000
Ratios $\delta_i^{\mathrm{CV}} / \mathbb{E}[a^{\mathrm{VarGrad}}]$ associated with the biases of two models with different latent dimensions trained on Omniglot using VarGrad; the estimates are obtained with 2,000 Monte Carlo samples.

Monte Carlo samples: 4
In fact, we observe a small difference between the variance of VarGrad and the variance of an oracle estimator based on Reinforce with access to the optimal control variate coefficient $a^*$. Figure 3 also shows the variance of the sampled estimator, which is based on Reinforce with an estimate of the optimal control variate; this confirms the difficulty of estimating it in practice. (A similar trend can be observed for the DVAE in the results in Appendix B, where VarGrad is compared to a wider list of estimators from the DVAE literature.) All methods use $S = 4$ Monte Carlo samples, and the control variate coefficient is estimated with either 2 extra samples (sampled estimator) or 1,000 samples (oracle estimator). Finally, Figure 4 compares VarGrad with other estimators by training a DVAE on Omniglot.

Reference
  • D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  • P. Carbonetto, M. King, and F. Hamze. A stochastic approximation method for inference in probabilistic graphical models. In Advances in Neural Information Processing Systems, 2009.
  • Y. Cong, M. Zhao, K. Bai, and L. Carin. GO gradient for expectation-based objectives. In International Conference on Learning Representations, 2019.
  • T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
  • A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. M. Blei. Variational inference via χ-upper bound minimization. In Advances in Neural Information Processing Systems, 2017.
  • Z. Dong, A. Mnih, and G. Tucker. DisARM: An antithetic gradient estimator for binary latent variables. In Advances in Neural Information Processing Systems, 2020.
  • S. Ghosal, J. K. Ghosh, and A. W. van der Vaart. Convergence rates of posterior distributions. Annals of Statistics, 28(2):500–531, 2000.
  • W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.
  • S. Gu, S. Levine, I. Sutskever, and A. Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. In International Conference on Machine Learning, 2016.
  • T. Hennigan, T. Cai, T. Norman, and I. Babuschkin. Haiku: Sonnet for JAX, 2020. URL http://github.com/deepmind/dm-haiku.
  • J. M. Hernández-Lobato, Y. Li, M. Rowland, D. Hernández-Lobato, T. Bui, and R. E. Turner. Black-box α-divergence minimization. In International Conference on Machine Learning, 2016.
  • E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.
  • M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
  • D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • W. Kool, H. van Hoof, and M. Welling. Buy 4 REINFORCE samples, get a baseline for free! In ICLR Workshop on Deep Reinforcement Learning Meets Structured Prediction, 2019.
  • W. Kool, H. van Hoof, and M. Welling. Estimating gradients for discrete random variables by sampling without replacement. In International Conference on Learning Representations, 2020.
  • B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • W. Lee, H. Yu, and H. Yang. Reparameterization gradient for non-differentiable models. In Advances in Neural Information Processing Systems, 2018.
  • Y. Li and R. E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, 2016.
  • C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
  • A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.
  • A. Mnih and D. J. Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, 2016.
  • S. Mohamed, M. Rosca, M. Figurnov, and A. Mnih. Monte Carlo gradient estimation in machine learning. arXiv preprint arXiv:1906.10652, 2019.
  • C. Naesseth, F. J. R. Ruiz, S. Linderman, and D. M. Blei. Reparameterization gradients through acceptance-rejection methods. In Artificial Intelligence and Statistics, 2017.
  • C. A. Naesseth, F. Lindsten, and D. M. Blei. Markovian score climbing: Variational inference with KL(p||q). arXiv preprint arXiv:2003.10374, 2020.
  • N. Nüsken and L. Richter. Solving high-dimensional Hamilton-Jacobi-Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. arXiv preprint arXiv:2005.05409, 2020.
  • J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012.
  • J. W. T. Peters and M. Welling. Probabilistic binary neural networks. arXiv preprint arXiv:1809.03368, 2018.
  • R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.
  • R. Ranganath, J. Altosaar, D. Tran, and D. M. Blei. Operator variational inference. In Advances in Neural Information Processing Systems, 2016.
  • R.-D. Reiss. Approximate Distributions of Order Statistics: With Applications to Nonparametric Statistics. Springer Science & Business Media, 2012.
  • D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
  • H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
  • F. J. R. Ruiz and M. K. Titsias. A contrastive divergence for combining variational inference and MCMC. In International Conference on Machine Learning, 2019.
  • F. J. R. Ruiz, M. K. Titsias, and D. M. Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, 2016.
  • R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In International Conference on Machine Learning, 2008.
  • T. Salimans and D. A. Knowles. On using control variates with stochastic approximation for variational Bayes and its connection to stochastic linear regression. arXiv preprint arXiv:1401.1022, 2014.
  • O. Shayer, D. Levi, and E. Fetaya. Learning discrete weights using the local reparameterization trick. In International Conference on Learning Representations, 2018.
  • M. K. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.
  • G. Tucker, A. Mnih, C. J. Maddison, and J. Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In International Conference on Learning Representations, 2017.
  • D. Wang, H. Liu, and Q. Liu. Variational inference with tail-adaptive f-divergence. In Advances in Neural Information Processing Systems, 2018.
  • R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
  • M. Yin and M. Zhou. ARM: Augment-REINFORCE-merge gradient for stochastic binary networks. In International Conference on Learning Representations, 2019.
  • M. Yin, Y. Yue, and M. Zhou. ARSM: Augment-REINFORCE-swap-merge estimator for gradient backpropagation through categorical variables. In International Conference on Machine Learning, 2019.
Author
Lorenz Richter
Ayman Boustati
Nikolas Nüsken
Francisco Ruiz
Omer Deniz Akyildiz