
# Spike and slab variational Bayes for high dimensional logistic regression

NeurIPS 2020 (2020)

Abstract

Variational Bayes (VB) is a popular scalable alternative to Markov chain Monte Carlo for Bayesian inference. We study a mean-field spike and slab VB approximation of widely used Bayesian model selection priors in sparse high-dimensional logistic regression. We provide non-asymptotic theoretical guarantees for the VB posterior in both […]

Introduction

- Let x ∈ Rp denote a feature vector and Y ∈ {0, 1} an associated binary label to be predicted.
- In Bayesian logistic regression, one assigns a prior distribution to θ, giving a probabilistic model.
- An especially natural Bayesian way to model sparsity is via a model selection prior, which assigns probabilistic weights to every potential model, i.e. every subset of {1, . . . , p} corresponding to selecting the non-zero coordinates of θ ∈ Rp.
- This is a widely used Bayesian approach and includes the hugely popular spike and slab prior [17, 31]
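As a rough illustration of such a spike and slab prior, the sketch below draws one θ ∈ Rp from θj | w ∼iid (1 − w)δ0 + w Lap(λ) with w ∼ Beta(a0, b0). The hyperparameter defaults here (a0 = 1, b0 = p, λ = 1) are illustrative assumptions, not the paper's choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_spike_slab_prior(p, a0=1.0, b0=None, lam=1.0):
    """One prior draw of theta in R^p from
    theta_j | w ~ iid (1 - w) * delta_0 + w * Laplace(lam),  w ~ Beta(a0, b0).
    Hyperparameter defaults are illustrative; b0 = p favours sparse models."""
    if b0 is None:
        b0 = float(p)
    w = rng.beta(a0, b0)                      # mixing weight ~ Beta(a0, b0)
    slab = rng.random(p) < w                  # which coordinates fall in the slab
    return np.where(slab, rng.laplace(0.0, 1.0 / lam, size=p), 0.0)

theta = sample_spike_slab_prior(p=1000)
print(np.count_nonzero(theta))  # typically only a handful of non-zero coordinates
```

Most mass of Beta(1, p) sits near 0, so a typical draw selects only a few coordinates, which is exactly the sparsity the model selection prior encodes.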

Highlights

- Let x ∈ Rp denote a feature vector and Y ∈ {0, 1} an associated binary label to be predicted
- We further demonstrate that our Variational Bayes (VB) algorithm is empirically competitive with other state-of-the-art Bayesian sparse variable selection methods for logistic regression
- We provide theoretical guarantees for the VB posterior Q∗ in (6)
- We present a coordinate-ascent variational inference (CAVI) algorithm to compute the VB posterior Q∗ in (6). Considering the prior (2) with θj ∼iid (1 − w)δ0 + w Lap(λ) and hyperprior w ∼ Beta(a0, b0), this spike and slab prior has a hierarchical representation in terms of binary latent variables (zj)pj=1
- This paper investigates a scalable and interpretable mean-field variational approximation of the popular spike and slab prior with Laplace slabs in high-dimensional logistic regression
- We confirm the improved performance of our VB algorithm over common sparse VB approaches in a numerical study
- The proposed approach performs comparably with other state-of-the-art sparse high-dimensional Bayesian variable selection methods for logistic regression, but scales substantially better to high-dimensional models where other approaches based on the EM algorithm or Markov chain Monte Carlo (MCMC) are not computable
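The mean-field VB posterior referred to above factorizes coordinate-wise into a Gaussian slab plus a point mass at zero. A minimal sketch, assuming the family Q = ∏j [γj N(μj, σj²) + (1 − γj)δ0] with made-up variational parameters μ, σ, γ (the paper's CAVI algorithm is what actually fits them):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mean_field(mu, sigma, gamma, n_draws=10_000):
    """Draw from the mean-field family Q = prod_j [gamma_j N(mu_j, sigma_j^2)
    + (1 - gamma_j) delta_0]; mu, sigma, gamma are the variational parameters."""
    z = rng.random((n_draws, len(mu))) < gamma           # inclusion indicators
    slab = rng.normal(mu, sigma, size=(n_draws, len(mu)))
    return np.where(z, slab, 0.0)

# made-up variational parameters for p = 3 coordinates
mu = np.array([2.0, 0.0, -1.0])
sigma = np.array([0.3, 1.0, 0.2])
gamma = np.array([0.99, 0.02, 0.90])
draws = sample_mean_field(mu, sigma, gamma)
print(draws.mean(axis=0))         # close to gamma * mu
print((draws != 0).mean(axis=0))  # close to the inclusion probabilities gamma
```

The γj double as posterior inclusion probabilities, which is what makes this approximation directly interpretable for variable selection.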

Methods

**Design matrix and sparsity assumptions**

- In the high-dimensional case p > n, the parameter θ in model (1) is not identifiable, let alone estimable, without additional conditions on the design matrix X.
- A sufficient condition for consistent estimation is ‘local invertibility’ of XT X when restricted to sparse vectors.
- Define the diagonal matrix W ∈ Rn×n with ith diagonal entry.
- For s ∈ {1, . . . , p}, set κ(s) = sup_{θ: |Sθ| ≤ s} ‖Xθ‖₂² / ‖θ‖₂².
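If κ(s) is read as an s-sparse operator norm of this kind, it can be brute-forced for toy designs by maximizing the top eigenvalue of XSᵀXS over supports S. This is an illustrative sketch only (exponential in p, and it ignores any normalization the paper's exact definition may carry):

```python
import itertools
import numpy as np

def kappa(X, s):
    """Brute-force kappa(s) = sup over s-sparse theta of ||X theta||_2^2 / ||theta||_2^2,
    i.e. the largest eigenvalue of X_S^T X_S over supports with |S| = s (the sup over
    |S| <= s is attained at |S| = s by eigenvalue monotonicity). Exponential in p:
    only meant for small illustrative examples."""
    best = 0.0
    for S in itertools.combinations(range(X.shape[1]), s):
        XS = X[:, list(S)]
        best = max(best, float(np.linalg.eigvalsh(XS.T @ XS)[-1]))
    return best

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 6))
# kappa is non-decreasing in s and reaches the squared spectral norm at s = p
print(kappa(X, 1) <= kappa(X, 2) <= kappa(X, 6))  # True
```

'Local invertibility' then amounts to the corresponding restricted quantities being bounded away from degeneracy on sparse vectors.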

Results

- In Table 1 and the additional simulations in Section 8, the authors see that using Laplace slabs in the prior (9) generally outperforms the commonly used Gaussian slabs in all statistical metrics (l2-loss, MPSE, FDR), in some cases substantially so.
- This highlights the empirical advantages of using Laplace rather than Gaussian slabs for the prior underlying the VB approximation and matches the theory presented in Section 3, as well as similar observations in linear regression [42].
- The optimization routines required in Algorithm 1 mean that a naive implementation can significantly increase the run-time; the authors are currently working on a more efficient implementation as an R package, sparsevb [16], which should reduce the run-time by at least an order of magnitude.
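The statistical metrics mentioned above (ℓ2-loss and FDR of the selected model) are straightforward to compute from an estimate and a set of inclusion probabilities; a small sketch with made-up numbers (theta_hat and gamma below are hypothetical, not outputs of the paper's algorithm):

```python
import numpy as np

def l2_loss(theta_hat, theta0):
    """ell_2 estimation loss ||theta_hat - theta0||_2."""
    return float(np.linalg.norm(theta_hat - theta0))

def fdr(selected, true_support):
    """False discovery rate: fraction of selected variables that are false positives."""
    selected, true_support = set(selected), set(true_support)
    return len(selected - true_support) / len(selected) if selected else 0.0

theta0 = np.array([2.0, 0.0, 0.0, -1.5, 0.0])      # true sparse coefficients
theta_hat = np.array([1.8, 0.1, 0.0, -1.2, 0.0])   # hypothetical VB posterior mean
gamma = np.array([0.99, 0.40, 0.01, 0.95, 0.60])   # hypothetical inclusion probabilities
selected = np.flatnonzero(gamma > 0.5)             # select where gamma_j > 1/2
print(selected)                                    # [0 3 4]
print(round(fdr(selected, np.flatnonzero(theta0)), 3),
      round(l2_loss(theta_hat, theta0), 3))        # 0.333 0.374
```

Thresholding the inclusion probabilities at 1/2 is one common selection rule; the false positive at index 4 here is what drives the non-zero FDR.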

Conclusion

- This paper investigates a scalable and interpretable mean-field variational approximation of the popular spike and slab prior with Laplace slabs in high-dimensional logistic regression.
- The results derived here are the first steps towards better understanding VB methods in sparse high-dimensional nonlinear models
- This work opens up several interesting future lines of research for applying scalable VB implementations of spike and slab priors in complex high-dimensional models, including Bayesian neural networks [39], graphical models [26] and high-dimensional Bayesian time series [44].
- Since the results have no specific applications in mind, seeking rather to explain and improve an existing method, any potential broader impact will derive from improved performance in fields where such methods are already used

Tables

- Table 1: Comparing sparse Bayesian methods in high-dimensional logistic regression
- Table 2: Marginal VB credible intervals for individual features
- Table 3
- Table 4: Varying the scale hyperparameter

Funding

- Botond Szabó received funding from the Netherlands Organization for Scientific Research (NWO) under Project number: 639.031.654

Study subjects and analysis

Test cases: 4

In view of the excellent FDR control of our VB method in earlier simulations, we further investigate the performance of these marginal credible sets empirically. We consider four test cases, consisting of the above example (Test 0) and Tests 1–3 from Section 8.2. In each case, we computed 95% marginal credible intervals for the coefficients, i.e. the intervals Ij, j = 1, . . . , p, of smallest length such that Q∗(θj ∈ Ij) ≥ 0.95.
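Under the mean-field marginal γj N(μj, σj²) + (1 − γj)δ0, the smallest-length interval with mass ≥ 0.95 can be approximated by a shortest-window search over sorted Monte Carlo draws. This is an illustrative sketch with made-up variational parameters, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def credible_interval(mu, sigma, gamma, level=0.95, n_draws=200_000):
    """Approximate the shortest interval I with Q*(theta_j in I) >= level under
    the mean-field marginal gamma * N(mu, sigma^2) + (1 - gamma) * delta_0, via
    the shortest window covering a `level` fraction of sorted Monte Carlo draws."""
    slab = rng.random(n_draws) < gamma
    draws = np.sort(np.where(slab, rng.normal(mu, sigma, n_draws), 0.0))
    k = int(np.ceil(level * n_draws))                  # draws the window must cover
    widths = draws[k - 1:] - draws[: n_draws - k + 1]  # width of each candidate window
    i = int(np.argmin(widths))
    return float(draws[i]), float(draws[i + k - 1])

lo, hi = credible_interval(2.0, 0.3, 0.99)   # strongly included coordinate
print(round(lo, 2), round(hi, 2))            # roughly (1.4, 2.6)
lo0, hi0 = credible_interval(1.0, 0.5, 0.3)  # mostly-spike coordinate
print(lo0 <= 0.0 <= hi0)                     # True: the interval must contain 0
```

Note the second case: when the spike weight 1 − γj exceeds 0.05, any 95% interval is forced to contain 0, which is how these marginal sets express uncertainty about inclusion.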

Reference

- [1] ALQUIER, P., AND RIDGWAY, J. Concentration of tempered posteriors and of their variational approximations. Ann. Statist. 48, 3 (2020), 1475–1497.
- [2] ATCHADÉ, Y. A. On the contraction properties of some high-dimensional quasi-posterior distributions. Ann. Statist. 45, 5 (2017), 2248–2273.
- [3] BACH, F. Self-concordant analysis for logistic regression. Electron. J. Stat. 4 (2010), 384–414.
- [4] BANERJEE, S., CASTILLO, I., AND GHOSAL, S. Survey paper: Bayesian inference in high-dimensional models.
- [5] BHARDWAJ, S., CURTIN, R. R., EDEL, M., MENTEKIDIS, Y., AND SANDERSON, C. ensmallen: a flexible C++ library for efficient function optimization, 2018.
- [6] BISHOP, C. M. Pattern recognition and machine learning. Information Science and Statistics. Springer, New York, 2006.
- [7] BLEI, D. M., KUCUKELBIR, A., AND MCAULIFFE, J. D. Variational inference: a review for statisticians. J. Amer. Statist. Assoc. 112, 518 (2017), 859–877.
- [8] BOUCHERON, S., LUGOSI, G., AND MASSART, P. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
- [9] BÜHLMANN, P., AND VAN DE GEER, S. Statistics for high-dimensional data. Springer Series in Statistics. Springer, Heidelberg, 2011.
- [10] CARBONETTO, P., AND STEPHENS, M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal. 7, 1 (2012), 73–107.
- [11] CARVALHO, C. M., POLSON, N. G., AND SCOTT, J. G. The horseshoe estimator for sparse signals. Biometrika 97, 2 (2010), 465–480.
- [12] CASTILLO, I., AND ROQUAIN, E. On spike and slab empirical Bayes multiple testing. Ann. Statist. 48, 5 (2020), 2548–2574.
- [13] CASTILLO, I., SCHMIDT-HIEBER, J., AND VAN DER VAART, A. Bayesian linear regression with sparse priors. Ann. Statist. 43, 5 (2015), 1986–2018.
- [14] CASTILLO, I., AND SZABÓ, B. Spike and slab empirical Bayes sparse credible sets. Bernoulli 26, 1 (2020), 127–158.
- [15] CASTILLO, I., AND VAN DER VAART, A. Needles and straw in a haystack: posterior concentration for possibly sparse sequences. Ann. Statist. 40, 4 (2012), 2069–2101.
- [16] CLARA, G., SZABO, B., AND RAY, K. sparsevb: spike and slab variational Bayes for linear and logistic regression, 2020. R package version 1.0.
- [17] GEORGE, E. I., AND MCCULLOCH, R. E. Variable selection via Gibbs sampling. Journal of the American Statistical Association 88, 423 (1993), 881–889.
- [18] GHORBANI, B., JAVADI, H., AND MONTANARI, A. An instability in variational inference for topic models. arXiv e-prints (2018), arXiv:1802.00568.
- [19] GHOSAL, S., GHOSH, J. K., AND VAN DER VAART, A. W. Convergence rates of posterior distributions. Ann. Statist. 28, 2 (2000), 500–531.
- [20] GOODRICH, B., GABRY, J., ALI, I., AND BRILLEMAN, S. rstanarm: Bayesian applied regression modeling via Stan., 2020. R package version 2.19.3.
- [21] HOFFMAN, M. D., AND GELMAN, A. The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15 (2014), 1593–1623.
- [22] HORN, R. A., AND JOHNSON, C. R. Matrix analysis, second ed. Cambridge University Press, Cambridge, 2013.
- [23] HUANG, X., WANG, J., AND LIANG, F. A variational algorithm for Bayesian variable selection. arXiv e-prints (2016), arXiv:1602.07640.
- [24] JAAKKOLA, T. S., AND JORDAN, M. I. Bayesian parameter estimation via variational methods. Statistics and Computing 10, 1 (2000), 25–37.
- [25] LI, Y.-H., SCARLETT, J., RAVIKUMAR, P., AND CEVHER, V. Sparsistency of l1-Regularized M -Estimators. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (2015), pp. 644–652.
- [26] LI, Z. R., MCCORMICK, T. H., AND CLARK, S. J. Bayesian joint spike-and-slab graphical lasso. arXiv e-prints (2018), arXiv:1805.07051.
- [27] LIU, D. C., AND NOCEDAL, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45 (1989), 503–528.
- [28] LOGSDON, B. A., HOFFMAN, G. E., AND MEZEY, J. G. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC bioinformatics 11, 1 (2010), 58.
- [29] LU, Y., STUART, A., AND WEBER, H. Gaussian approximations for probability measures on Rd. SIAM/ASA J. Uncertain. Quantif. 5, 1 (2017), 1136–1165.
- [30] MCDERMOTT, P., SNYDER, J., AND WILLISON, R. Methods for Bayesian variable selection with binary response data using the EM algorithm. arXiv e-prints (2016), arXiv:1605.05429.
- [31] MITCHELL, T. J., AND BEAUCHAMP, J. J. Bayesian variable selection in linear regression. J. Amer. Statist. Assoc. 83, 404 (1988), 1023–1036.
- [32] NARISETTY, N. N., SHEN, J., AND HE, X. Skinny Gibbs: a consistent and scalable Gibbs sampler for model selection. J. Amer. Statist. Assoc. 114, 527 (2019), 1205–1217.
- [33] NEGAHBAN, S., YU, B., WAINWRIGHT, M. J., AND RAVIKUMAR, P. K. A unified framework for high-dimensional analysis of M -estimators with decomposable regularizers. In Advances in Neural Information Processing Systems 22.
- [34] NEGAHBAN, S. N., RAVIKUMAR, P., WAINWRIGHT, M. J., AND YU, B. A unified framework for high-dimensional analysis of M -estimators with decomposable regularizers. Statist. Sci. 27, 4 (11 2012), 538–557.
- [35] NICKL, R., AND RAY, K. Nonparametric statistical inference for drift vector fields of multidimensional diffusions. Ann. Statist. 48, 3 (2020), 1383–1408.
- [36] ORMEROD, J. T., YOU, C., AND MÜLLER, S. A variational Bayes approach to variable selection. Electron. J. Stat. 11, 2 (2017), 3549–3594.
- [37] PAISLEY, J. W., BLEI, D. M., AND JORDAN, M. I. Variational Bayesian inference with stochastic search. In ICML (2012), icml.cc / Omnipress.
- [38] PATI, D., BHATTACHARYA, A., AND YANG, Y. On statistical optimality of variational Bayes. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (2018), pp. 1579–1588.
- [39] POLSON, N. G., AND ROČKOVÁ, V. Posterior concentration for sparse deep learning. In Advances in Neural Information Processing Systems (2018), pp. 930–941.
- [40] RAY, K. Adaptive Bernstein–von Mises theorems in Gaussian white noise. Ann. Statist. 45, 6 (2017), 2511–2536.
- [41] RAY, K., AND SCHMIDT-HIEBER, J. Minimax theory for a class of nonlinear statistical inverse problems. Inverse Problems 32, 6 (2016), 065003, 29.
- [42] RAY, K., AND SZABO, B. Variational Bayes for high-dimensional linear regression with sparse priors. arXiv e-prints (2019), arXiv:1904.07150.
- [43] SANDERSON, C., AND CURTIN, R. Armadillo: A template-based C++ library for linear algebra. Journal of Open Source Software 1 (07 2016), 26.
- [44] SCOTT, S. L., AND VARIAN, H. R. Bayesian variable selection for nowcasting economic time series. Tech. rep., National Bureau of Economic Research, 2013.
- [45] SHETH, R., AND KHARDON, R. Excess risk bounds for the Bayes risk using variational inference in latent Gaussian models. In Advances in Neural Information Processing Systems 30. 2017, pp. 5151–5161.
- [46] SZABÓ, B., AND VAN ZANTEN, H. An asymptotic analysis of distributed nonparametric methods. Journal of Machine Learning Research 20, 87 (2019), 1–30.
- [47] TITSIAS, M., AND LAZARO-GREDILLA, M. Doubly stochastic variational Bayes for nonconjugate inference. In Proceedings of the 31st International Conference on Machine Learning (2014), pp. 1971–1979.
- [48] TITSIAS, M. K., AND LÁZARO-GREDILLA, M. Spike and slab variational inference for multi-task and multiple kernel learning. In Advances in neural information processing systems (2011), pp. 2339–2347.
- [49] VAN DER PAS, S., SZABÓ, B., AND VAN DER VAART, A. Uncertainty quantification for the horseshoe (with discussion). Bayesian Anal. 12, 4 (2017), 1221–1274. With a rejoinder by the authors.
- [50] VAN ERVEN, T., AND SZABO, B. Fast exact Bayesian inference for sparse signals in the normal sequence model. Bayesian Anal., to appear (2020).
- [51] WANG, B., AND TITTERINGTON, D. Convergence and asymptotic normality of variational Bayesian approximations for exponential family models with missing values. In Proceedings of the Twentieth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI04) (2004), pp. 577–584.
- [52] WANG, B., AND TITTERINGTON, D. M. Inadequacy of interval estimates corresponding to variational Bayesian approximations. In IN AISTATS05 (2004), pp. 373–380.
- [53] WANG, C., AND BLEI, D. M. Variational inference in nonconjugate models. J. Mach. Learn. Res. 14 (2013), 1005–1031.
- [54] WANG, Y., AND BLEI, D. Variational Bayes under model misspecification. In Advances in Neural Information Processing Systems 32. 2019, pp. 13357–13367.
- [55] WANG, Y., AND BLEI, D. M. Frequentist consistency of variational Bayes. J. Amer. Statist. Assoc. 114, 527 (2019), 1147–1161.
- [56] WEI, R., AND GHOSAL, S. Contraction properties of shrinkage priors in logistic regression. J. Statist. Plann. Inference 207 (2020), 215–229.
- [57] YANG, Y., PATI, D., AND BHATTACHARYA, A. α-variational inference with statistical guarantees. Ann. Statist. 48, 2 (2020), 886–905.
- [58] YI, N., TANG, Z., ZHANG, X., AND GUO, B. BhGLM: Bayesian hierarchical GLMs and survival models, with applications to genomics and epidemiology. Bioinformatics 35 8 (2019), 1419–1421.
- [59] ZHANG, A. Y., AND ZHOU, H. H. Theoretical and computational guarantees of mean field variational inference for community detection. Ann. Statist. 48, 5 (2020), 2575–2598.
- [60] ZHANG, C.-X., XU, S., AND ZHANG, J.-S. A novel variational Bayesian method for variable selection in logistic regression models. Comput. Statist. Data Anal. 133 (2019), 1–19.
- [61] ZHANG, F., AND GAO, C. Convergence rates of variational posterior distributions. Ann. Statist. 48, 4 (2020), 2180–2207.
- [62] ZHAO, P., AND YU, B. On model selection consistency of Lasso. J. Mach. Learn. Res. 7 (2006), 2541–2563.
