Probabilistic Linear Solvers for Machine Learning

Jonathan Wenger

NeurIPS 2020.

Abstract:

Linear systems are the bedrock of virtually all numerical computation. Machine learning poses specific challenges for the solution of such systems due to their scale, characteristic structure, stochasticity and the central role of uncertainty in the field. Unifying earlier work, we propose a class of probabilistic linear solvers which jo…
Introduction
  • An important example is the class of kernel Gram matrices, which exhibit specific sparsity structure and spectral properties depending on the kernel choice and the generative process of the data.
  • Exploiting such prior information is a prime application for probabilistic linear solvers, which aim to quantify numerical uncertainty arising from limited computational resources.
  • Linear algebra for machine learning should integrate all sources of uncertainty in a computational pipeline – aleatoric, epistemic and numerical – into one coherent probabilistic framework
Highlights
  • One of the most fundamental problems in machine learning, statistics and scientific computation at large is the solution of linear systems of the form Ax∗ = b, where A ∈ ℝ^(n×n) is a symmetric positive definite matrix [1, 2, 3].
  • An important example is the class of kernel Gram matrices, which exhibit specific sparsity structure and spectral properties depending on the kernel choice and the generative process of the data. Exploiting such prior information is a prime application for probabilistic linear solvers, which aim to quantify numerical uncertainty arising from limited computational resources (a minimal sketch of this idea follows this list).
  • We proposed first principles to constrain the space of possible generative models and derived a suitable covariance class
  • We identified parameter choices that recover the iterates of the conjugate gradient method in the mean, but add calibrated uncertainty.
  • In the final parts of this paper, we showcased applications such as kernel matrix inversion, where prior spectral information can be used for uncertainty calibration, and outlined example use cases for the propagation of numerical uncertainty through computations.
  • The matrix-based view of probabilistic linear solvers could inform probabilistic approaches to matrix decompositions, analogous to the way Lanczos methods are used in the classical setting
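The following minimal sketch illustrates the solution-based inference idea referenced in the highlights above: a Gaussian prior over the solution x of Ax = b is conditioned on k projections Sᵀ(Ax) = Sᵀb obtained from matrix-vector products, so the posterior mean acts as the point estimate and the posterior covariance quantifies the remaining numerical uncertainty. This is not the paper's Algorithm 1; the isotropic prior, the random search directions and all names are illustrative assumptions.

    import numpy as np

    def probabilistic_solve(A, b, k, rng=None):
        """Gaussian posterior over the solution of Ax = b after k matrix-vector products.

        Prior x ~ N(x0, Sigma0); observations are the projections S^T A x = S^T b.
        """
        rng = np.random.default_rng(rng)
        n = b.shape[0]
        x0 = np.zeros(n)                  # prior mean over the solution
        Sigma0 = np.eye(n)                # prior covariance (isotropic, for illustration only)
        S = rng.standard_normal((n, k))   # search directions; the paper's solver chooses these adaptively

        AS = A @ S                                # the k matrix-vector products actually paid for
        gram = S.T @ (A @ (Sigma0 @ AS))          # Cov[S^T A x] under the prior (A symmetric)
        cross = Sigma0 @ AS                       # Cov[x, S^T A x] under the prior
        gain = cross @ np.linalg.inv(gram)
        mean = x0 + gain @ (S.T @ b - S.T @ (A @ x0))   # posterior mean (point estimate of x)
        cov = Sigma0 - gain @ cross.T                    # posterior covariance (numerical uncertainty)
        return mean, cov

    # Usage: uncertainty contracts as more matrix-vector products are observed.
    rng = np.random.default_rng(0)
    n = 50
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)           # symmetric positive definite test matrix
    x_true = rng.standard_normal(n)
    b = A @ x_true
    for k in (5, 20, 50):
        mean, cov = probabilistic_solve(A, b, k, rng=0)
        print(k, np.linalg.norm(mean - x_true), np.trace(cov))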
Methods
  • The authors demonstrate the functionality of Algorithm 1.
  • When using a probabilistic linear solver for this task, the authors can quantify the uncertainty arising from finite computation as well as the solver's belief about the shape of the GP at inputs that have not yet been evaluated.
  • In large-scale applications, the authors can trade off computational expense for increased uncertainty arising from the numerical approximation and quantified by the probabilistic linear solver.
  • By assessing the numerical uncertainty arising from not exploring the full space, the authors can judge the quality of the estimated GP mean and marginal variance (a sketch of this propagation follows this list).
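As a companion to the points above, the sketch below shows one way the numerical uncertainty of such a solver could be propagated into GP regression: the representer weights α solving (K + σ²I)α = y are estimated with the hypothetical probabilistic_solve from the sketch above using only k < n iterations, and the solver's posterior covariance over α induces additional variance on the GP mean prediction. This is an illustrative sketch of the use case, not the paper's implementation; the kernel, the data and all names are assumptions.

    # Sketch: propagate the solver's numerical uncertainty into GP mean predictions.
    # Reuses the hypothetical probabilistic_solve defined in the sketch above.
    import numpy as np

    def rbf_kernel(X1, X2, lengthscale=1.0):
        """Squared-exponential kernel matrix between two sets of 1-d inputs."""
        return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / lengthscale**2)

    rng = np.random.default_rng(1)
    X = np.sort(rng.uniform(-3, 3, size=16))           # n = 16 toy data set, as in the figure
    y = np.sin(X) + 0.1 * rng.standard_normal(16)
    Xs = np.linspace(-3, 3, 200)                        # prediction inputs

    K = rbf_kernel(X, X) + 0.1**2 * np.eye(16)          # regularized kernel Gram matrix (SPD)
    Ks = rbf_kernel(Xs, X)

    k = 8                                               # fewer iterations than n: finite computation
    alpha_mean, alpha_cov = probabilistic_solve(K, y, k, rng=1)

    gp_mean = Ks @ alpha_mean                                     # numerical estimate of the GP mean
    numerical_var = np.einsum("ij,jk,ik->i", Ks, alpha_cov, Ks)   # extra variance from limited computation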
Conclusion
  • The authors condensed a line of previous research on probabilistic linear algebra into a self-contained algorithm for the solution of linear problems in machine learning.
  • The authors' proposed framework incorporates prior knowledge on the system matrix or its inverse and performs inference for both in a consistent fashion.
  • In the final parts of this paper, the authors showcased applications such as kernel matrix inversion, where prior spectral information can be used for uncertainty calibration, and outlined example use cases for the propagation of numerical uncertainty through computations.
  • While the theoretical framework can incorporate noisy matrix-vector product evaluations into its inference procedure via a Gaussian likelihood, practically tractable inference in the inverse model is more challenging.
  • The matrix-based view of probabilistic linear solvers could inform probabilistic approaches to matrix decompositions, analogous to the way Lanczos methods are used in the classical setting
Tables
  • Table 1: Desired properties of probabilistic linear solvers. Symbols (✓, ∼, ✗) indicate which properties are encoded in our proposed solver (see Algorithm 1) and to what degree.
  • Table 2: Uncertainty calibration for kernel matrices.
Related work
  • Numerical methods for the solution of linear systems have been studied in great detail since the last century; standard texts [1, 2, 10, 3] give an in-depth overview. The conjugate gradient method, recovered by our algorithm for a specific choice of prior, was introduced by Hestenes and Stiefel [25]. Recently, randomization has been exploited to develop improved algorithms for large-scale problems arising from machine learning [26, 27]. The key difference from our approach is that we do not rely on sampling to approximate large-scale matrices, but instead perform probabilistic inference. Our approach is based on the framework of probabilistic numerics [14, 15] and is a natural continuation of previous work on probabilistic linear solvers. In historical order, Hennig and Kiefel [18] provided a probabilistic interpretation of Quasi-Newton methods, which was expanded upon in [11]; this work also relied on the symmetric matrix-variate Gaussian used in our paper. Bartels and Hennig [28] estimate numerical error in approximate least-squares solutions using a probabilistic model. More recently, Cockayne et al. [19] proposed a Bayesian conjugate gradient method that performs inference on the solution of the system. This was connected to the matrix-based view by Bartels et al. [13].
Funding
  • The authors gratefully acknowledge financial support by the European Research Council through ERC StG Action 757275 / PANAMA; the DFG Cluster of Excellence “Machine Learning - New Perspectives for Science”, EXC 2064/1, project number 390727645; the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A); and funds from the Ministry of Science, Research and Arts of the State of Baden-Württemberg.
Figures
  • Rayleigh regression. Uncertainty calibration via GP regression on {ln R(A, s_i)}_{i=1}^k after k = 91 iterations of Algorithm 1 on an n = 1000-dimensional Matérn-3/2 kernel matrix inversion problem. The degrees of freedom φ = ψ⁻¹ > 0 are set based on the average predicted Rayleigh quotient for the remaining n − k = 909 dimensions. (A simplified sketch of this calibration idea follows these captions.)
  • Numerical uncertainty in GP inference. Computing the posterior mean and covariance of a GP regression using a probabilistic linear solver. The GP mean for a toy data set (n = 16) is computed with an increasing number of iterations k of Algorithm 1; the numerical estimate of the GP mean approaches the true mean. Note that the numerical variance is different from the marginal variance of the GP.
  • Solving the Dirichlet problem with a probabilistic linear solver. Figures 4a and 4b show the ground truth and the mean of the solution computed with Algorithm 1 after k = 21 iterations, along with samples from the posterior. The posterior on the coarse mesh can be used to assess uncertainty about the solution on a finer mesh. The signed error computed on the coarse mesh in Figure 4c shows that the approximation is better on the boundary of Ω. In the case of perfect uncertainty calibration, Figure 4d would represent a sample from N(0, I); the apparent structure in the plot and the smaller-than-expected deviations near the boundary indicate the conservative confidence estimate of the solver.
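A simplified sketch of the Rayleigh-quotient calibration idea from the first caption above: compute the Rayleigh quotients R(A, s_i) of the explored search directions, fit a trend to their logarithms over the iteration index (a plain linear fit stands in for the GP regression used in the paper), and extrapolate to the unexplored dimensions to obtain an average predicted Rayleigh quotient. The matrix, the directions and all names here are stand-ins.

    import numpy as np

    def rayleigh_quotients(A, S):
        """R(A, s_i) = s_i^T A s_i / (s_i^T s_i) for each column s_i of S."""
        return np.einsum("ij,ij->j", S, A @ S) / np.einsum("ij,ij->j", S, S)

    rng = np.random.default_rng(2)
    n, k = 1000, 91
    M = rng.standard_normal((n, n))
    A = M @ M.T / n + np.eye(n)              # stand-in SPD matrix (not a Matérn Gram matrix)
    S = rng.standard_normal((n, k))          # stand-in for the solver's search directions

    # Fit a simple trend to the observed log Rayleigh quotients over the iteration index
    # (the paper uses GP regression here) and extrapolate to the unexplored dimensions.
    log_R = np.log(rayleigh_quotients(A, S))
    idx = np.arange(1, k + 1)
    slope, intercept = np.polyfit(idx, log_R, 1)

    remaining = np.arange(k + 1, n + 1)                    # the n - k = 909 unexplored dimensions
    predicted_R = np.exp(slope * remaining + intercept)
    scale = predicted_R.mean()                             # average predicted Rayleigh quotient,
    print(scale)                                           # used to set the uncertainty scale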

References
  • [1] Youcef Saad. Numerical Methods for Large Eigenvalue Problems. Manchester University Press, 1992.
  • [2] Lloyd N. Trefethen and David Bau. Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997.
  • [3] Gene H. Golub and Charles F. van Loan. Matrix Computations. JHU Press, fourth edition, 2013.
  • [4] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, 2006.
  • [5] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.
  • [6] Rudolph E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
  • [7] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
  • [8] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
  • [9] Clive A. J. Fletcher. Computational Galerkin Methods. Springer, 1984.
  • [10] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.
  • [11] Philipp Hennig. Probabilistic interpretation of linear solvers. SIAM Journal on Optimization, 25(1):234–260, 2015.
  • [12] Jon Cockayne, Chris Oates, Tim J. Sullivan, and Mark Girolami. Bayesian probabilistic numerical methods. SIAM Review, 61(4):756–789, 2019.
  • [13] Simon Bartels, Jon Cockayne, Ilse C. Ipsen, and Philipp Hennig. Probabilistic linear solvers: A unifying view. Statistics and Computing, 29(6):1249–1263, 2019.
  • [14] Philipp Hennig, Mike A. Osborne, and Mark Girolami. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 471(2179), 2015.
  • [15] Chris Oates and Tim J. Sullivan. A modern retrospective on probabilistic numerics. Statistics and Computing, 2019.
  • [16] Paul Lévy. Calcul des probabilités. J. Gabay, 1925.
  • [17] Charles F. Van Loan. The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics, 123(1–2):85–100, 2000.
  • [18] Philipp Hennig and Martin Kiefel. Quasi-Newton methods: A new direction. Journal of Machine Learning Research, 14(Mar):843–865, 2013.
  • [19] Jon Cockayne, Chris Oates, Ilse C. Ipsen, and Mark Girolami. A Bayesian conjugate gradient method. Bayesian Analysis, 14(3):937–1012, 2019.
  • [20] Matthias Seeger. Low rank updates for the Cholesky decomposition. Technical report, University of California at Berkeley, 2008.
  • [21] David G. Luenberger. Introduction to Linear and Nonlinear Programming. Addison-Wesley Publishing Company, 1973.
  • [22] Christopher C. Paige. Computational variants of the Lanczos method for the eigenproblem. IMA Journal of Applied Mathematics, 10(3):373–381, 1972.
  • [23] Horst D. Simon. Analysis of the symmetric Lanczos algorithm with reorthogonalization methods. Linear Algebra and its Applications, 61:101–131, 1984.
  • [24] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Government Press Office, Los Angeles, CA, 1950.
  • [25] Magnus Rudolph Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49, 1952.
  • [26] Petros Drineas and Michael W. Mahoney. RandNLA: Randomized numerical linear algebra. Communications of the ACM, 59(6):80–90, 2016.
  • [27] Alex Gittens and Michael W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research, 17(1):3977–4041, January 2016.
  • [28] Simon Bartels and Philipp Hennig. Probabilistic approximate least-squares. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 51 of Proceedings of Machine Learning Research, pages 676–684, Cadiz, Spain, 09–11 May 2016. PMLR.
  • [29] John E. Dennis, Jr. and Jorge J. Moré. Quasi-Newton methods, motivation and theory. SIAM Review, 19(1):46–89, 1977.
  • [30] Hermann Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912.
  • [31] US Department of Transportation. Airline on-time performance data. https://www.transtats.bts.gov/, 2020. Accessed: 2020-05-26.
  • [32] Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Laurent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborová. Machine learning and the physical sciences. Reviews of Modern Physics, 91(4):045002, 2019.
  • [33] Martin Alnæs, Jan Blechta, Johan Hake, August Johansson, Benjamin Kehlet, Anders Logg, Chris Richardson, Johannes Ring, Marie E. Rognes, and Garth N. Wells. The FEniCS project version 1.5. Archive of Numerical Software, 3(100), 2015.
  • [34] Michael L. Parks, Eric De Sturler, Greg Mackey, Duane D. Johnson, and Spandan Maiti. Recycling Krylov subspaces for sequences of linear systems. SIAM Journal on Scientific Computing, 28(5):1651–1674, 2006.
  • [35] Filip de Roos and Philipp Hennig. Krylov subspace recycling for fast iterative least-squares in machine learning. arXiv preprint, 2017. URL: http://arxiv.org/abs/1706.00241.