
Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes

NeurIPS 2020


Abstract

Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples due to their generalization performance and intrinsic computational advantage. However, the fact that the stochastic gradient is a biased estimator of the full gradient w…

Introduction
  • The Gaussian process (GP) has seen many success stories in various domains, be it in optimization [42, 32], reinforcement learning [33, 20], time series analysis [19, 1], control theory [17, 23] and simulation meta-modeling [44, 26].
  • As a result, during the past two decades, a large proportion of papers on GPs tackled approximate inference procedures to reduce the computational demands and numerical instabilities.
  • This push towards scalability dates back to the seminal paper by Quiñonero-Candela and Rasmussen [27] in 2005 which unified previous approximation methods into a single probabilistic framework based on inducing points.
  • This recent literature includes distributed Cholesky factorizations [24], preconditioned …
Highlights
  • The Gaussian process (GP) has seen many success stories in various domains, be it in optimization [42, 32], reinforcement learning [33, 20], time series analysis [19, 1], control theory [17, 23] and simulation meta-modeling [44, 26]
  • We prove that the conditional expectation of the loss function given covariates Xn satisfies a relaxed property of strong convexity, which provides more flexibility in the choice of initial parameters
  • Our experiments indicate that stochastic gradient-based GP (sgGP) is able to attain preferable hyperparameters to exact GP (EGP) at a much lower computational cost
  • We provide theoretical guarantees for minibatch stochastic gradient descent (SGD) for training the Gaussian process (GP) model
  • We prove that the parameter iterates converge to the true hyperparameters and to a critical point of the full loss function, at a rate up to a statistical error term that depends on the minibatch size
  • Given the correlation structure of GPs, the challenge lies in the bias of the stochastic gradient when taking the expectation w.r.t. random sampling (illustrated schematically below)
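To make the bias issue concrete, the following schematic contrasts the full-data GP loss with a minibatch loss built by treating the sampled block as a small GP. This is an illustration, not the paper's exact notation; the constants, scaling, and the symbols B, m, and ℓ_B are assumptions introduced here.

    % Full-data negative log marginal likelihood over all n observations
    \[
      \mathcal{L}_n(\theta) \;=\; \tfrac{1}{2n}\Big( y_n^\top K_n(\theta)^{-1} y_n + \log\det K_n(\theta) \Big),
      \qquad K_n(\theta) = \sigma_f^2 K_f + \sigma^2 I_n .
    \]
    % Minibatch loss for a random index set B of size m, with kernel submatrix K_B
    \[
      \ell_B(\theta) \;=\; \tfrac{1}{2m}\Big( y_B^\top K_B(\theta)^{-1} y_B + \log\det K_B(\theta) \Big).
    \]
    % With i.i.d. samples the full loss is an average of per-sample losses, so the
    % stochastic gradient is unbiased; here K_n does not factorize over blocks, hence in general
    \[
      \mathbb{E}_B\big[\nabla_\theta \ell_B(\theta)\big] \;\neq\; \nabla_\theta \mathcal{L}_n(\theta),
    \]
    % and controlling this bias is the core difficulty the paper's analysis addresses.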
Results
  • 7.1 Numerical Illustration of Theory: the authors conduct simulation studies to verify the theoretical results.
  • The authors consider n = 1024, x_i ~ i.i.d. N(0, 5^2), and y_n ~ N(0, σ_f^2 K_f + σ^2 I_n).
  • The authors perform 25 epochs of minibatch SGD updates with diminishing step sizes α_k = α_1/k; similar numerical results can be obtained by sampling minibatches with replacement.
  • Each experiment is repeated 10 times with independent data pools (a code sketch of this setup is given below).
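A minimal sketch of this kind of experiment, assuming a PyTorch implementation with a squared-exponential kernel and plain SGD on log-scale hyperparameters. The kernel form, the minibatch size m = 64, the initial step size α_1, and the "true" hyperparameter values used to simulate the data are illustrative assumptions, not the paper's exact choices.

    import math
    import torch

    def rbf_kernel(x, sf2, ls):
        # Squared-exponential kernel matrix for 1-D inputs x: sf2 * exp(-(x_i - x_j)^2 / (2 ls^2)).
        d2 = (x[:, None] - x[None, :]) ** 2
        return sf2 * torch.exp(-0.5 * d2 / ls ** 2)

    def minibatch_nll(xb, yb, log_sf2, log_ls, log_s2):
        # Negative log marginal likelihood of the minibatch, treated as a small GP,
        # averaged over the minibatch so gradient magnitudes do not scale with m.
        m = xb.shape[0]
        K = rbf_kernel(xb, log_sf2.exp(), log_ls.exp()) + log_s2.exp() * torch.eye(m)
        L = torch.linalg.cholesky(K)
        alpha = torch.cholesky_solve(yb[:, None], L)
        nll = 0.5 * (yb @ alpha.squeeze()) + torch.log(torch.diag(L)).sum() + 0.5 * m * math.log(2 * math.pi)
        return nll / m

    # Synthetic data mirroring the setup above: n = 1024, x_i ~ N(0, 5^2), y ~ N(0, sf2*Kf + s2*I).
    torch.manual_seed(0)
    n = 1024
    x = 5.0 * torch.randn(n)
    K_true = rbf_kernel(x, sf2=2.0, ls=1.0) + 0.25 * torch.eye(n)   # "true" sf2, ls, s2 are illustrative
    y = torch.linalg.cholesky(K_true) @ torch.randn(n)

    # Hyperparameters are kept on the log scale so plain SGD preserves positivity.
    log_sf2 = torch.zeros((), requires_grad=True)
    log_ls = torch.zeros((), requires_grad=True)
    log_s2 = torch.zeros((), requires_grad=True)
    params = [log_sf2, log_ls, log_s2]

    m, alpha1, k = 64, 0.1, 0
    for epoch in range(25):                      # 25 epochs, as in the paper's setup
        for idx in torch.randperm(n).split(m):   # minibatches sampled without replacement
            k += 1
            loss = minibatch_nll(x[idx], y[idx], log_sf2, log_ls, log_s2)
            loss.backward()
            with torch.no_grad():
                for p in params:
                    p -= (alpha1 / k) * p.grad   # diminishing step size alpha_k = alpha_1 / k
                    p.grad.zero_()

    print({"sigma_f^2": log_sf2.exp().item(),
           "lengthscale": log_ls.exp().item(),
           "sigma^2": log_s2.exp().item()})

Each SGD step only factorizes an m x m kernel block, which is what gives the method its computational advantage over exact GP training on the full n x n matrix.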
Conclusion
  • The authors provide theoretical guarantees for minibatch SGD when training the Gaussian process (GP) model.
  • Given the correlation structure of GPs, the challenge lies in the bias of stochastic gradient when taking expectation w.r.t. random sampling.
  • Numerical studies support the theoretical results and show that minibatch SGD has better performance than some state-of-the-art methods for various datasets while enjoying huge computational benefits.
  • The authors note that investigating variance reduction techniques in correlated settings might be a promising direction to explore.
Tables
  • Table 1: Comparison of root-mean-square error (RMSE) and training time of different GPs on benchmark datasets. We report the mean and standard error of RMSE as well as the mean and standard deviation of training time over 10 trials. The best results are in bold (lower is better). For the query and borehole datasets, we are unable to fit EGP due to memory limits.
  • Table2: Illustration of sgGP on toy datasets. We follow similar setups in Table 1 but train 25 epochs
Related Work
  • As mentioned earlier, there are several methods that tackle the computational complexity of GPs. They can be roughly split into three categories, though this is by no means an exhaustive list (see the survey in [1]).
  • Exact inference via matrix-vector multiplications (MVM): this recent class of literature has had the most success in scaling GPs. Initially such approaches depended on a structured kernel matrix where data lie on a regularly spaced grid [30, 39]. Then, with the help of GPU acceleration, conjugate gradients and distributed Cholesky factorization, MVMs were applied to more general settings [38, 12, 37]. Such approaches have training complexity of O(n^2) (O(n log n) is possible on spaced grids), yet are amenable to distributed computation and GPU acceleration.
  • Sparse approximate inference: this class of methods is based on a low-rank approximation of the empirical kernel matrix, K_n ≈ K_nz K_zz^{-1} K_zn, where z denotes a set of inducing points with |z| = n_z ≪ n [18, 2, 8, 43, 31]. Their time complexity is mainly O(n_z^2 n), which can be reduced to O(n + c n_z) for structured and regularly spaced grids (a generic sketch of this low-rank idea is given after this list). Indeed, sparse GPs have gained increased attention since variational inference (VI) laid the theoretical foundation for this class of inducing-point/kernel approximations (starting from the early work of Titsias [35]).
  • Stochastic variational inference (SVI): following the work of [15], SVI was introduced to GPs in [13]. The key idea is to introduce a variational distribution over the inducing points so that the VI framework is amenable to stochastic optimization. This leads to a complexity of O(n_z^3) at each iteration [14, 5]. Unfortunately, recent results in [7] show the need for at least O(log^D n) inducing points for Gaussian kernels, which implies superlinear growth with the input dimension.
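To make the low-rank idea concrete, here is a generic Nyström-style sketch of the K_n ≈ K_nz K_zz^{-1} K_zn approximation combined with the Woodbury identity. It is not any of the specific algorithms cited above; the kernel, the random choice of inducing points, the jitter, and the toy data are assumptions made for illustration.

    import numpy as np

    def rbf_kernel(a, b, sf2=1.0, ls=1.0):
        # Squared-exponential kernel between 1-D input vectors a and b.
        d2 = (a[:, None] - b[None, :]) ** 2
        return sf2 * np.exp(-0.5 * d2 / ls ** 2)

    rng = np.random.default_rng(0)
    n, nz, s2 = 2000, 50, 0.1
    x = rng.normal(0.0, 5.0, size=n)
    y = np.sin(x) + rng.normal(0.0, np.sqrt(s2), size=n)     # toy targets

    z = x[rng.choice(n, size=nz, replace=False)]             # inducing points chosen at random
    K_nz = rbf_kernel(x, z)                                  # n  x nz
    K_zz = rbf_kernel(z, z) + 1e-6 * np.eye(nz)              # nz x nz, with jitter for stability

    # Low-rank approximation K_n ~= K_nz K_zz^{-1} K_zn.  The Woodbury identity then gives
    #   (K_nz K_zz^{-1} K_zn + s2 I)^{-1} y = y/s2 - (1/s2) K_nz (s2 K_zz + K_zn K_nz)^{-1} K_zn y
    # without ever forming an n x n matrix.
    A = s2 * K_zz + K_nz.T @ K_nz                            # nz x nz system, O(n nz^2) to build
    alpha = y / s2 - (K_nz @ np.linalg.solve(A, K_nz.T @ y)) / s2

    # alpha approximates (K_n + s2 I)^{-1} y, the vector behind the GP predictive mean,
    # at O(n nz^2) cost instead of the O(n^3) cost of an exact solve.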
Funding
  • Raskutti was partially supported by NSF DMS-1811767
  • Al Kontar was partially supported by NSF CMMI-1931950
References
  • [1] M. Álvarez and N. D. Lawrence. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12(May):1459–1500, 2011.
  • [2] Mauricio Alvarez and Neil D Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems, pages 57–64, 2009.
  • [3] Mauricio Álvarez, David Luengo, Michalis Titsias, and Neil D Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 25–32, 2010.
  • [4] Sunil Arya, David Mount, Samuel E. Kemp, and Gregory Jefferis. RANN: Fast Nearest Neighbour Search (Wraps ANN Library) Using L2 Metric, 2019. URL https://CRAN.R-project.org/package=RANN. R package version 2.6.1.
  • [5] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • [6] Mikio L Braun. Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7(Nov):2303–2328, 2006.
  • [7] David R Burt, Carl E Rasmussen, and Mark Van Der Wilk. Rates of convergence for sparse variational Gaussian process regression. In Proceedings of the 36th International Conference on Machine Learning, 2019.
  • [8] Andreas C Damianou, Michalis K Titsias, and Neil D Lawrence. Variational inference for latent variables and uncertain inputs in Gaussian processes. The Journal of Machine Learning Research, 17(1):1425–1486, 2016.
  • [9] Marc Peter Deisenroth and Jun Wei Ng. Distributed Gaussian processes. ICML, 2015.
  • [10] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • [11] Reinhard Furrer, Marc G Genton, and Douglas Nychka. Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15(3):502–523, 2006.
  • [12] Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, pages 7576–7586, 2018.
  • [13] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. UAI, 2013.
  • [14] Trong Nghia Hoang, Quang Minh Hoang, and Bryan Kian Hsiang Low. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In ICML, pages 569–578, 2015.
  • [15] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • [16] Cari G Kaufman, Mark J Schervish, and Douglas W Nychka. Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103(484):1545–1555, 2008.
  • [17] Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Agathe Girard. Gaussian process model based predictive control. In Proceedings of the 2004 American Control Conference, volume 3, pages 2214–2219. IEEE, 2004.
  • [18] Raed Kontar, Shiyu Zhou, Chaitanya Sankavaram, Xinyu Du, and Yilu Zhang. Nonparametric modeling and prognosis of condition monitoring signals using multivariate Gaussian convolution processes. Technometrics, 60(4):484–496, 2018.
  • [19] Raed Kontar, Garvesh Raskutti, and Shiyu Zhou. Minimizing negative transfer of knowledge in multivariate Gaussian processes: A scalable and regularized approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [20] Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.
  • [21] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood – approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, volume 85, 2013.
  • [22] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [23] Ali Mesbah. Stochastic model predictive control: An overview and perspectives for future research. IEEE Control Systems Magazine, 36(6):30–44, 2016.
  • [24] Duc-Trung Nguyen, Maurizio Filippone, and Pietro Michiardi. Exact Gaussian process regression with distributed computations. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pages 1286–1295, 2019.
  • [25] Trung V Nguyen, Edwin V Bonilla, et al. Collaborative multi-output Gaussian processes. In UAI, pages 643–652, 2014.
  • [26] Peter ZG Qian and CF Jeff Wu. Bayesian hierarchical modeling for integrating low-accuracy and high-accuracy experiments. Technometrics, 50(2):192–204, 2008.
  • [27] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
  • [28] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
  • [29] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71.
  • [30] Yunus Saatçi. Scalable inference for structured Gaussian process models. PhD thesis, 2012.
  • [31] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.
  • [32] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
  • [33] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
  • [34] S. Surjanovic and D. Bingham. Virtual library of simulation experiments: Test functions and datasets. Retrieved May 25, 2020, from http://www.sfu.ca/~ssurjano.
  • [35] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.
  • [36] Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000.
  • [37] Shashanka Ubaru, Jie Chen, and Yousef Saad. Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017.
  • [38] Ke Wang, Geoff Pleiss, Jacob Gardner, Stephen Tyree, Kilian Q Weinberger, and Andrew Gordon Wilson. Exact Gaussian processes on a million data points. In Advances in Neural Information Processing Systems, pages 14622–14632, 2019.
  • [39] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.
  • [40] Andrew G Wilson, Zhiting Hu, Russ R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016.
  • [41] Zichao Yang, Andrew Wilson, Alex Smola, and Le Song. A la carte – learning fast kernels. In Artificial Intelligence and Statistics, pages 1098–1106, 2015.
  • [42] Xubo Yue and Raed Al Kontar. Why non-myopic Bayesian optimization is promising and how far should we look-ahead? A study via rollout. AISTATS, 2020.
  • [43] Jing Zhao and Shiliang Sun. Variational dependent multi-output Gaussian process dynamical systems. The Journal of Machine Learning Research, 17(1):4134–4169, 2016.
  • [44] Qiang Zhou, Peter ZG Qian, and Shiyu Zhou. A simple approach to emulation for computer models with qualitative and quantitative factors. Technometrics, 53(3):266–273, 2011.
Authors
Hao Chen
Lili Zheng
Raed Al Kontar