Scalable Meta-Learning for Bayesian Optimization

Benjamin Letham

arXiv: Machine Learning, abs/1802.02219, 2018.


Abstract:

Bayesian optimization has become a standard technique for hyperparameter optimization, including data-intensive models such as deep neural networks that may take days or weeks to train. We consider the setting where previous optimization runs are available, and we wish to use their results to warm-start a new optimization run. We develop ...

Introduction
  • Bayesian optimization is a technique for solving black-box optimization problems with expensive function evaluations.
  • The authors estimate the underlying function with GP regression, yielding a posterior f(x | D) with mean μ(x) and variance σ²(x), both known analytically [Rasmussen and Williams, 2006].
  • These quantities depend on the GP kernel, which has several hyperparameters that are inferred when the model is fit.
  • The GP posterior at a collection of points [f(x1 | D), . . . , f(xn | D)] is multivariate normal with mean μ(x1, . . . , xn) and covariance matrix Σ(x1, . . . , xn), as sketched below.
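  • For concreteness, a minimal sketch (not the authors' code) of that analytic posterior, using the standard GP regression formulas from Rasmussen and Williams [2006] with an ARD Matérn 5/2 kernel; the lengthscales and noise level are placeholder hyperparameters that would normally be inferred when the model is fit:

    import numpy as np

    def matern52(X1, X2, lengthscales, variance=1.0):
        # ARD Matern 5/2 kernel (placeholder hyperparameters; normally inferred).
        r = np.sqrt((((X1[:, None, :] - X2[None, :, :]) / lengthscales) ** 2).sum(-1))
        return variance * (1 + np.sqrt(5) * r + 5 * r ** 2 / 3) * np.exp(-np.sqrt(5) * r)

    def gp_posterior(X, y, X_new, lengthscales, noise=1e-2):
        # Standard GP regression posterior (Rasmussen and Williams, 2006):
        #   mean  mu(x*)    = k(x*, X) [K + noise I]^{-1} y
        #   cov   Sigma(x*) = k(x*, x*) - k(x*, X) [K + noise I]^{-1} k(X, x*)
        K = matern52(X, X, lengthscales) + noise * np.eye(len(X))
        K_s = matern52(X_new, X, lengthscales)
        mu = K_s @ np.linalg.solve(K, y)
        Sigma = matern52(X_new, X_new, lengthscales) - K_s @ np.linalg.solve(K, K_s.T)
        return mu, Sigma  # jointly multivariate normal over the new points

    # Example: posterior over 5 new points given 10 observations in a 2-d space.
    X = np.random.rand(10, 2); y = np.sin(X).sum(1)
    mu, Sigma = gp_posterior(X, y, np.random.rand(5, 2), lengthscales=np.ones(2))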
Highlights
  • Bayesian optimization is a technique for solving black-box optimization problems with expensive function evaluations
  • The “black-box” nature of the optimization assumes that nothing is known about the problem besides its function evaluations, but there are settings in which ancillary information is available in the form of prior optimizations
  • We evaluate its performance using a large collection of SVM hyperparameter optimization benchmark problems and show its use on a real problem by optimizing the computer vision platform at Facebook
  • We present several sets of experiments to explore how ranking-weighted Gaussian process ensemble performs in practice
  • Our results here show that ranking-weighted Gaussian process ensemble is a useful method for solving these problems, and provide insight into how the model behaves
Methods
  • The authors present several sets of experiments to explore how RGPE performs in practice, beginning with a simple synthetic function in Section 5.1.
  • Section 5.2 provides a comprehensive study of RGPE performance using a large collection of hyperparameter optimization benchmark problems
  • These are followed by results from warm-starting inside the computer vision platform at Facebook, which provide useful insight into its real-world application.
  • All GPs in these experiments used GPy and the ARD Matern 5/2 kernel [GPy, since 2012].
  • Kernel hyperparameters were set to their posterior means, inferred via MCMC with the NUTS sampler [Hoffman and Gelman, 2014]
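  • A minimal sketch of that modeling setup, assuming GPy's GPRegression with an ARD Matérn 5/2 kernel; for brevity it uses GPy's maximum-likelihood fit, whereas the paper sets kernel hyperparameters to posterior means from NUTS:

    import numpy as np
    import GPy

    # Toy observations standing in for one optimization run's data.
    X = np.random.rand(10, 3)
    y = np.random.rand(10, 1)

    # ARD Matern 5/2 kernel, as in the experiments.
    kernel = GPy.kern.Matern52(input_dim=3, ARD=True)
    model = GPy.models.GPRegression(X, y, kernel)
    model.optimize()  # ML-II fit here; the paper uses NUTS posterior means instead

    mu, var = model.predict(np.random.rand(5, 3))  # posterior mean and variance at new points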
Conclusion
  • The Introduction described two settings in which warmstarting may be useful: re-optimization runs and very short runs.
  • Both of these settings occur frequently in production machine learning systems at Facebook.
  • As observations accumulate on the target task, the target model will generalize better and have relatively fewer misrankings.
  • As a result the ensemble will eventually put all of its weight on the target model and will revert to standard Bayesian optimization.
  • The ensemble provides a warm start that improves performance while the target GP is weak, and is faded out as the target GP becomes more useful; a sketch of this weighting follows.
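  • To illustrate the fading behavior, here is a simplified, hypothetical sketch of ranking-based ensemble weights (not the paper's exact estimator; function names are illustrative): each model's weight is the fraction of posterior draws in which it achieves the lowest pairwise ranking loss on the target observations, with the target model evaluated via leave-one-out predictions.

    import numpy as np

    def ranking_loss(pred, y):
        # Number of pairs (j, k) whose ordering under `pred` disagrees with the data.
        n = len(y)
        return sum((pred[j] < pred[k]) != (y[j] < y[k])
                   for j in range(n) for k in range(n) if j != k)

    def ensemble_weights(base_preds, target_loo_preds, y, n_draws=100, seed=0):
        # base_preds: list of (mean, std) predictions of each past-run model at the
        # target observations; target_loo_preds: (mean, std) leave-one-out predictions
        # of the target model. Each model's weight is the fraction of draws in which
        # it has the lowest ranking loss, so a well-fit target model eventually takes
        # all of the weight and the ensemble reverts to a single GP.
        rng = np.random.default_rng(seed)
        models = list(base_preds) + [target_loo_preds]
        losses = np.array([[ranking_loss(rng.normal(mu, sd), y) for _ in range(n_draws)]
                           for mu, sd in models])
        best = losses.argmin(axis=0)  # winning model per draw (ties go to the first)
        return np.bincount(best, minlength=len(models)) / n_draws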
Related work
  • Borrowing strength from past runs is a form of meta-learning, and in the context of Bayesian optimization is often called transfer learning. A key requirement is to determine which past runs are similar to the current task. Several past methods have used manually defined metafeatures of the datasets to measure task similarity [Brazdil et al., 1994]. Bardenet et al. [2013] simultaneously optimize several problems by using metafeatures of the data as features of the GP along with the hyperparameters to optimize; observations from all runs are put on the same scale using an SVMRANK model and then used in a single GP. Yogatama and Mann [2014] select similar past optimization runs based on the nearest neighbors in metafeature space, and observations from all similar runs are then combined in a single GP. Schilling et al. [2016] construct a GP for each past optimization run, including both the past observations and those of the current task, with task similarity described by metafeatures; these models are combined using the product of GP experts model [Cao and Fleet, 2014]. Feurer et al. [2015] use metafeature similarity to select initial points for the optimization as the best points from similar runs, and then proceed with usual single-task Bayesian optimization. These methods all require metafeatures, whereas here we seek to develop a method that does not.
Funding
  • Thanks to Till Varoquaux for support in developing the method, and to Alex Chen for supporting the Lumos experiments.
Study subjects and analysis
datasets: 50
Our main experimental validation of RGPE uses a large set of hyperparameter optimization benchmark problems from Wistuba et al. [2015], also used in Wistuba et al. [2016]. They performed hyperparameter searches for SVMs on a diverse set of 50 datasets, with sizes ranging from 35 to 250,000 training examples and from 2 to 7,000 features. For each dataset, test-set accuracy was measured on a grid of six parameters: three binary parameters indicating a linear, polynomial, or RBF kernel; the penalty parameter C; the degree of the polynomial kernel (0 if unused); and the RBF kernel bandwidth (0 if unused).

References
  • Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michèle Sebag. Collaborative hyperparameter tuning. In Proceedings of the 30th International Conference on Machine Learning, ICML, 2013.
  • Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems 29, NIPS, 2016.
  • James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24, pages 2546–2554, 2011.
  • Pavel Brazdil, João Gama, and Bob Henery. Characterizing the applicability of classification algorithms using meta-level learning. In Proceedings of the European Conference on Machine Learning, ECML, pages 83–102, 1994.
  • Yanshuai Cao and David J. Fleet. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. In Modern Nonparametrics 3: Automating the Learning Pipeline workshop at NIPS, 2014.
  • Lehel Csató and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.
  • John P. Cunningham, Krishna V. Shenoy, and Maneesh Sahani. Fast Gaussian process methods for point process intensity estimation. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.
  • Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, AAAI, 2015.
  • Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, pages 1487–1496, 2017.
  • GPy. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, since 2012.
  • James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, UAI, 2013.
  • Matthew D. Hoffman and Andrew Gelman. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1351–1381, 2014.
  • Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown, and Thomas Stützle. ParamILS: An automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36:267–306, 2009.
  • Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the 5th Conference on Learning and Intelligent Optimization, LION, pages 507–523, 2011.
  • Momin Jamil and Xin-She Yang. A literature survey of benchmark functions for global optimization problems. International Journal of Mathematical Modelling and Numerical Optimisation, 4(2):150–194, 2013.
  • Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.
  • Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabás Póczos. Multi-fidelity Bayesian optimisation with continuous approximations. In Proceedings of the 34th International Conference on Machine Learning, ICML, 2017.
  • Alexandre Lacoste, Hugo Larochelle, Mario Marchand, and François Laviolette. Agnostic Bayesian learning of ensembles. In Proceedings of the 31st International Conference on Machine Learning, ICML, 2014.
  • Benjamin Letham, Brian Karrer, Guilherme Ottoni, and Eytan Bakshy. Constrained Bayesian optimization with noisy experiments. arXiv:1706.07094 [stat.ML], 2017.
  • M. Lindauer and F. Hutter. Warmstarting of model-based algorithm configuration. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI, 2018.
  • Matthias Poloczek, Jialei Wang, and Peter I. Frazier. Warm starting Bayesian optimization. In Winter Simulation Conference, WSC, 2016. arXiv:1608.03585.
  • Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, Massachusetts, 2006.
  • Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Scalable hyperparameter optimization with products of Gaussian process experts. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD, 2016.
  • Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, Jan 2016.
  • Alistair Shilton, Sunil Gupta, Santu Rana, and Svetha Venkatesh. Regret bounds for transfer learning in Bayesian optimisation. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS, 2017.
  • Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, NIPS, 2012.
  • Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. Scalable Bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML, 2015.
  • Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems 29, NIPS, 2016.
  • Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems 26, NIPS, 2013.
  • Volker Tresp. Mixtures of Gaussian processes. In Advances in Neural Information Processing Systems 13, pages 654–660, 2001.
  • Martin Wistuba, Nicolas Schilling, and Lars Schmidt-Thieme. Learning hyperparameter optimization initializations. In Proceedings of the IEEE International Conference on Data Science and Advanced Analytics, DSAA, 2015.
  • Martin Wistuba, Nicolas Schilling, and Lars Schmidt-Thieme. Two-stage transfer surrogate model for automatic hyperparameter optimization. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD, 2016.
  • Martin Wistuba. TST-R implementation, 2016. https://github.com/wistuba/TST.
  • Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, AISTATS, 2014.