Mixed-Variable Bayesian Optimization

IJCAI 2020, pp. 2633–2639.

Cited by: 4 | DOI: https://doi.org/10.24963/ijcai.2020/365
Links: arxiv.org | dblp.uni-trier.de

Abstract:

The optimization of expensive-to-evaluate, black-box, mixed-variable functions, i.e., functions that have continuous and discrete inputs, is a difficult and yet pervasive problem in science and engineering. In Bayesian optimization (BO), special cases of this problem that consider fully continuous or fully discrete domains have been widely studied […]

Introduction
  • Bayesian optimization (BO) [37] is a well-established paradigm to optimize costly-to-evaluate, black-box objectives that has been successfully applied to multiple scientific domains.
  • Since evaluating f is costly, the goal is to query inputs based on past observations so as to find a global minimizer x* ∈ arg min_{x ∈ X} f(x) as efficiently and accurately as possible.
  • To this end, BO algorithms leverage two components: (i) a probabilistic function model, known as the surrogate, that encodes the belief about f based on the available observations, and (ii) an acquisition function α : X → R that expresses how informative an input x is about the location of x*, given the surrogate of f.
  • The goal of the acquisition function is to simultaneously learn about the inputs that are likely to be optimal and about poorly explored regions of the input space, i.e., to trade off exploitation against exploration (a generic BO loop is sketched below).
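As a concrete reference for these two components, the following is a minimal, generic BO loop — a sketch, not the authors' implementation. The 1-D objective, the GP surrogate, and the lower-confidence-bound acquisition are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Placeholder 1-D objective standing in for an expensive black-box function
# (an assumption for illustration, not one of the paper's benchmarks).
def f(x):
    return np.sin(3.0 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(3, 1))            # small initial design
y = np.array([f(xi[0]) for xi in X])
candidates = np.linspace(-2.0, 2.0, 200).reshape(-1, 1)

for _ in range(20):
    # (i) Surrogate: GP posterior over f given the observations so far.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5),
                                  normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)

    # (ii) Acquisition: a lower confidence bound trades off exploitation
    # (low predicted mean) against exploration (high predictive uncertainty).
    x_next = candidates[np.argmin(mu - 2.0 * sigma)]

    X = np.vstack([X, x_next])                      # query f and update the data
    y = np.append(y, f(x_next[0]))

print("best input:", X[np.argmin(y)].item(), "best value:", y.min())
```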
Highlights
  • Bayesian optimization (BO) [37] is a well-established paradigm to optimize costly-to-evaluate, black-box objectives that has been successfully applied to multiple scientific domains
  • We show how to use Thompson sampling [59] to suggest informative inputs to query (Sec. 3.3) and provide a bound on the regret incurred by MIVABO (Sec. 3.4).
  • For the continuous model part, we employ Random Fourier Features (RFFs) approximating a Gaussian process with a squared exponential (SE) kernel, as we found RFFs to provide the best trade-off between complexity and accuracy in practice.
  • We consider one of the most popular algorithms from OpenML, namely XGBoost, an efficient implementation of the extreme gradient boosting framework from [11].
  • We propose MIVABO, a simple yet effective method for efficient optimization of expensive-to-evaluate mixed-variable black-box objective functions, combining a linear model of expressive features with Thompson sampling (a minimal sketch of this combination follows this list).
  • Our method is characterized by a high degree of flexibility due to the modularity of its components, i.e., the feature mapping used to model the mixed-input objective and the optimization oracles used as subroutines for the acquisition procedure.
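A minimal sketch of the model/acquisition combination described above, under simplifying assumptions: RFFs approximate an SE-kernel GP on the continuous part, the discrete part is encoded with plain binary features (the paper uses more expressive discrete features), the feature weights get a standard Bayesian linear regression posterior, and Thompson sampling draws one weight vector and minimizes the resulting linear objective over a finite candidate set instead of the integer-programming / dual-decomposition oracles used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_cont, d_disc, n_feat = 2, 3, 100        # toy problem sizes (assumptions)
lengthscale, noise_var, prior_var = 0.5, 0.01, 1.0

# Random Fourier features approximating an SE-kernel GP (Rahimi & Recht):
# phi(x) = sqrt(2/D) * cos(W x + b), with W ~ N(0, l^{-2} I), b ~ U[0, 2*pi].
W = rng.normal(scale=1.0 / lengthscale, size=(n_feat, d_cont))
b = rng.uniform(0.0, 2.0 * np.pi, size=n_feat)

def features(x_cont, x_disc):
    """Joint feature vector: RFFs for the continuous part plus a simple
    binary encoding of the discrete part (a simplification for illustration)."""
    rff = np.sqrt(2.0 / n_feat) * np.cos(W @ x_cont + b)
    return np.concatenate([rff, x_disc.astype(float)])

def posterior(Phi, y):
    """Gaussian posterior over the weights of the linear surrogate."""
    A = Phi.T @ Phi / noise_var + np.eye(Phi.shape[1]) / prior_var
    cov = np.linalg.inv(A)
    return cov @ Phi.T @ y / noise_var, cov

def thompson_step(Phi_cand, mean, cov):
    """Thompson sampling: draw one weight sample from the posterior and pick
    the candidate minimizing the sampled linear acquisition."""
    w = rng.multivariate_normal(mean, cov)
    return int(np.argmin(Phi_cand @ w))
```

In each iteration one would stack the feature vectors of the observed inputs into Phi, compute the posterior, and use thompson_step to select the next query; the paper replaces the candidate enumeration with efficient optimization oracles.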
Methods
  • The authors present experimental results on tuning the hyperparameters of two machine learning algorithms: gradient boosting and a deep generative model (a VAE, cf. Table 1).
  • [Figure 1 placeholder: validation error (gradient boosting task) and negative test log-likelihood (VAE task) as a function of the number of sampled hyperparameter configurations, comparing random search, TPE, SMAC, GPyOpt, and MiVaBO.]
Results
  • The authors use the publicly available OpenML database [60], which contains evaluations for various machine learning methods trained on several datasets with many hyperparameter settings.
  • The authors consider one of the most popular algorithms from OpenML, namely XGBoost, an efficient implementation of the extreme gradient boosting framework from [11] (an illustrative single-configuration evaluation is sketched after this list).
  • The results in Fig. 1 show that MIVABO is either significantly stronger than or competitive with state-of-the-art mixed-variable BO algorithms on this challenging task.
  • Compared to TPE and SMAC, the method likely benefits from more sophisticated uncertainty estimation.
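To make the benchmark concrete, the sketch below scores a single mixed XGBoost configuration (continuous rates and regularizers plus discrete structural choices) by cross-validation. The dataset and the particular hyperparameter values are illustrative assumptions; the paper instead evaluates configurations against an OpenML-based surrogate benchmark over the space in Table 4.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset (assumption)

# One mixed configuration: continuous and discrete hyperparameters
# (illustrative values, not Table 4's exact search space).
config = {
    "learning_rate": 0.1,      # continuous
    "subsample": 0.8,          # continuous
    "reg_lambda": 1.0,         # continuous
    "max_depth": 4,            # discrete (integer)
    "n_estimators": 200,       # discrete (integer)
    "booster": "gbtree",       # discrete (categorical)
}

def validation_error(cfg):
    """1 - mean cross-validated accuracy, as a simple validation-error proxy."""
    model = XGBClassifier(**cfg, eval_metric="logloss")
    return 1.0 - cross_val_score(model, X, y, cv=3).mean()

print("validation error:", validation_error(config))
```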
Conclusion
  • The authors propose MIVABO, a simple yet effective method for efficient optimization of expensive-to-evaluate mixed-variable black-box objective functions, combining a linear model of expressive features with Thompson sampling.
  • The authors' method is characterized by a high degree of flexibility due to the modularity of its components, i.e., the feature mapping used to model the mixed-input objective, and the optimization oracles used as subroutines for the acquisition procedure.
  • This allows practitioners to tailor MIVABO to specific objectives, e.g. by incorporating prior knowledge in the feature design or by exploiting optimization oracles that can handle specific types of constraints.
  • The authors empirically demonstrate that MIVABO significantly improves optimization performance compared to state-of-the-art data-driven methods for mixed-variable optimization.
Tables
  • Table 1: Hyperparameters of the VAE. The architecture of the VAE (if all layers are enabled) is C1-C2-F1-F2-z-F3-F4-D1-D2, with C denoting a convolutional (conv.) layer, F a fully-connected (fc.) layer, D a deconvolutional (deconv.) layer and z the latent space. Layers F2 and F3 have fixed sizes of 2dz and dz units respectively, where dz denotes the dimensionality of the latent space z. The domain of the number of units of the fc. layers F1 and F4 is discretized with a step size of 64, i.e. [0, 64, 128, ..., 832, 896, 960], denoted by [0 ... 960] in the table for brevity. For dz, the domain [16 ... 64] refers to all integers within that interval. (An illustrative encoding of this search space is sketched after this list.)
  • Table 2: Mean ± one standard deviation of the negative test log-likelihood over 8 random initializations, achieved by the best VAE configuration found by SMAC, TPE and GPyOpt after 16 BO iterations, for constraint violation penalties of 500, 250 and 125 nats. Performance values of MIVABO and random search (which are not affected by the penalty) are included for reference.
  • Table 3: Mean ± one standard deviation of the number of constraint violations by SMAC, TPE, GPyOpt and random search within 16 BO iterations over 8 random initializations, for constraint violation penalties of 500, 250 and 125 nats.
  • Table 4: Hyperparameters of the XGBoost algorithm: 10 parameters, 3 of which are discrete.
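Table 1's VAE search space mixes discretized layer widths with an integer latent dimensionality; the sketch below encodes that discrete part as described (widths on a grid of 64 from 0 to 960, dz in [16, 64], F2/F3 fixed to 2*dz and dz). The binary conv/deconv enable flags and the learning rate are illustrative assumptions, not the exact contents of Table 1.

```python
import random

# Discrete part of the VAE search space described in Table 1:
# fc. layer widths lie on a grid of 64 (0 disables the layer) and the latent
# dimensionality dz is an integer in [16, 64]; F2 and F3 are tied to dz.
FC_WIDTHS = list(range(0, 961, 64))      # [0, 64, ..., 960]
LATENT_DIMS = list(range(16, 65))        # [16, ..., 64]

def sample_vae_configuration(rng=random):
    dz = rng.choice(LATENT_DIMS)
    return {
        "conv1_enabled": rng.random() < 0.5,          # assumed binary switches
        "conv2_enabled": rng.random() < 0.5,
        "fc1_units": rng.choice(FC_WIDTHS),
        "fc4_units": rng.choice(FC_WIDTHS),
        "fc2_units": 2 * dz,                          # fixed as a function of dz
        "fc3_units": dz,
        "latent_dim": dz,
        "learning_rate": 10 ** rng.uniform(-4, -2),   # assumed continuous knob
    }

print(sample_vae_configuration())
```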
Related work
  • The efficient optimization of black-box functions over continuous domains has been extensively studied in the BO literature [58, 62, 20]. However, to adapt these methods to the mixed-variable setting, one must use ad-hoc relaxation techniques to map the problem to a fully continuous one and rounding methods to map the resulting solution back to the original domain (illustrated in the sketch below). This procedure ignores the structure of the original domain and makes the quality of the solution dependent on the choice of relaxation and rounding methods. Moreover, in this setting it is hard to incorporate constraints over the discrete input variables. More recently, BO algorithms for discrete domains have been proposed [4, 42]. However, applying these methods to the mixed-variable setting requires discretizing the continuous part of the domain, where the discretization granularity plays a crucial role: if it is too small, the input space becomes prohibitively large; if it is too large, the resulting domain may contain only poorly performing values of the continuous inputs.
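A tiny sketch of the relax-and-round workaround criticized above, on a toy objective with one continuous and one integer input (an assumption for illustration): the discrete coordinate is treated as continuous during the inner optimization and rounded afterwards, which is exactly the step that can discard the structure of the original domain and degrade the solution.

```python
import numpy as np
from scipy.optimize import minimize

# Toy mixed objective: x is continuous in [0, 1], k is an integer in {0,...,5}.
def f(x, k):
    return (x - 0.3) ** 2 + 0.5 * (k - 2.4) ** 2 + 0.2 * x * k

# Relax: treat k as continuous and optimize over the box [0, 1] x [0, 5].
res = minimize(lambda z: f(z[0], z[1]), x0=np.array([0.5, 2.5]),
               bounds=[(0.0, 1.0), (0.0, 5.0)])
x_star, k_relaxed = res.x

# Round: snap the relaxed discrete coordinate back to the original domain.
k_rounded = int(round(k_relaxed))
print("relaxed value:", f(x_star, k_relaxed))
print("rounded value:", f(x_star, k_rounded))   # can be noticeably worse
```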
Funding
  • This research has been partially supported by SNSF NFP75 grant 407540 167189.
  • Matteo Turchetta was supported through the ETH-MPI Center for Learning Systems.
References
  • [1] Marc Abeille, Alessandro Lazaric, et al. Linear Thompson sampling revisited. Electronic Journal of Statistics, 2017.
  • [2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning (ICML), 2013.
  • [3] Lukas Bajer and Martin Holena. Surrogate model for continuous and discrete genetic optimization based on RBF networks. In Proceedings of the 11th International Conference on Intelligent Data Engineering and Automated Learning, 2010.
  • [4] Ricardo Baptista and Matthias Poloczek. Bayesian optimization of combinatorial structures. arXiv preprint arXiv:1806.08838, 2018.
  • [5] James Bergstra, Remi Bardenet, B. Kegl, and Y. Bengio. Implementations of algorithms for hyper-parameter optimization. In NIPS Workshop on Bayesian Optimization, 2011.
  • [6] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
  • [7] James S. Bergstra, Remi Bardenet, Yoshua Bengio, and Balazs Kegl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems (NIPS), 2011.
  • [8] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • [9] Endre Boros and Peter L. Hammer. Pseudo-Boolean optimization. Discrete Applied Mathematics, 123(1-3):155–225, 2002.
  • [10] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
  • [11] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In International Conference on Knowledge Discovery and Data Mining, 2016.
  • [12] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546, 2015.
  • [13] Katharina Eggensperger, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Efficient benchmarking of hyperparameter optimizers via surrogates. In AAAI, pages 1114–1120, 2015.
  • [14] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774, 2018.
  • [15] Eduardo C. Garrido-Merchan and Daniel Hernandez-Lobato. Dealing with categorical and integer-valued variables in Bayesian optimization with Gaussian processes. CoRR, 2018.
  • [16] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 1995.
  • [17] J. Gonzalez. GPyOpt: A Bayesian optimization framework in Python, 2016.
  • [18] Trevor J. Hastie. Generalized additive models. In Statistical Models in S, pages 249–307. Routledge, 2017.
  • [19] Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter optimization: A spectral approach. arXiv preprint arXiv:1706.00764, 2017.
  • [20] P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. JMLR, 13:1809–1837, 2012.
  • [21] Daniel Hernandez-Lobato, Jose Miguel Hernandez-Lobato, and Pierre Dupont. Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. Journal of Machine Learning Research, 14(1):1891–1945, 2013.
  • [22] Jose Miguel Hernandez-Lobato, Daniel Hernandez-Lobato, and Alberto Suarez. Expectation propagation in linear regression models with spike-and-slab priors. Machine Learning, 99(3):437–487, 2015.
  • [23] Trong Nghia Hoang, Quang Minh Hoang, Ruofei Ouyang, and Kian Hsiang Low. Decentralized high-dimensional Bayesian optimization with factor graphs. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [24] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Automated configuration of mixed integer programming solvers. In International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, 2010.
  • [25] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523.
  • [26] IBM. User's Manual for CPLEX. International Business Machines Corporation, 46(53):157, 2009.
  • [27] Rodolphe Jenatton, Cedric Archambeau, Javier Gonzalez, and Matthias Seeger. Bayesian optimization with tree-structured dependencies. In International Conference on Machine Learning (ICML), 2017.
  • [28] Donald R. Jones. Direct global optimization algorithm. Encyclopedia of Optimization, 2001.
  • [29] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [30] Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
  • [31] Nikos Komodakis, Nikos Paragios, and Georgios Tziritas. MRF energy minimization and beyond via dual decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
  • [32] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 2008.
  • [33] Yann LeCun, Corinna Cortes, and C. J. Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.
  • [34] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.
  • [35] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
  • [36] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, pages 362–369, 2001.
  • [37] Jonas Mockus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404.
  • [38] Mojmir Mutny and Andreas Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems (NIPS), December 2018.
  • [39] Diana M. Negoescu, Peter I. Frazier, and Warren B. Powell. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 2011.
  • [40] Hannes Nickisch. glm-ie: Generalised linear models inference & estimation toolbox. Journal of Machine Learning Research, 13(May):1699–1703, 2012.
  • [41] Ryan O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
  • [42] Changyong Oh, Jakub M. Tomczak, Efstratios Gavves, and Max Welling. Combinatorial Bayesian optimization using graph representations. arXiv preprint arXiv:1902.00448, 2019.
  • [43] Gurobi Optimization. Gurobi optimizer reference manual. http://www.gurobi.com, 2014.
  • [44] Valerio Perrone, Rodolphe Jenatton, Matthias Seeger, and Cedric Archambeau. Multiple adaptive Bayesian linear regression for scalable Bayesian optimization with warm start. arXiv preprint arXiv:1712.02902, 2017.
  • [45] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 1177–1184, 2008.
  • [46] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems (NIPS), 2009.
  • [47] Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537, 2018.
  • [48] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • [49] Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. arXiv preprint arXiv:1802.07028, 2018.
  • [50] Alexander M. Rush and M. J. Collins. A tutorial on dual decomposition and Lagrangian relaxation for inference in natural language processing. Journal of Artificial Intelligence Research, 45:305–362, 2012.
  • [51] Tim Salimans, Diederik P. Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. arXiv preprint arXiv:1410.6460, 2014.
  • [52] Matthias W. Seeger. Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9(Apr):759–813, 2008.
  • [53] Matthias W. Seeger and Hannes Nickisch. Compressed sensing and Bayesian experimental design. In International Conference on Machine Learning (ICML). ACM, 2008.
  • [54] Matthias W. Seeger and Hannes Nickisch. Large scale Bayesian inference and experimental design for sparse linear models. SIAM Journal on Imaging Sciences, 4(1):166–199, 2011.
  • [55] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 2016.
  • [56] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning (ICML), 2015.
  • [57] David Sontag, Amir Globerson, and Tommi Jaakkola. Introduction to dual decomposition for inference. In Optimization for Machine Learning. MIT Press, 2011.
  • [58] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.
  • [59] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
  • [60] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.
  • [61] Martin J. Wainwright, Michael I. Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
  • [62] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning (ICML), 2017.
  • [63] Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. MIT Press, 2006.