# Bayesian Optimisation over Multiple Continuous and Categorical Inputs

ICML 2020, 2019.

Abstract:

Efficient optimisation of black-box problems that comprise both continuous and categorical inputs is important, yet poses significant challenges. We propose a new approach, Continuous and Categorical Bayesian Optimisation (CoCaBO), which combines the strengths of multi-armed bandits and Bayesian optimisation to select values for both ca...

Introduction

- Existing work has shown Bayesian optimisation (BO) to be remarkably successful at optimising functions with continuous input spaces [29, 17, 18, 26, 28, 12, 1].
- In many situations, optimisation problems involve a mixture of continuous and categorical variables.
- When tuning a deep neural network, for example, one may want to adjust the learning rate and the number of units in each layer (continuous or numerical inputs), as well as the activation function type in each layer (a categorical input).
- When tuning a gradient boosting ensemble of decision trees, one may wish to adjust the learning rate and the maximum depth of the trees, as well as the boosting algorithm and loss function.
- If some inputs are categorical rather than continuous, the common assumption that the BO acquisition function is differentiable and continuous over the input space, which allows the acquisition function to be optimised efficiently, no longer holds.

Highlights

- Existing work has shown Bayesian optimisation (BO) to be remarkably successful at optimising functions with continuous input spaces [29, 17, 18, 26, 28, 12, 1].
- We present Continuous and Categorical Bayesian Optimisation (CoCaBO), a new approach for optimising a black-box function with multiple continuous and categorical inputs.
- CoCaBO combines the strengths of multi-armed bandits and Gaussian process-based Bayesian optimisation (Section 4.1).
- We extend CoCaBO to the batch setting, enabling parallel evaluations at each stage of the optimisation.
- We find CoCaBO to offer a very competitive alternative to existing approaches.
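
The combination of a multi-armed bandit with a continuous optimiser can be illustrated with a small sketch. This is not the authors' implementation: the summary does not name the bandit algorithm, so the sketch assumes EXP3 (from the nonstochastic bandit paper by Auer et al. in the reference list), replaces the BO acquisition step with plain random sampling, and uses a made-up `toy_objective`. All class, function, and parameter names are hypothetical.

```python
import math
import random


class Exp3:
    """Minimal EXP3 bandit over n_arms categorical choices (rewards in [0, 1])."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.n_arms = n_arms
        self.gamma = gamma              # exploration rate (assumed value)
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        return self.rng.choices(range(self.n_arms),
                                weights=self.probabilities())[0]

    def update(self, arm, reward):
        # Importance-weighted reward estimate for the arm that was played.
        estimate = reward / self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)


def toy_objective(category, x):
    """Hypothetical black box: category 2 is best, x is best near 0.7."""
    return [0.2, 0.5, 1.0][category] * (1.0 - (x - 0.7) ** 2)


bandit = Exp3(n_arms=3)
rng = random.Random(1)
best = None
for _ in range(200):
    h = bandit.select()                 # bandit picks the categorical value
    x = rng.random()                    # stand-in for maximising an acquisition over x
    y = toy_objective(h, x)             # already scaled to [0, 1]
    bandit.update(h, y)
    if best is None or y > best[0]:
        best = (y, h, x)
```

In a real setting the random draw for `x` would be replaced by maximising an acquisition function of a GP surrogate, conditioned on the category chosen by the bandit.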

Methods

- The authors compared CoCaBO against existing methods that can handle mixed-type inputs: GP-based Bayesian optimisation with one-hot encoding (One-hot BO) [4], SMAC [19] and TPE [5].
- For CoCaBO, the authors used a Matérn-5/2 kernel for the continuous-input kernel k_x and the indicator-based kernel discussed in Section 4.2 for the categorical-input kernel k_h. One-hot BO also used a Matérn-5/2 kernel.
- For both CoCaBO and One-hot BO, the authors optimised the GP hyperparameters by maximising the log marginal likelihood every 10 iterations using multi-started gradient descent (see Appendix C for details).
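
The two kernels named above can be sketched in a few lines. The Matérn-5/2 form is the standard one. The indicator kernel here is assumed to be the fraction of categorical inputs on which two points agree, and `mixed_kernel` assumes a weighted mix of the sum and the product of the two kernels. The summary does not state either form, so both are illustrative assumptions, as are the function names and the `mix` parameter.

```python
import math


def matern52(x, y, lengthscale=1.0, variance=1.0):
    """Standard Matérn-5/2 kernel between two continuous input vectors."""
    r = math.dist(x, y) / lengthscale
    s = math.sqrt(5.0) * r
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


def indicator_kernel(h, g, variance=1.0):
    """Assumed form: fraction of categorical inputs on which h and g agree."""
    matches = sum(1 for a, b in zip(h, g) if a == b)
    return variance * matches / len(h)


def mixed_kernel(x, h, y, g, mix=0.5):
    """Assumed combination: weighted mix of the sum and product of k_x and k_h."""
    kx = matern52(x, y)
    kh = indicator_kernel(h, g)
    return (1.0 - mix) * (kx + kh) + mix * kx * kh


# Two hypothetical configurations: same continuous values, one category differs.
x_a, h_a = [0.1, 0.5], ["relu", "adam"]
x_b, h_b = [0.1, 0.5], ["tanh", "adam"]
similarity = mixed_kernel(x_a, h_a, x_b, h_b)
```

The sum term lets two points share information when only their continuous parts or only their categorical parts are similar, while the product term rewards points that are similar in both.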

Conclusion

- Existing BO literature uses one-hot transformations or hierarchical approaches to encode real-world problems involving mixed continuous and categorical inputs.
- The authors' method uses a new kernel structure, which allows it to capture information within categories as well as across different categories.
- This leads to more efficient use of the acquired data and improved modelling power.
- CoCaBO demonstrated strong performance over existing methods on a variety of synthetic and real-world optimisation tasks with multiple continuous and categorical inputs.
- The authors find CoCaBO to offer a very competitive alternative to existing approaches

Tables

- Table 1: Mean and standard error of the predictive log likelihood of the CoCaBO and One-hot BO surrogates on synthetic test functions. Both models were trained on 250 samples and evaluated on 100 test points. The CoCaBO surrogate models the function surface better than the One-hot surrogate as the number of categorical variables increases.
- Table 2: Categorical and continuous inputs to be optimised for the real-world tasks. N_i in parentheses indicates the number of choices available for each categorical input.
- Table 3: Notation list.
- Table 4: Continuous and categorical input ranges of the synthetic test functions.
- Table 5: Continuous and categorical input ranges of the real-world problems.

Related work

3.1 One-hot encoding

A common method for dealing with categorical variables is to transform them into a one-hot encoded representation, where a variable with N choices is transformed into a vector of length N with a single non-zero element. This is the approach followed by popular BO packages such as Spearmint [29] and GPyOpt [15, 4].

There are two main drawbacks with this approach. First, the commonly-used RBF (squared exponential, radial basis function) and Matérn kernels in the GP surrogate assume that f is continuous and differentiable in the input space, which is clearly not the case for one-hot encoded variables, as the objective is only defined for a small subspace within this representation.

The second drawback is that the acquisition function is optimised as a continuous function. By using this extended representation, we are turning the optimisation into a significantly harder problem due to the increased dimensionality of the search space. Additionally, the one-hot encoding makes our problem sparse, especially when we have multiple categories, each with multiple choices. This causes distances between inputs to become large, reducing the usefulness of the surrogate at such locations. As a result, the optimisation landscape is characterised by many flat regions, making it difficult to optimise [24].
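
Both drawbacks can be seen in a small, self-contained sketch. The category sizes and helper names below are hypothetical and chosen only for illustration: the code encodes three categorical variables, counts how few points of the enlarged space are valid encodings, and shows that a point a continuous optimiser could propose need not correspond to any category.

```python
import math


def one_hot(choice, n_choices):
    """Length-n_choices vector with a single non-zero element."""
    vec = [0.0] * n_choices
    vec[choice] = 1.0
    return vec


def encode(choices, sizes):
    """Concatenate the one-hot vectors of several categorical variables."""
    encoded = []
    for choice, n_choices in zip(choices, sizes):
        encoded.extend(one_hot(choice, n_choices))
    return encoded


def is_valid(vec, sizes):
    """True only if every block has exactly one 1 and zeros elsewhere."""
    start = 0
    for n_choices in sizes:
        block = vec[start:start + n_choices]
        if sorted(block) != [0.0] * (n_choices - 1) + [1.0]:
            return False
        start += n_choices
    return True


sizes = [3, 4, 5]                 # hypothetical: three categorical inputs
encoded_dim = sum(sizes)          # 12 continuous dimensions after encoding
n_valid = math.prod(sizes)        # only 60 points in that space are meaningful

a = encode([0, 1, 2], sizes)
b = encode([1, 1, 2], sizes)      # differs from a in one variable
c = encode([1, 2, 3], sizes)      # differs from a in all three variables

# A point a continuous optimiser could propose, halfway between a and b.
midpoint = [(u + v) / 2 for u, v in zip(a, b)]
```

Here three categorical inputs become a 12-dimensional search space in which only 60 points are valid, and `is_valid(midpoint, sizes)` is false. The Euclidean distance between two valid encodings depends only on how many variables differ (it is the square root of twice that count), so the encoding carries no finer notion of similarity between choices.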

References

- Ahsan S Alvi, Binxin Ru, Jan Calliess, Stephen J Roberts, and Michael A Osborne. Asynchronous batch Bayesian optimisation with improved local penalisation. arXiv preprint arXiv:1901.10452, 2019.
- Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
- The GPyOpt authors. GPyOpt: A Bayesian optimization framework in python. http://github.com/SheffieldML/GPyOpt, 2016.
- J S Bergstra, R Bardenet, Y Bengio, and B Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.
- Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
- Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
- Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
- Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases, pages 225–240.
- Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
- David Duvenaud, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In International Conference on Machine Learning, pages 1166–1174, 2013.
- Peter I Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
- Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016.
- David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, pages 131–162.
- Javier González, Zhenwen Dai, Philipp Hennig, and Neil D Lawrence. Batch Bayesian optimization via local penalization. In International Conference on Artificial Intelligence and Statistics, pages 648–657, 2016.
- Shivapratap Gopakumar, Sunil Gupta, Santu Rana, Vu Nguyen, and Svetha Venkatesh. Algorithmic assurance: An active approach to algorithmic testing using Bayesian optimisation. In Advances in Neural Information Processing Systems, pages 5465–5473, 2018.
- Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.
- José Miguel Hernández-Lobato, Michael Gelbart, Matthew Hoffman, Ryan Adams, and Zoubin Ghahramani. Predictive entropy search for Bayesian optimization with unknown constraints. In International Conference on Machine Learning, pages 1699–1707, 2015.
- Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pages 507–523.
- Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pages 295–304, 2015.
- Brian Kulis and Michael I Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. arXiv preprint arXiv:1111.0352, 2011.
- Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Santu Rana, Cheng Li, Sunil Gupta, Vu Nguyen, and Svetha Venkatesh. High dimensional Bayesian optimization with elastic Gaussian process. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2883–2891, 2017.
- C E Rasmussen and C K I Williams. Gaussian processes for machine learning. 2006.
- Binxin Ru, Michael Osborne, and Mark McLeod. Fast information-theoretic Bayesian optimisation. In International Conference on Machine Learning, 2018.
- Amar Shah and Zoubin Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3312–3320, 2015.
- Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
- Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
- Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, 2010.
- Jian Wu and Peter I. Frazier. The parallel knowledge gradient method for batch Bayesian optimization. In NIPS, 2016.
