Google Vizier: A Service for Black-Box Optimization
KDD, pp. 1487-1495, 2017.
Abstract:
Any sufficiently complex system acts as a black box when it becomes easier to experiment with than to understand. Hence, black-box optimization has become increasingly important as systems have become more complex. In this paper we describe Google Vizier, a Google-internal service for performing black-box optimization that has become the de facto parameter tuning engine at Google.
Introduction
- Black-box optimization is the task of optimizing an objective function f : X → R with a limited budget for evaluations.
- Black-box optimization algorithms can be used to find the best operating parameters for any system whose performance can be measured as a function of adjustable parameters (a minimal sketch of this setting appears after this list).
- It has many important applications, such as automated tuning of the hyperparameters of machine learning systems and optimization of the user interfaces of web services (e.g., optimizing colors and fonts).
- In this paper the authors discuss a state-of-the-art system for black-box optimization developed within Google, called Google Vizier, named after a high official who offers advice to rulers.
- It is a service for black-box optimization that supports several advanced algorithms.
- The authors discuss the architecture of the system, design choices, and some of the algorithms used
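To make the setting above concrete, here is a minimal sketch of black-box optimization with plain Random Search under a fixed evaluation budget; the toy objective and bounds are illustrative placeholders, not anything taken from the paper.

```python
# Minimal sketch: the optimizer may only evaluate f at chosen points,
# under a limited budget of evaluations.
import random

def f(x):  # black-box objective f : X -> R (illustrative toy function)
    return (x[0] - 0.3) ** 2 + (x[1] + 0.7) ** 2

bounds = [(-1.0, 1.0), (-1.0, 1.0)]  # feasible region X
budget = 50                          # limited budget of evaluations

best_x, best_y = None, float("inf")
for _ in range(budget):
    x = [random.uniform(lo, hi) for lo, hi in bounds]  # propose a point
    y = f(x)                                           # one (possibly expensive) evaluation
    if y < best_y:
        best_x, best_y = x, y

print("best point:", best_x, "objective:", best_y)
```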
Highlights
- Black-box optimization is the task of optimizing an objective function f : X → R with a limited budget for evaluations
- To evaluate the performance of Google Vizier we require functions that can be used to benchmark the results. These are pre-selected, calculated functions with known optimal points that have proven challenging for black-box optimization algorithms
- 4.2 Empirical Results: In Figure 6 we look at result quality for four optimization algorithms currently implemented in the Vizier framework: a multi-armed bandit technique using a Gaussian process regressor [29], the SMAC algorithm [19], the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [16], and a probabilistic search method of our own (a toy sketch of the Gaussian-process approach appears after this list)
- While some authors have claimed that 2×Random Search is highly competitive with Bayesian Optimization methods [20], our data suggests this is only true when the dimensionality of the problem is sufficiently high
- We found that the use of the performance curve stopping rule resulted in achieving optimality gaps comparable to those achieved without the stopping rule, while using approximately 50% fewer CPU-hours when tuning hyperparameters for deep neural networks
- It has already proven to be a valuable platform for research and development, and we expect it will only grow more so as the area of black-box optimization grows in importance
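As a rough illustration of the first of those algorithms, the sketch below runs a Gaussian-process bandit loop with an Expected Improvement acquisition; it is not the authors' implementation, and the kernel, candidate sampling, and toy objective are assumptions made for the example.

```python
# Toy GP-bandit loop with an Expected Improvement acquisition (minimization).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):  # stand-in black-box function (illustrative)
    return np.sin(3 * x[0]) + 0.5 * x[0] ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(3, 1))              # a few random initial trials
y = np.array([objective(x) for x in X])

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)                                 # refit posterior on all trials
    cand = rng.uniform(-2, 2, size=(500, 1))     # random candidate points
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    x_next = cand[np.argmax(ei)]                 # most promising candidate
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next))

print("best observed value:", y.min())
```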
Methods
- Design Goals and Constraints
Vizier's design satisfies several desiderata, the first being ease of use:
- The authors implemented Vizier as a managed service that stores the state of each optimization.
- This approach drastically reduces the effort a new user needs to get up and running, and a managed service with a well-documented and stable RPC API lets the team upgrade the service without any user effort.
- The authors chose to make the algorithms stateless, so that they can seamlessly switch algorithms during a study (a toy sketch of this stateless design appears after this list).
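A minimal sketch of that stateless-algorithm design choice, assuming an in-process stand-in for the managed service: all trial state lives in the store, each suggestion algorithm is a pure function of that stored state, and the algorithm can therefore be swapped mid-study. The class and function names here are illustrative, not the actual Vizier RPC API.

```python
import random

def random_search(trials, bounds):
    # ignores history entirely; every suggestion is an independent draw
    return [random.uniform(lo, hi) for lo, hi in bounds]

def local_perturbation(trials, bounds, scale=0.1):
    # perturb the best completed trial; fall back to random if none exist
    done = [t for t in trials if t["objective"] is not None]
    if not done:
        return random_search(trials, bounds)
    best = min(done, key=lambda t: t["objective"])["params"]
    return [min(hi, max(lo, p + random.gauss(0, scale * (hi - lo))))
            for p, (lo, hi) in zip(best, bounds)]

class StudyStore:
    """Stand-in for the managed service: it owns all optimization state."""
    def __init__(self, bounds):
        self.bounds, self.trials = bounds, []

    def suggest(self, algorithm):
        params = algorithm(self.trials, self.bounds)   # algorithm holds no state
        self.trials.append({"params": params, "objective": None})
        return len(self.trials) - 1, params

    def complete(self, trial_id, objective):
        self.trials[trial_id]["objective"] = objective

study = StudyStore(bounds=[(-5.0, 5.0)])
for step in range(20):
    algo = random_search if step < 10 else local_perturbation  # seamless switch
    tid, params = study.suggest(algo)
    study.complete(tid, (params[0] - 1.0) ** 2)                 # evaluate the trial
```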
Results
- To evaluate the performance of Google Vizier the authors require functions that can be used to benchmark the results
- These are pre-selected, calculated functions with known optimal points that have proven challenging for black-box optimization algorithms.
- A good black-box optimizer applied to the Rastrigin function might achieve an optimality gap of 160, while simple random sampling of the Beale function can quickly achieve an optimality gap of 60 [10] (a sketch of this optimality-gap computation appears after this list).
- One can see that transfer learning from one study to the next leads to steady progress towards the optimum, as the stack of regressors gradually builds up information about the shape of the objective function.
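A sketch of how such an optimality gap can be measured on two of the named benchmark functions, using a Random Search baseline; the evaluation budget and search bounds are illustrative choices, not the paper's experimental setup.

```python
# Synthetic benchmarks with known optima let us report an "optimality gap":
# the best value found minus the known minimum of the function.
import math
import random

def rastrigin(x):   # global minimum 0 at the origin
    return 10 * len(x) + sum(xi ** 2 - 10 * math.cos(2 * math.pi * xi) for xi in x)

def beale(x):       # global minimum 0 at (3, 0.5)
    a, b = x
    return ((1.5 - a + a * b) ** 2
            + (2.25 - a + a * b ** 2) ** 2
            + (2.625 - a + a * b ** 3) ** 2)

def optimality_gap(f, bounds, known_min, budget=1000):
    # best value found by plain random sampling, relative to the known optimum
    best = min(f([random.uniform(lo, hi) for lo, hi in bounds])
               for _ in range(budget))
    return best - known_min

print("Rastrigin gap:", optimality_gap(rastrigin, [(-5.12, 5.12)] * 8, 0.0))
print("Beale gap:    ", optimality_gap(beale, [(-4.5, 4.5)] * 2, 0.0))
```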
Conclusion
- The authors have presented the design for Vizier, a scalable, state-of-the-art internal service for black-box optimization within Google, explained many of its design choices, and described its use cases and benefits.
- It has already proven to be a valuable platform for research and development, and the authors expect it will only grow more so as the area of black-box optimization grows in importance.
- It designs excellent cookies, which is a very rare capability among computational systems
Related work
- Black-box optimization makes minimal assumptions about the problem under consideration, and thus is broadly applicable across many domains; it has been studied in multiple scholarly fields under names including Bayesian Optimization [2, 25, 26], Derivative-free Optimization [7, 24], Sequential Experimental Design [5], and assorted variants of the multiarmed bandit problem [13, 20, 29].
Several classes of algorithms have been proposed for the problem. The simplest of these are non-adaptive procedures such as Random Search, which selects x_t uniformly at random from X at each time step t, independent of the previously selected points {x_τ : 1 ≤ τ < t}, and Grid Search, which selects points along a grid (i.e., the Cartesian product of finite sets of feasible values for each parameter). Classic algorithms such as Simulated Annealing and assorted genetic algorithms have also been investigated, e.g., Covariance Matrix Adaptation [16].
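A small illustration of these two non-adaptive baselines; the parameter names and value sets are made up for the example.

```python
# Grid Search enumerates the Cartesian product of per-parameter value sets,
# while Random Search draws each point independently of earlier ones.
import itertools
import random

learning_rates = [1e-4, 1e-3, 1e-2]
batch_sizes = [32, 64, 128]

grid_points = list(itertools.product(learning_rates, batch_sizes))   # 9 fixed points

random_points = [(10 ** random.uniform(-4, -2), random.choice(batch_sizes))
                 for _ in range(9)]                                   # 9 independent draws
```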
Another class of algorithms performs a local search by selecting points that maintain a search pattern, such as a simplex in the case of the classic Nelder–Mead algorithm [22]. More modern variants of these algorithms maintain simple models of the objective f within a subset of the feasible region (called the trust region), and select a point x_t to improve the model within the trust region [7].
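The classic Nelder–Mead simplex method is available off the shelf, for example in SciPy; this short sketch applies it to the Rosenbrock test function, an illustrative choice rather than one used in the paper.

```python
# Nelder-Mead local search on the Rosenbrock function (minimum 0 at (1, 1)).
from scipy.optimize import minimize, rosen

result = minimize(rosen, x0=[0.0, 0.0], method="Nelder-Mead")
print(result.x, result.fun)   # should approach (1, 1) with value near 0
```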
References
- Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. 2013. Collaborative hyperparameter tuning. ICML 2 (2013), 199.
- James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems. 2546–2554.
- J Bernardo, MJ Bayarri, JO Berger, AP Dawid, D Heckerman, AFM Smith, and M West. 2011. Optimization under unknown constraints. Bayesian Statistics 9 9 (2011), 229.
- Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. 2011. D3 data-driven documents. IEEE transactions on visualization and computer graphics 17, 12 (2011), 2301–2309.
- Herman Chernoff. 1959. Sequential Design of Experiments. Ann. Math. Statist. 30, 3 (09 1959), 755–770. https://doi.org/10.1214/aoms/1177706205
- Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. 2017. Capacity and Trainability in Recurrent Neural Networks. In Proceedings of the International Conference on Learning Representations (ICLR).
- Andrew R Conn, Katya Scheinberg, and Luis N Vicente. 2009. Introduction to derivative-free optimization. SIAM.
- Thomas Desautels, Andreas Krause, and Joel W Burdick. 2014. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research 15, 1 (2014), 3873–3923.
- Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. 2015. Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves.. In IJCAI. 3460–3468.
- Steffen Finck, Nikolaus Hansen, Raymond Rost, and Anne Auger. 2009. Real-Parameter Black-Box Optimization Benchmarking 2009: Presentation of the Noiseless Functions. http://coco.gforge.inria.fr/lib/exe/fetch.php?media=download3.6:bbobdocfunctions.pdf. (2009).[Online].
- Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John P Cunningham. 2014. Bayesian Optimization with Inequality Constraints.. In ICML. 937–945.
- Michael A Gelbart, Jasper Snoek, and Ryan P Adams. 2014. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 250–259.
- Josep Ginebra and Murray K. Clayton. 1995. Response Surface Bandits. Journal of the Royal Statistical Society. Series B (Methodological) 57, 4 (1995), 771–784. http://www.jstor.org/stable/2345943
- Google. 2017. Polymer: Build modern apps using web components. https://github.com/Polymer/polymer. (2017).[Online].
- Google. 2017. Protocol Buffers: Google’s data interchange format. https://github.com/google/protobuf. (2017).[Online].
- Nikolaus Hansen and Andreas Ostermeier. 2001. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation 9, 2 (2001), 159–195.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
- Julian Heinrich and Daniel Weiskopf. 2013. State of the Art of Parallel Coordinates.. In Eurographics (STARs). 95–116.
- Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2011. Sequential modelbased optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 507–523.
- Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2016. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. CoRR abs/1603.06560 (2016). http://arxiv.org/abs/1603.06560
- J Moćkus, V Tiesis, and A Źilinskas. 1978. The Application of Bayesian Methods for Seeking the Extremum. Vol. 2. Elsevier. 117–128 pages.
- John A Nelder and Roger Mead. 1965. A simplex method for function minimization. The computer journal 7, 4 (1965), 308–313.
- Carl Edward Rasmussen and Christopher K. I. Williams. 2005. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
- Luis Miguel Rios and Nikolaos V Sahinidis. 2013. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization 56, 3 (2013), 1247–1293.
- Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. 2016. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 104, 1 (2016), 148–175.
- Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems. 2951–2959.
- Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. 2015. Scalable Bayesian Optimization Using Deep Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015 (JMLR Workshop and Conference Proceedings), Francis R. Bach and David M. Blei (Eds.), Vol. 37. JMLR.org, 2171–2180. http://jmlr.org/proceedings/papers/v37/snoek15.html
- Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. 2016. Bayesian Optimization with Robust Bayesian Neural Networks. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 4134–4142. http://papers.nips.cc/paper/6117-bayesian-optimization-with-robust-bayesian-neural-networks.pdf
- Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML (2010).
- Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. 2014. Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896 (2014).
- Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. 2016. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. 370–378.
- Dani Yogatama and Gideon Mann. 2014. Efficient Transfer Learning Method for Automatic Hyperparameter Tuning. JMLR: W&CP 33 (2014), 1077–1085.
- Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).