# Non-stochastic Best Arm Identification and Hyperparameter Optimization.

International Conference on Artificial Intelligence and Statistics (2016)

Abstract

Motivated by the task of hyperparameter optimization, we introduce the non-stochastic best-arm identification problem. Within the multi-armed bandit literature, the cumulative regret objective enjoys algorithms and analyses for both the non-stochastic and stochastic settings while, to the best of our knowledge, the best-arm identificatio...

Introduction

- As supervised learning methods are becoming more widely adopted, hyperparameter optimization has become increasingly important to simplify and speed up the development of data processing pipelines while simultaneously yielding more accurate models.
- The authors propose a simple and intuitive algorithm that makes no assumptions on the convergence behavior of the losses, requires no inputs or free-parameters to adjust, provably outperforms the baseline method in favorable conditions and performs comparably otherwise, and empirically identifies good hyperparameters an order of magnitude faster than the baseline method on a variety of problems.

Highlights

- A few recent works have made attempts to exploit intermediate results. These works either require explicit forms for the convergence rate behavior of the iterates, which is difficult to accurately characterize for all but the simplest cases [7, 8], or focus on heuristics lacking theoretical underpinnings [9]. We build upon these previous works, and in particular study the multi-armed bandit formulation proposed in [7] and [9], where each arm corresponds to a fixed hyperparameter setting, pulling an arm corresponds to a fixed number of training iterations, and the loss corresponds to an intermediate loss on some hold-out set.
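
The bandit formulation described in this section can be illustrated with a minimal sketch. It assumes a toy one-parameter model trained by gradient descent; the `Arm` class, the step-size hyperparameter, and the quadratic loss are illustrative stand-ins and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    """One arm = one fixed hyperparameter setting (here, a step size)."""
    step_size: float
    w: float = 0.0        # model parameter, trained incrementally
    iterations: int = 0   # total training iterations run so far

    def pull(self, k: int = 1) -> float:
        """Run k more training iterations and return the intermediate loss."""
        for _ in range(k):
            grad = 2.0 * (self.w - 3.0)      # gradient of the toy loss (w - 3)^2
            self.w -= self.step_size * grad
        self.iterations += k
        return (self.w - 3.0) ** 2           # stand-in for a hold-out loss


# Three hypothetical hyperparameter settings, each pulled for 5 iterations.
arms = [Arm(step_size=s) for s in (0.01, 0.1, 0.4)]
losses = [arm.pull(5) for arm in arms]
```

Because each arm keeps its own training state, a later pull resumes where the previous one stopped, which is what lets an algorithm compare intermediate losses without retraining from scratch.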

Results

- The fixed confidence setting takes an input $\delta \in (0, 1)$ and guarantees to output the best arm with probability at least $1 - \delta$ while attempting to minimize the number of total arm pulls.
- In practice $B \gg n$, and Successive Halving is an attractive option because, along with the baseline, it is the only algorithm that observes losses proportional to the number of arms and independent of the budget.
- The proposed Successive Halving algorithm of Figure 3 was originally introduced for stochastic best arm identification by [15].
- If the Successive Halving algorithm is run with any budget $B > z_{SH}$, the best arm is guaranteed to be returned.
- If the Successive Halving algorithm is bootstrapped by the “doubling trick”, which takes no arguments as input, this procedure returns the best arm once the total number of iterations taken exceeds just $2z_{SH}$.
- Remark 2: If the authors consider the second, looser representation of $z_{SH}$ on the right-hand side of the inequality in Theorem 1 and multiply this quantity by $\frac{n-1}{n-1}$, they see that the sufficient number of pulls for the Successive Halving algorithm with doubling essentially behaves like $(n-1)\log_2(n)$ times the average $\frac{1}{n-1}\sum_{i=2,\dots,n}$ ...
- A sufficient budget for the Successive Halving algorithm with doubling to identify the best arm is just $16n\lceil\log_2(n)\rceil$ max log(n2 max), while the uniform strategy would require a budget of at least 2n2 max log(n2 max).
- An anytime performance guarantee answers this question by showing that, in such cases, Successive Halving is comparable to the baseline method modulo log factors.
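
The results above concern Successive Halving with a fixed budget and with the doubling trick. The following is a hedged sketch of both, not the paper's exact pseudocode: the per-round allocation, the choice to keep the better half rounded up, the starting budget of the doubling loop, and the `ToyArm` loss sequence `nu + c / t` are all assumptions made for illustration.

```python
import math


class ToyArm:
    """Hypothetical arm whose loss sequence nu + c / t converges to nu."""

    def __init__(self, nu, c):
        self.nu, self.c, self.t = nu, c, 0

    def pull(self, k):
        """Advance k iterations and return the current loss."""
        self.t += k
        return self.nu + self.c / self.t


def successive_halving(arms, budget):
    """Return the index of the surviving arm, spending at most `budget` pulls."""
    n = len(arms)
    rounds = max(1, math.ceil(math.log2(n)))
    survivors = list(range(n))
    for _ in range(rounds):
        # Split this round's share of the budget evenly over the survivors.
        r = budget // (len(survivors) * rounds)
        if r == 0:
            raise ValueError("budget too small for this many arms")
        losses = {i: arms[i].pull(r) for i in survivors}
        survivors.sort(key=losses.get)
        # Keep the better half (rounded up, an assumption of this sketch).
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]


def doubling_trick(make_arms, max_total_pulls):
    """Rerun Successive Halving from scratch with budgets B, 2B, 4B, ..."""
    n = len(make_arms())
    budget = n * max(1, math.ceil(math.log2(n)))  # smallest budget with r >= 1
    spent, best = 0, None
    while spent + budget <= max_total_pulls:
        best = successive_halving(make_arms(), budget)  # fresh arms per restart
        spent += budget
        budget *= 2
    return best


def make_arms():
    # Arm 1 has the smallest limit (0.1) but converges slowly (c = 5).
    return [ToyArm(0.5, 1), ToyArm(0.1, 5), ToyArm(0.3, 1), ToyArm(0.9, 1)]
```

In this toy setup a budget of 8 eliminates the slowly converging best arm in the first round, while a budget of 400 keeps it long enough to win; the doubling loop reaches a large enough budget without being told what that budget is.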

Conclusion

- Applying Theorem 4 to the specific problem studied in Examples 1 and 2 shows that both Successive Halving and uniform allocation satisfy $\nu_{\hat{i}} - \nu_1 \le \tilde{O}(n/B)$ in this particular setting, where $\hat{i}$ is the output of either algorithm and $\tilde{O}$ suppresses polylog factors.
- Since Theorem 1 is in terms of $\max_i \gamma_i(t)$, such a lower bound would be in stark contrast to the stochastic setting, where arms with smaller variances have smaller envelopes and existing algorithms exploit this fact [27].
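
For comparison, the uniform allocation baseline mentioned above can be sketched as follows. This assumes the usual reading of the baseline (give every arm the same number of pulls, then return the arm with the smallest final loss); the helper `make_loss` and the loss curves `nu + c / t` are hypothetical.

```python
def make_loss(nu, c):
    """Hypothetical loss curve: the loss after t iterations is nu + c / t."""
    return lambda t: nu + c / t


def uniform_allocation(loss_fns, budget):
    """Give every arm budget // n pulls and return the arm with the lowest loss."""
    pulls = budget // len(loss_fns)
    if pulls == 0:
        raise ValueError("budget too small for this many arms")
    final = [f(pulls) for f in loss_fns]
    return min(range(len(final)), key=final.__getitem__)


loss_fns = [make_loss(0.5, 1), make_loss(0.1, 5),
            make_loss(0.3, 1), make_loss(0.9, 1)]
small_budget_choice = uniform_allocation(loss_fns, 40)
large_budget_choice = uniform_allocation(loss_fns, 400)
```

With 10 pulls per arm the slowly converging arm 1 still looks worse than arm 2; with 100 pulls per arm it is correctly identified as the best.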

- Table 1: Number of observed losses by the algorithm after B time steps with n arms. (B), (C), or (R) indicate fixed budget, fixed confidence, or cumulative regret, respectively.

Related work

- Despite dating back to the late 1950s, the best arm identification problem for the stochastic setting has experienced a surge of activity in the last decade. The work has two major branches: the fixed budget setting and the fixed confidence setting. In the fixed budget setting, the algorithm is given a set of arms and a budget B and is tasked with maximizing the probability of identifying the best arm by pulling arms without exceeding the total budget. While these algorithms were developed for and analyzed in the stochastic setting, they exhibit attributes that are amenable to the non-stochastic setting. In fact, the algorithm we propose to use in this paper is the Successive Halving algorithm of [15], though the non-stochastic setting requires its own novel analysis that we present in Section 3. Successive Rejects [16] is another fixed budget algorithm that we empirically evaluate.
- We aim to leverage the iterative nature of standard learning algorithms to speed up hyperparameter optimization in a robust and principled fashion. It is clear that no algorithm can provably identify a hyperparameter with a value within $\epsilon$ of the optimal without known, explicit functions $\gamma_i$, which means no algorithm can reject a hyperparameter setting with absolute confidence without making potentially strong assumptions. In [8], $\gamma_i$ functions are defined in an ad-hoc, algorithm-specific, and data-specific fashion, which leads to strong $\epsilon$-good claims. A related line of work defines $\gamma_i$-like functions for optimizing the computational efficiency of structural risk minimization, yielding bounds [7]. We stress that these results are only as good as the tightness and correctness of the $\gamma_i$ bounds. If the $\gamma_i$ functions are chosen to decrease too rapidly, a procedure might throw out good arms too early, and if chosen to decrease too slowly, a procedure will be overly conservative. Moreover, properly tuning these special-purpose approaches can be an onerous task for non-experts. We view our work as an empirical, data-driven approach to the pursuits of [7]. Also, [9] empirically studies an early stopping heuristic similar in spirit to Successive Halving.

Funding

- KJ is generously supported by ONR awards N00014-15-1-2620 and N00014-13-1-0129
- AT is supported in part by a Google Faculty Award and an AWS in Education Research Grant award

Study subjects and analysis

samples: 4

We chose $d \in [2, 50]$ and a second hyperparameter in $[0.01, 3]$ uniformly at random on a linear scale, and a third hyperparameter in $[10^{-6}, 10^{0}]$ uniformly at random on a log scale. Each hyperparameter is given 4 samples, resulting in $4^3 = 64$ total arms. We performed 32 trials and used mean-squared error as the loss function.
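
The construction of the 64 arms can be sketched as below. This is a hedged illustration: it assumes the arms are the Cartesian product of 4 independently sampled values per hyperparameter, keeps `d` real-valued, and uses the placeholder names `hp2_values` and `hp3_values` for the two hyperparameters whose symbols are not recoverable here.

```python
import itertools
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def sample_linear(low, high, k):
    """Draw k values uniformly at random on a linear scale."""
    return [rng.uniform(low, high) for _ in range(k)]


def sample_log(low, high, k):
    """Draw k values uniformly at random on a base-10 log scale."""
    lo, hi = math.log10(low), math.log10(high)
    return [10 ** rng.uniform(lo, hi) for _ in range(k)]


samples_per_hyperparameter = 4
d_values = sample_linear(2, 50, samples_per_hyperparameter)
hp2_values = sample_linear(0.01, 3, samples_per_hyperparameter)  # placeholder name
hp3_values = sample_log(1e-6, 1.0, samples_per_hyperparameter)   # placeholder name

# Each arm is one combination of sampled values: 4 * 4 * 4 = 64 arms.
arms = list(itertools.product(d_values, hp2_values, hp3_values))
```

Sampling on a log scale draws the exponent uniformly, so values near $10^{-6}$ are as likely as values near $10^{0}$, which a linear draw over the same interval would almost never produce.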

samples: 10

Kernel SVM: We now consider learning an RBF-kernel SVM using Pegasos [23], with an $\ell_2$ penalty hyperparameter in $[10^{-6}, 10^{0}]$ and a kernel width in $[10^{0}, 10^{3}]$, both chosen uniformly at random on a log scale per trial. Each hyperparameter was allocated 10 samples, resulting in $10^2 = 100$ total arms. The experiment was repeated for 64 trials using 0/1 loss.

Reference

- Sebastien Bubeck, Remi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory, pages 23–37.
- Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical bayesian optimization of machine learning algorithms. In NIPS, 2012.
- Jasper Snoek, Kevin Swersky, Richard Zemel, and Ryan Adams. Input warping for bayesian optimization of non-stationary functions. In ICML, 2014.
- Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration. 2011.
- James Bergstra, Remi Bardenet, Yoshua Bengio, and Balazs Kegl. Algorithms for Hyper-Parameter Optimization. NIPS, 2011.
- James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. JMLR, 2012.
- Alekh Agarwal, Peter Bartlett, and John Duchi. Oracle inequalities for computationally adaptive model selection. COLT, 2011.
- Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. Freeze-thaw bayesian optimization. arXiv:1406.3896, 2014.
- Evan R Sparks, Ameet Talwalkar, Michael J. Franklin, Michael I. Jordan, and Tim Kraska. TuPAQ: An efficient planner for large-scale predictive analytic queries. In Symposium on Cloud Computing, 2015.
- Vincent A Cicirello and Stephen F Smith. The max k-armed bandit: A new model of exploration applied to search heuristic selection. In National Conference on Artificial Intelligence, volume 20, 2005.
- Andras Gyorgy and Levente Kocsis. Efficient multi-start strategies for local search algorithms. JAIR, 41, 2011.
- Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 2011.
- F. Pedregosa et al. Scikit-learn: Machine learning in Python. JMLR, 12, 2011.
- X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. JMLR-MLOSS, 2015.
- Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In ICML, 2013.
- Jean-Yves Audibert and Sebastien Bubeck. Best arm identification in multi-armed bandits. In COLT, 2010.
- Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. JMLR, 7, 2006.
- Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. Pac subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 655–662, 2012.
- Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. lil’ucb: An optimal exploration algorithm for multi-armed bandits. In COLT, 2014.
- Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
- Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.
- Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011.
- M. Lichman. UCI machine learning repository, 2013.
- Benjamin Recht and Christopher Re. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013.
- Emilie Kaufmann, Olivier Cappe, and Aurelien Garivier. On the complexity of best arm identification in multi-armed bandit models. JMLR, 2015.
- Aurelien Garivier and Olivier Cappe. The kl-ucb algorithm for bounded stochastic bandits and beyond. 2011.
