An Optimal Elimination Algorithm for Learning a Best Arm

NeurIPS 2020.

Keywords: sequential design, approximate best arm, sample complexity, pure exploration, multi-armed bandit problem

Abstract:

We consider the classic problem of $(\epsilon,\delta)$-PAC learning a best arm where the goal is to identify with confidence $1-\delta$ an arm whose mean is an $\epsilon$-approximation to that of the highest mean arm in a multi-armed bandit setting. This problem is one of the most fundamental problems in statistics and learning theory, ...


Introduction
  • In this paper the authors study the classic problem of $(\epsilon, \delta)$-PAC learning a best arm. In this problem there is a set $A$ of $n$ arms, and sampling an arm $a \in A$ generates a random variable $\xi(a)$ drawn from some unknown distribution $D(a)$ supported on $[0, 1]$.
  • The authors prove that the number of samples any elimination algorithm requires to $(\epsilon, \delta)$-learn a best arm is arbitrarily close to $\frac{n}{2\epsilon^2}\log\frac{1}{\delta}$.
  • The authors' results are in the standard $(\epsilon, \delta)$-PAC learning model, i.e. the goal is to find an $\epsilon$-best arm with probability $1 - \delta$, and sample complexity is measured in the worst case over all distributions supported on $[0, 1]$.
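As a point of reference for the bound above, the following minimal sketch (ours, not from the paper) computes the Hoeffding benchmark of $(\epsilon, \delta)$-estimating the mean of every arm separately; the parameter values are arbitrary.

```python
import math

def hoeffding_samples(eps: float, delta: float) -> int:
    """Samples m so that one [0,1]-bounded arm's empirical mean is within
    eps of its true mean with probability >= 1 - delta, via the two-sided
    Hoeffding bound P(|mu_hat - mu| >= eps) <= 2 * exp(-2 * m * eps**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Estimating all n means separately costs n * hoeffding_samples(eps, delta),
# which matches the benchmark n / (2 eps^2) * log(1/delta) up to
# lower-order terms as delta -> 0.
n, eps, delta = 1000, 0.1, 0.01
print(n * hoeffding_samples(eps, delta))          # 265000
print(n / (2 * eps ** 2) * math.log(1 / delta))   # ~230258.5
```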
Highlights
  • We prove that no elimination algorithm obtains sample complexity asymptotically lower than $\frac{n}{2\epsilon^2}\log\frac{1}{\delta}$
  • In this paper we study the classic problem of $(\epsilon, \delta)$-PAC learning a best arm
  • Our results are in the standard $(\epsilon, \delta)$-PAC learning model, i.e. the goal is to find an $\epsilon$-best arm with probability $1 - \delta$, and sample complexity is measured in the worst case over all distributions supported on $[0, 1]$
  • In Section 4 we present the APPROXIMATE BEST ARM LIKELIHOOD ESTIMATION BY HOEFFDING algorithm, whose sample complexity asymptotically matches the Hoeffding bound of estimating the mean of every arm separately
  • We describe the Approximate Best Arm Likelihood Estimation by Hoeffding (ABALEH) algorithm
Results
  • For exact best arm learning, the optimal sample complexity bounds for exponential distributions are achieved in [13].
  • When the number of arms $n$ is fixed, $\delta$ goes to 0, and the distribution is bounded in $[0, 1]$, a worst-case sample complexity bound can be trivially achieved via the naive elimination strategy.
  • On the other hand, when fixing $\delta$ and letting the number of arms grow, it is not clear what the asymptotic worst-case sample complexity of the problem is, and it cannot be deduced from instance-based analysis.
  • Implications: obtaining algorithms with dramatically lower sample complexity for a basic problem like learning a best arm can have several consequences.
  • In Section 4 the authors present the APPROXIMATE BEST ARM LIKELIHOOD ESTIMATION BY HOEFFDING algorithm, whose sample complexity asymptotically matches the Hoeffding bound of estimating the mean of every arm separately.
  • It is very likely that there is an $\epsilon_0$-close arm either in $A_T$ or in the random set $R$, and running NAÏVE ELIMINATION with appropriate parameters on $A_T \cup R$ will return an $\epsilon$-best arm with probability at least $1 - \delta$ (a structural sketch of this two-phase design follows this list).
  • When the authors run NAÏVE ELIMINATION with approximation $(1 - \alpha)\epsilon$ and confidence $\delta/e$, they are guaranteed that with probability at least $1 - \delta/e$ no arm that is $\epsilon$-far from $a^\star$ will have empirical mean higher than that of $a$.
  • For any $\lambda > 0$ there exist $\delta_0$ and $n_0$ such that for any $\delta < \delta_0$ and $n \geq n_0$, ABA $(\epsilon, \delta)$-learns a best arm with sample complexity at most $\frac{(2+\lambda)n}{\epsilon^2}\log\frac{1}{\delta}$.
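The bullets above outline ABA's two-phase design. The sketch below illustrates that structure only and is a hypothetical rendering, not the paper's algorithm: the phase-1 filter standing in for AGGRESSIVE ELIMINATION, its budget `m1`, and the survivor count `k` are invented placeholders, and the final call uses plain per-arm parameters rather than the paper's $(1 - \alpha)\epsilon$ and $\delta/e$ tuning.

```python
import math
import random

def naive_elimination(arms, pull, eps, delta):
    """Estimate each arm to accuracy eps/2 at confidence delta/len(arms)
    (Hoeffding + union bound) and return the best empirical arm."""
    m = math.ceil(math.log(2 * len(arms) / delta) / (2 * (eps / 2) ** 2))
    means = {a: sum(pull(a) for _ in range(m)) / m for a in arms}
    return max(means, key=means.get)

def aba_sketch(arms, pull, eps, delta, alpha=0.5):
    """Structural sketch of ABA; budgets and set sizes are hypothetical."""
    k = max(1, int(math.sqrt(len(arms))))         # hypothetical survivor count
    # Phase 1 (stand-in for AGGRESSIVE ELIMINATION): rough estimates on a
    # small budget, keeping the top-k empirical arms as the surviving set A_T.
    m1 = math.ceil(1 / (2 * (alpha * eps) ** 2))  # hypothetical budget
    rough = {a: sum(pull(a) for _ in range(m1)) / m1 for a in arms}
    a_t = sorted(arms, key=rough.get, reverse=True)[:k]
    # Pad with a uniformly random set R so that w.h.p. a near-best arm
    # appears in A_T or R, then let NAIVE ELIMINATION decide on the union.
    r = random.sample(arms, k)
    return naive_elimination(list(set(a_t) | set(r)), pull, eps, delta)
```

The point of the structure is that the expensive per-arm guarantee is paid only on the small set $A_T \cup R$ rather than on all $n$ arms; with `arms = list(range(1000))` and `pull` a Bernoulli sampler, `aba_sketch(arms, pull, 0.1, 0.01)` touches every arm only cheaply in phase 1.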
Conclusion
  • This algorithm (ABALEH) is a variant of ABA which achieves a sample complexity that is arbitrarily close to that of $(\epsilon, \delta)$-learning the mean of every arm.
  • Given Lemma 3, the proof follows in a similar manner to previous proofs, by bounding the sample complexity, approximation, and confidence of all sub-procedures.
  • For any given $\lambda < 1$ there is a $\delta_0$ such that for any $\delta < \delta_0$ and $n > 1/\delta$, ABALEH $(\epsilon, \delta)$-learns a best arm with sample complexity at most $\frac{(1+\lambda)n}{2\epsilon^2}\log\frac{1}{\delta}$.
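To make the two guarantees concrete, here is a quick numeric comparison of the leading terms quoted above (the $\delta_0$/$n_0$ thresholds and lower-order terms are omitted, and the parameter values are arbitrary):

```python
import math

# Leading-order forms of the bounds quoted above; lam is the slack lambda.
def aba_bound(n, eps, delta, lam):
    return (2 + lam) * n / eps ** 2 * math.log(1 / delta)

def abaleh_bound(n, eps, delta, lam):
    return (1 + lam) * n / (2 * eps ** 2) * math.log(1 / delta)

def per_arm_benchmark(n, eps, delta):
    return n / (2 * eps ** 2) * math.log(1 / delta)

n, eps, delta, lam = 10_000, 0.05, 1e-4, 0.1
print(f"ABA     <= {aba_bound(n, eps, delta, lam):.3e}")     # ~7.74e+07
print(f"ABALEH  <= {abaleh_bound(n, eps, delta, lam):.3e}")  # ~2.03e+07
print(f"benchmark  {per_arm_benchmark(n, eps, delta):.3e}")  # ~1.84e+07
# As lam -> 0, ABALEH converges to the per-arm benchmark, while ABA
# remains roughly a factor of 4 above it.
```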
Related work
  • The study of learning the best arm dates back to classic work by [7], and later by [1], [24], and [23]. More recently, $(\epsilon, \delta)$-PAC guarantees were studied in [10] and later in [11, 25]. Other variants of this problem have since been studied, including PAC learning a set of arms [4, 19, 22, 5] and the fixed-budget setting, where the goal is to minimize $\delta$ subject to a budget constraint on samples [4, 2, 12].

    Learning an $\epsilon$-best arm. As the state-of-the-art algorithm for $(\epsilon, \delta)$-PAC learning a best arm, MEDIAN ELIMINATION is widely used as a sub-procedure (e.g. [18, 20, 30, 17, 6, 28]). An improvement on its sample complexity, as suggested here, yields dramatically lower sample complexity for all procedures that employ MEDIAN ELIMINATION. The interesting regime in this problem setting is the one where $n$ is large, as otherwise one can use the naive sampling strategy of sampling each arm with approximation $\epsilon/2$ and confidence $\delta/n$ and selecting the arm with the largest empirical mean.
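A minimal sketch (ours, not from the paper) of what that naive strategy costs, illustrating the extra $\log n$ factor that makes the large-$n$ regime interesting; the parameter values are illustrative:

```python
import math

def naive_per_arm_budget(n: int, eps: float, delta: float) -> int:
    """Pulls per arm for the naive strategy above: estimate each arm to
    accuracy eps/2 at confidence delta/n (Hoeffding + union bound),
    i.e. ceil( log(2n/delta) / (2 * (eps/2)**2) )."""
    return math.ceil(math.log(2 * n / delta) / (2 * (eps / 2) ** 2))

# Total cost n * budget = Theta((n / eps^2) * log(n / delta)); the extra
# log(n) over the n / (2 eps^2) * log(1/delta) benchmark is what
# MEDIAN ELIMINATION -- and the algorithms in this paper -- avoid.
eps, delta = 0.1, 0.01
for n in (10, 10_000):
    print(n, n * naive_per_arm_budget(n, eps, delta))  # 15210; 29020000
```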
References
  • [1] Arthur E. Albert. The sequential design of experiments for infinitely many states of nature. The Annals of Mathematical Statistics, 32:774–799, 1961.
  • [2] Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, pages 41–53, 2010.
  • [3] P. Borjesson and C.-E. Sundberg. Simple approximations of the error function Q(x) for communications applications. IEEE Transactions on Communications, 27(3):639–643, 1979.
  • [4] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory, 20th International Conference, ALT 2009, Porto, Portugal, pages 23–37, 2009.
  • [5] Sébastien Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, pages 258–265, 2013.
  • [6] Wei Cao, Jian Li, Yufei Tao, and Zhize Li. On top-k selection in multi-armed bandits and hidden bipartite graphs. In Advances in Neural Information Processing Systems 28, pages 1036–1044. Curran Associates, Inc., 2015.
  • [7] Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.
  • [8] Rémy Degenne and Wouter M. Koolen. Pure exploration with multiple correct answers. In Advances in Neural Information Processing Systems, pages 14564–14573, 2019.
  • [9] Rémy Degenne, Wouter M. Koolen, and Pierre Ménard. Non-asymptotic pure exploration by solving games. In Advances in Neural Information Processing Systems, pages 14465–14474, 2019.
  • [10] Carlos Domingo, Ricard Gavaldà, and Osamu Watanabe. Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Mining and Knowledge Discovery, 6(2):131–152, 2002.
  • [11] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for reinforcement learning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), Washington, DC, USA, pages 162–169, 2003.
  • [12] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In NIPS, pages 3221–3229, 2012.
  • [13] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, pages 998–1027, 2016.
  • [14] Aurélien Garivier and Emilie Kaufmann. Non-asymptotic sequential tests for overlapping hypotheses and application to near optimal arm identification in bandit models. arXiv preprint arXiv:1905.03495, 2019.
  • [15] Torben Hagerup and Christine Rüb. Optimal merging and sorting on the EREW PRAM. Information Processing Letters, 33(4):181–185, 1989.
  • [16] Kevin G. Jamieson, Matthew Malloy, Robert D. Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, pages 423–439, 2014.
  • [17] Kevin G. Jamieson and Robert D. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 48th Annual Conference on Information Sciences and Systems, CISS 2014, Princeton, NJ, USA, pages 1–6, 2014.
  • [18] Shivaram Kalyanakrishnan and Peter Stone. Efficient selection of multiple bandit arms: Theory and practice. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, pages 511–518, 2010.
  • [19] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, 2012.
  • [20] Zohar Shay Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, pages 1238–1246, 2013.
  • [21] Julian Katz-Samuels and Kevin Jamieson. The true sample complexity of identifying good arms. arXiv preprint arXiv:1906.06594, 2019.
  • [22] Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In COLT 2013 - The 26th Annual Conference on Learning Theory, Princeton University, NJ, USA, pages 228–251, 2013.
  • [23] Robert Keener. Second order efficiency in the sequential design of experiments. The Annals of Statistics, 12(2):510–532, 1984.
  • [24] J. Kiefer and J. Sacks. Asymptotically optimum sequential inference and design. The Annals of Mathematical Statistics, 34(3):705–750, 1963.
  • [25] Shie Mannor and John N. Tsitsiklis. Lower bounds on the sample complexity of exploration in the multi-armed bandit problem. In Computational Learning Theory and Kernel Machines, COLT/Kernel 2003, Washington, DC, USA, pages 418–432, 2003.
  • [26] Daniel Russo. Simple Bayesian algorithms for best arm identification. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, pages 1417–1418, 2016.
  • [27] Max Simchowitz, Kevin G. Jamieson, and Benjamin Recht. Best-of-k-bandits. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, pages 1440–1489, 2016.
  • [28] Adish Singla, Sebastian Tschiatschek, and Andreas Krause. Noisy submodular maximization via adaptive sampling with applications to crowdsourced image collection summarization. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, pages 2037–2043, 2016.
  • [29] Eric V. Slud. Distribution inequalities for the binomial law. The Annals of Probability, 5(3):404–412, 1977.
  • [30] Tanguy Urvoy, Fabrice Clerot, Raphael Féraud, and Sami Naamane. Generic exploration and k-armed voting bandits. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pages II-91–II-99, 2013.