Sub-sampling for Efficient Non-Parametric Bandit Exploration

NeurIPS 2020


Abstract

In this paper we propose the first multi-armed bandit algorithm based on re-sampling that achieves asymptotically optimal regret simultaneously for different families of arms (namely Bernoulli, Gaussian and Poisson distributions). Unlike Thompson Sampling which requires to specify a different prior to be optimal in each case, our propos...
Introduction
  • A K-armed bandit problem is a sequential decision-making problem in which a learner sequentially samples from K unknown distributions called arms.
  • At each time step t, the learner selects an arm At ∈ {1, . . . , K} and obtains a random reward Xt drawn from the distribution of the chosen arm, which has mean μAt. The learner should adjust her sequential sampling strategy A = (At)t∈N in order to maximize the expected sum of rewards obtained after T selections.
  • Thompson Sampling (TS) is a randomized Bayesian algorithm that selects arms according to their posterior probability of being optimal.
  • These algorithms enjoy logarithmic regret under some assumptions on the arms, and some of them are even asymptotically optimal in that they attain the smallest possible asymptotic regret given by the lower bound of Lai & Robbins [9], for some parametric families of distributions.
  • For distributions that are continuously parameterized by their means, this lower bound states that under any uniformly efficient algorithm, lim inf_{T→∞} RT(A)/log(T) ≥ Σ_{k : μk < μ*} (μ* − μk)/kl(μk, μ*), where μ* denotes the largest mean and kl(μk, μ*) the Kullback–Leibler divergence between the arm distributions with means μk and μ*.
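
To make the setting and the bound concrete, here is a minimal Python sketch (not taken from the paper) that simulates a two-armed Bernoulli instance under a naive uniform policy and evaluates the Lai & Robbins constant appearing in the lower bound; the arm means, the seed and the uniform baseline are illustrative assumptions, while the horizon T = 20000 matches the one used in the paper's tables.

    import numpy as np

    def kl_bernoulli(p, q):
        # Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q).
        eps = 1e-12
        p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    rng = np.random.default_rng(0)
    means = [0.5, 0.6]          # illustrative two-armed Bernoulli instance
    best = max(means)
    T = 20000                   # horizon used in the paper's tables

    # Lai & Robbins constant: sum over sub-optimal arms of (mu* - mu_k) / kl(mu_k, mu*)
    lr_constant = sum((best - m) / kl_bernoulli(m, best) for m in means if m < best)
    print("Lai & Robbins constant:", lr_constant)
    print("Asymptotic regret scale:", lr_constant * np.log(T))

    # Uniform random sampling ignores the observed rewards, so its regret grows linearly in T.
    pulls = rng.integers(0, len(means), size=T)
    rewards = rng.random(T) < np.array(means)[pulls]
    print("Uniform sampling regret:", best * T - rewards.sum())

An asymptotically optimal algorithm keeps the regret of the order of this constant times log(T), far below the linear regret of uniform sampling.
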
Highlights
  • A K-armed bandit problem is a sequential decision-making problem in which a learner sequentially samples from K unknown distributions called arms
  • Our first objective is to check that for a finite horizon the regret of RB-SDA (the Sub-sampling Duelling Algorithm, SDA, instance based on Random Block sampling) is comparable with the regret of Thompson Sampling, which efficiently uses the knowledge of the distribution
  • Results reported in Tables 4 and 5 show that RB-SDA and WR-SDA (the instance based on sampling Without Replacement) are strong competitors to Thompson Sampling (TS) and IMED for both Bernoulli and Gaussian bandits
  • The cost of sub-sampling varies across algorithms: in the general case RB-SDA is more efficient than WR-SDA, as the latter requires drawing a random subset while the former only needs to draw the random integer that starts the block (see the sketch after this list)
  • We proved that one particular instance, RB-SDA, combines both optimal theoretical guarantees and good empirical performance for several distributions, possibly with unbounded support
  • The empirical study presented in the paper shows the robustness of the sub-sampling approach compared with other types of re-sampling algorithms. This new approach to exploration may be generalized in many directions, for example to contextual bandits or reinforcement learning, where Upper Confidence Bounds (UCB) and Thompson Sampling are still the dominant approaches
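
The cost difference between the two sub-sampling schemes can be illustrated with the small Python sketch below; the helper names rb_subsample and wr_subsample, the history length and the sub-sample size are illustrative assumptions rather than the authors' implementation, and edge cases (for example a block that would overrun the history) are ignored.

    import numpy as np

    rng = np.random.default_rng(1)
    leader_history = rng.normal(size=1000)   # illustrative reward history of the leader
    n = 50                                   # number of samples held by the challenger

    def rb_subsample(history, n, rng):
        # Random Block (RB): draw a single random integer and take a contiguous block of size n.
        start = rng.integers(0, len(history) - n + 1)
        return history[start:start + n]

    def wr_subsample(history, n, rng):
        # Without Replacement (WR): draw a full random subset of size n from the history.
        idx = rng.choice(len(history), size=n, replace=False)
        return history[idx]

    # In a duel, the challenger's empirical mean is compared with the mean of the leader's sub-sample.
    print("RB sub-sample mean:", rb_subsample(leader_history, n, rng).mean())
    print("WR sub-sample mean:", wr_subsample(leader_history, n, rng).mean())

The RB scheme only needs one random integer per duel, whereas the WR scheme draws an index set whose size grows with the challenger's history, which is what makes RB-SDA cheaper in the general case.
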
Methods
  • The authors perform experiments on simulated data in order to illustrate the good performance of the four instances of SDA algorithms introduced in Section 2 for various distributions (a simplified sketch of one SDA round is given after this list).
  • Exponential families: first, in order to illustrate Corollary 3.1.1, the authors investigate the performance of RB-SDA for both Bernoulli and Gaussian distributions.
  • For Bernoulli and Gaussian distributions, Non-Parametric TS coincides with Thompson Sampling, so the authors focus the study on algorithms based on history perturbation.
  • The authors experiment with PHE [14] for Bernoulli bandits and ReBoot [17] for Gaussian bandits, as those two algorithms are guaranteed to have logarithmic regret in each of these settings.
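
For context on what an SDA instance computes, here is a minimal Python sketch of one duel round in the spirit of the SDA framework: the leader is taken to be the most-sampled arm, and a challenger is selected when its empirical mean is at least the mean of an equal-sized sub-sample of the leader's history. Tie-breaking rules, forced exploration and other details of the authors' algorithms are deliberately omitted, so this is a simplified illustration rather than the exact procedure of the paper.

    import numpy as np

    def sda_round(histories, subsample, rng):
        # One simplified SDA round: duel every challenger against the leader.
        # histories: list of 1-d arrays of observed rewards, one per arm.
        # subsample: function (history, n, rng) -> array of n values (e.g. an RB or WR scheme).
        # Returns the list of arms to pull next (the leader alone if no challenger wins).
        leader = int(np.argmax([len(h) for h in histories]))   # most-sampled arm (ties ignored)
        winners = []
        for k, hist in enumerate(histories):
            if k == leader:
                continue
            sub = subsample(histories[leader], len(hist), rng)
            if hist.mean() >= sub.mean():        # the challenger wins its duel
                winners.append(k)
        return winners if winners else [leader]

Plugging a Random Block or Without Replacement sub-sampler (such as the rb_subsample and wr_subsample helpers sketched above) into the subsample argument gives a rough analogue of the RB-SDA and WR-SDA instances discussed in the paper.
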
Results
  • Results reported in Tables 4 and 5 show that RB-SDA and WR-SDA are strong competitors to TS and IMED for both Bernoulli and Gaussian bandits.
  • The cost of sub-sampling varies across algorithms: in the general case RB-SDA is more efficient than WR-SDA, as the latter requires drawing a random subset while the former only needs to draw the random integer that starts the block.
  • The computational cost of these two algorithms is difficult to evaluate precisely: they can be made very efficient when the leader does not change, but each change of leader is costly, in particular for SSMC.
  • Non-Parametric TS performs well for Truncated Gaussian arms, but the cost of drawing a random probability vector over a large history is very high (an illustrative sketch follows this list).
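
The remark about Non-Parametric TS can be made concrete: its index requires a random probability vector whose length grows with the arm's history. The sketch below is an illustrative rendering in the spirit of Riou and Honda's Non-Parametric TS for bounded rewards (cited in the references); the exact variant evaluated in the paper may differ, and the function name and upper-bound augmentation are assumptions of this sketch.

    import numpy as np

    def np_ts_index(history, upper_bound, rng):
        # Illustrative Non-Parametric TS-style index: a random re-weighting of the observed
        # rewards, augmented with the upper bound of the support. The Dirichlet weight vector
        # has one entry per past observation, so its cost grows with the size of the history.
        values = np.append(history, upper_bound)
        weights = rng.dirichlet(np.ones(len(values)))
        return float(weights @ values)

    rng = np.random.default_rng(2)
    history = rng.random(5000)               # illustrative large history of rewards in [0, 1]
    print("Index:", np_ts_index(history, 1.0, rng))
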
Conclusion
  • The authors introduced the SDA framework for exploration in bandit models, and proved that one particular instance, RB-SDA, combines optimal theoretical guarantees and good empirical performance for several distributions, possibly with unbounded support.
  • The empirical study presented in the paper shows the robustness of the sub-sampling approach compared with other types of re-sampling algorithms
  • This new approach to exploration may be generalized in many directions, for example to contextual bandits or reinforcement learning, where UCB and Thompson Sampling are still the dominant approaches.
  • (K − 1) P(Xcr,cr,j < βr,j) + Σ_{k=2}^{K} αk(βr,j, j), as cr ≤ mr, which proves Lemma 4.3
  • This definition allows one to analyze separately the properties of the sub-sampling algorithms and the properties of the distribution family for randomized samplers.
  • The authors aim at upper bounding the probability of
Objectives
  • The authors' objective is to bound T·P(HkT) and log(T)·P(GkT) by constants.
Tables
  • Table1: Regret at T = 20000 for Bernoulli arms
  • Table2: Regret at T = 20000 for Gaussian arms
  • Table3: Regret at T = 20000 for Truncated Gaussian arms
  • Table4: Average Regret on 10000 random experiments with Bernoulli Arms
  • Table5: Average Regret on 10000 random experiments with Gaussian Arms
  • Table6: Regret at T = 20000 for Bernoulli arms, with standard deviation
  • Table7: Regret at T = 20000 for Gaussian arms, with standard deviation
  • Table8: Table 8
  • Table9: Average Regret with Exponential Arms (with std)
  • Table10: Quantile
  • Table11: Average Regret with Exponential Arms: SDA with forced exploration xp RB
Funding
  • The PhD of Dorian Baudry is funded by a CNRS80 grant
  • The authors acknowledge the funding of the French National Research Agency under projects BADASS (ANR-16-CE40-0002) and BOLD (ANR-19-CE23-0026-04)
  • Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr)
References
  • Akram Baransi, Odalric-Ambrym Maillard, and Shie Mannor. Sub-sampling for multi-armed bandits. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014. Proceedings, Part I, 2014.
  • Hock Peng Chan. The multi-armed bandit problem: An efficient nonparametric solution. The Annals of Statistics, 48(1), Feb 2020.
  • Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 2019.
  • R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4), 1995.
  • Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3), 2002.
  • Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3), Jun 2013.
  • William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4), 12 1933.
  • Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012 - The 25th Annual Conference on Learning Theory, 2012.
  • T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 1985.
  • Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory - 23rd International Conference, ALT 2012, 2012.
  • Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, 2013.
  • N. Korda, E. Kaufmann, and R. Munos. Thompson Sampling for 1-dimensional Exponential family bandits. In Advances in Neural Information Processing Systems (NIPS), 2013.
  • Branislav Kveton, Csaba Szepesvari, Zheng Wen, Mohammad Ghavamzadeh, and Tor Lattimore. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In ICML, 2019.
  • Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic multi-armed bandits. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, 2019.
  • Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.
  • Ian Osband and Benjamin Van Roy. Bootstrapped Thompson sampling and deep exploration. CoRR, abs/1507.00300, 2015.
  • Chi-Hua Wang, Yang Yu, Botao Hao, and Guang Cheng. Residual bootstrap exploration for bandit algorithms. CoRR, abs/2002.08436, 2020.
  • Charles Riou and Junya Honda. Bandit algorithms based on thompson sampling for bounded reward distributions. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.
  • A.N Burnetas and M. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2), 1996.
  • Michael Drmota and Robert Tichy. Sequences, discrepancies and applications, volume 1651 of Lecture Notes in Mathematics. Springer Verlag, Deutschland, 1 edition, 1997.
  • J. H. Halton. Algorithm 247: Radical-inverse quasi-random point sequence. Commun. ACM, 7(12), December 1964.
  • I.M Sobol’. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4), 1967.
  • Junya Honda and Akimichi Takemura. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. Journal of Machine Learning Research, 16(113), 2015.
  • Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In 24th Annual Conference on Learning Theory (COLT), 2011.
  • Alex Mendelson, Maria Zuluaga, Brian Hutton, and Sébastien Ourselin. What is the distribution of the number of unique original items in a bootstrap sample? CoRR, abs/1602.05822, 2016.
  • B.C. Rennie and A.J. Dobson. On Stirling numbers of the second kind. Journal of Combinatorial Theory, 7(2), 1969.
  • Anthony W. Ledford and Jonathan A. Tawn. Modelling dependence within joint tail regions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(2), May 1997.
Author
Dorian Baudry
Emilie Kaufmann
Odalric-Ambrym Maillard