# Sub-sampling for Efficient Non-Parametric Bandit Exploration

NeurIPS 2020


Abstract

In this paper we propose the first multi-armed bandit algorithm based on re-sampling that achieves asymptotically optimal regret simultaneously for different families of arms (namely Bernoulli, Gaussian and Poisson distributions). Unlike Thompson Sampling, which requires specifying a different prior to be optimal in each case, our propos…

Introduction

- A K-armed bandit problem is a sequential decision-making problem in which a learner sequentially samples from K unknown distributions called arms. At each round t, the learner selects an arm At ∈ {1, …, K} and obtains a random reward Xt drawn from the distribution of the chosen arm, which has mean μAt. The learner adjusts her sequential sampling strategy A = (At)t∈N in order to maximize the expected sum of rewards obtained after T selections.
- TS is a randomized Bayesian algorithm that selects arms according to their posterior probability of being optimal
- These algorithms enjoy logarithmic regret under some assumptions on the arms, and some of them are even asymptotically optimal in that they attain the smallest possible asymptotic regret given by the lower bound of Lai & Robbins [9], for some parametric families of distributions.
- For distributions that are continuously parameterized by their means, this lower bound states that under any uniformly efficient algorithm, lim inf_{T→∞} R_T(A)/log(T) ≥ Σ_{k: μk < μ⋆} (μ⋆ − μk)/kl(μk, μ⋆), where μ⋆ is the largest mean and kl(μ, μ′) denotes the Kullback–Leibler divergence between the arm distributions with means μ and μ′.
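
To make the setting concrete, here is a minimal sketch of Thompson Sampling on Bernoulli arms with Beta(1, 1) priors, the randomized Bayesian strategy described above. This is a toy illustration, not the paper's code; the pseudo-regret bookkeeping and all names are assumptions made for the demonstration.

```python
import random

def thompson_sampling_bernoulli(means, horizon, seed=0):
    """Thompson Sampling with Beta(1, 1) priors on Bernoulli arms.

    Returns the pseudo-regret: the sum over rounds of mu_star - mu_{A_t}."""
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k  # posterior for arm a is Beta(1 + successes[a],
    failures = [0] * k   #                            1 + failures[a])
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample from each posterior; play the arm with the
        # largest sample (its posterior probability of being optimal).
        samples = [rng.betavariate(1 + successes[a], 1 + failures[a])
                   for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best - means[arm]
    return regret

regret = thompson_sampling_bernoulli([0.1, 0.5, 0.6], horizon=5000)
```

Over 5000 rounds the pseudo-regret stays far below that of uniform play, consistent with the logarithmic regret guarantees discussed above.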

Highlights

- A K-armed bandit problem is a sequential decision-making problem in which a learner sequentially samples from K unknown distributions called arms
- Our first objective is to check that, for a finite horizon, the regret of RB-SDA (Sub-sampling Duelling Algorithms with Random Block sampling) is comparable with the regret of Thompson Sampling, which efficiently uses knowledge of the distributions
- Results reported in Tables 4 and 5 show that RB-SDA and WR-SDA (Sub-sampling Duelling Algorithms with sampling Without Replacement) are strong competitors to Thompson Sampling (TS) and IMED for both Bernoulli and Gaussian bandits
- The cost of sub-sampling varies across algorithms: in the general case RB-SDA is more efficient than WR-SDA, as the latter requires drawing a random subset while the former only needs to draw the random integer that starts the block
- We proved that one particular instance, RB-SDA, combines both optimal theoretical guarantees and good empirical performance for several distributions, possibly with unbounded support
- The empirical study presented in the paper shows the robustness of the sub-sampling approach over other types of re-sampling algorithms. This new approach to exploration may be generalized in many directions, for example to contextual bandits or reinforcement learning, where Upper Confidence Bounds (UCB) and Thompson Sampling are still the dominant approaches
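
The cost comparison in the highlights above can be sketched directly: a random-block sampler needs a single random integer, while a without-replacement sampler must draw an m-subset. The function names and the wrap-around convention for blocks are illustrative assumptions, not the authors' exact definitions.

```python
import random

def rb_subsample(history, m, rng):
    """Random Block: draw one start index and take m consecutive
    observations (wrapping around), so only one random integer is needed."""
    n = len(history)
    start = rng.randrange(n)
    return [history[(start + i) % n] for i in range(m)]

def wr_subsample(history, m, rng):
    """Without Replacement: draw a random m-subset of the history,
    which requires m random draws (here via rng.sample)."""
    return rng.sample(history, m)

rng = random.Random(42)
hist = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8]
block = rb_subsample(hist, 3, rng)   # 3 consecutive values from hist
subset = wr_subsample(hist, 3, rng)  # 3 distinct values from hist
```

Both return m observations from the history, but the single `randrange` call in `rb_subsample` versus the m draws behind `rng.sample` is exactly the efficiency gap the bullet describes.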

Methods

- The authors perform experiments on simulated data in order to illustrate the good performance of the four instances of SDA algorithms introduced in Section 2 for various distributions.
- Exponential families. First, in order to illustrate Corollary 3.1.1, the authors investigate the performance of RB-SDA for both Bernoulli and Gaussian distributions.
- For Bernoulli and Gaussian distributions, Non-Parametric TS coincides with Thompson Sampling, so the authors focus the study on algorithms based on history perturbation.
- The authors experiment with PHE [14] for Bernoulli bandits and ReBoot [17] for Gaussian bandits, as those two algorithms are guaranteed to have logarithmic regret in each of these settings.
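
The "average regret over random experiments" methodology behind the tables can be sketched as a Monte Carlo loop. The bandit algorithm below is a placeholder epsilon-greedy standing in for the SDA and TS variants actually compared in the paper; all names and parameters are hypothetical.

```python
import random
import statistics

def eps_greedy(means, horizon, seed, eps=0.1):
    """Placeholder bandit algorithm (epsilon-greedy), returning pseudo-regret.

    NOT one of the paper's algorithms; it only stands in for the interface
    'run once on a problem instance, report the regret'."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(means)
    regret = 0.0
    for t in range(horizon):
        if t < k:
            arm = t                     # initialization: play each arm once
        elif rng.random() < eps:
            arm = rng.randrange(k)      # explore uniformly
        else:
            arm = max(range(k), key=lambda a: sums[a] / counts[a])  # exploit
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret

def average_regret(algorithm, means, horizon, n_runs):
    """Average pseudo-regret over independent replications, mirroring the
    averaged-regret tables reported in the paper."""
    return statistics.mean(algorithm(means, horizon, seed=s)
                           for s in range(n_runs))

avg = average_regret(eps_greedy, [0.5, 0.6], horizon=2000, n_runs=20)
```

Swapping `eps_greedy` for any algorithm with the same signature reproduces one cell of such a table for one problem instance.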

Results

- Results reported in Tables 4 and 5 show that RB-SDA and WR-SDA are strong competitors to TS and IMED for both Bernoulli and Gaussian bandits.
- The cost of sub-sampling varies across algorithms: in the general case RB-SDA is more efficient than WR-SDA, as the latter requires drawing a random subset while the former only needs to draw the random integer that starts the block.
- The computational cost of these two algorithms is difficult to evaluate precisely
- They can be made very efficient when the leader does not change, but each change of leader is costly, in particular for SSMC.
- Non-Parametric TS performs well for Truncated Gaussian arms, but the cost of drawing a random probability vector over a large history is very high
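
One round of the duelling mechanism these costs refer to can be sketched as follows: the leader (the most-sampled arm) faces every challenger, and a challenger wins its duel if its empirical mean is at least the mean of an equal-size sub-sample of the leader's history. Function names, the tie-breaking rule, and the wrap-around block convention are illustrative assumptions, not the authors' exact pseudocode.

```python
import random

def block_subsample(history, m, rng):
    """Sub-sample m consecutive observations from a random start index
    (the Random Block scheme, wrap-around as one possible convention)."""
    n = len(history)
    start = rng.randrange(n)
    return [history[(start + i) % n] for i in range(m)]

def sda_round(histories, subsample, rng):
    """One generic SDA round: every challenger duels the leader on an
    equal-size sub-sample of the leader's history.  Winning challengers
    are pulled next; if no challenger wins, the leader is pulled."""
    leader = max(range(len(histories)), key=lambda a: len(histories[a]))
    winners = set()
    for a in range(len(histories)):
        if a == leader:
            continue
        n_a = len(histories[a])
        sub = subsample(histories[leader], n_a, rng)
        if sum(histories[a]) / n_a >= sum(sub) / n_a:
            winners.add(a)  # challenger wins its duel: explore it
    return winners if winners else {leader}

rng = random.Random(1)
histories = [[1, 0, 1, 1, 0, 1], [1, 1], [0, 0]]
next_arms = sda_round(histories, block_subsample, rng)  # -> {1}
```

Plugging in a different `subsample` function yields the other SDA instances, which is why the per-round cost differences between samplers matter here.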

Conclusion

- The authors introduced the SDA framework for exploration in bandit models and proved that one particular instance, RB-SDA, combines optimal theoretical guarantees with good empirical performance for several distributions, possibly with unbounded support.
- The empirical study presented in the paper shows the robustness of the sub-sampling approach over other types of re-sampling algorithms
- This new approach to exploration may be generalized in many directions, for example to contextual bandits or reinforcement learning, where UCB and Thompson Sampling are still the dominant approaches.
- The proof of Lemma 4.3 bounds the probability of interest by (K − 1) P(X̄_{c_r, c_r, j} < β_{r,j}) + Σ_{k=2}^{K} α_k(β_{r,j}, j), using that c_r ≤ m_r.
- This definition makes it possible to analyze separately the properties of the sub-sampling algorithms and the properties of the distribution family for randomized samplers.
- The authors aim at upper bounding the probability of …

Objectives

- The authors' objective is to bound T·P(H_{k,T}) and log(T)·P(G_{k,T}) by constants.

Tables

- Table 1: Regret at T = 20000 for Bernoulli arms
- Table 2: Regret at T = 20000 for Gaussian arms
- Table 3: Regret at T = 20000 for Truncated Gaussian arms
- Table 4: Average regret on 10000 random experiments with Bernoulli arms
- Table 5: Average regret on 10000 random experiments with Gaussian arms
- Table 6: Regret at T = 20000 for Bernoulli arms, with standard deviation
- Table 7: Regret at T = 20000 for Gaussian arms, with standard deviation
- Table 8: Table 8
- Table 9: Average regret with Exponential arms (with std)
- Table 10: Quantile
- Table 11: Average regret with Exponential arms: SDA with forced exploration xp RB

Funding

- The PhD of Dorian Baudry is funded by a CNRS80 grant
- The authors acknowledge the funding of the French National Research Agency under projects BADASS (ANR-16-CE40-0002) and BOLD (ANR-19-CE23-0026-04)
- Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr)

References

- Akram Baransi, Odalric-Ambrym Maillard, and Shie Mannor. Sub-sampling for multi-armed bandits. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014. Proceedings, Part I, 2014.
- Hock Peng Chan. The multi-armed bandit problem: An efficient nonparametric solution. The Annals of Statistics, 48(1), Feb 2020.
- Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 2019.
- R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4), 1995.
- Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3), 2002.
- Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3), Jun 2013.
- William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4), 12 1933.
- Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012 - The 25th Annual Conference on Learning Theory, 2012.
- T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 1985.
- Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory - 23rd International Conference, ALT 2012, 2012.
- Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, 2013.
- N. Korda, E. Kaufmann, and R. Munos. Thompson Sampling for 1-dimensional Exponential family bandits. In Advances in Neural Information Processing Systems (NIPS), 2013.
- Branislav Kveton, Csaba Szepesvari, Zheng Wen, Mohammad Ghavamzadeh, and Tor Lattimore. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In ICML, 2019.
- Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic multi-armed bandits. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, 2019.
- Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.
- Ian Osband and Benjamin Van Roy. Bootstrapped thompson sampling and deep exploration. CoRR, abs/1507.00300, 2015.
- Chi-Hua Wang, Yang Yu, Botao Hao, and Guang Cheng. Residual bootstrap exploration for bandit algorithms. CoRR, abs/2002.08436, 2020.
- Charles Riou and Junya Honda. Bandit algorithms based on thompson sampling for bounded reward distributions. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.
- A.N Burnetas and M. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2), 1996.
- Michael Drmota and Robert Tichy. Sequences, discrepancies and applications, volume 1651 of Lecture Notes in Mathematics. Springer Verlag, Deutschland, 1 edition, 1997.
- J. H. Halton. Algorithm 247: Radical-inverse quasi-random point sequence. Commun. ACM, 7(12), December 1964.
- I.M Sobol’. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4), 1967.
- Junya Honda and Akimichi Takemura. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. Journal of Machine Learning Research, 16(113), 2015.
- Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In 24th Annual Conference on Learning Theory (COLT), 2011.
- Alex Mendelson, Maria Zuluaga, Brian Hutton, and Sébastien Ourselin. What is the distribution of the number of unique original items in a bootstrap sample? CoRR, abs/1602.05822, 2016.
- B.C. Rennie and A.J. Dobson. On stirling numbers of the second kind. Journal of Combinatorial Theory, 7(2), 1969.
- Anthony W. Ledford and Jonathan A. Tawn. Modelling dependence within joint tail regions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(2), May 1997.
