# Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback

Abstract:

We study the multi-armed bandit (MAB) problem with composite and anonymous feedback. In this model, the reward of pulling an arm spreads over a period of time (we call this period the reward interval), and the player successively receives partial rewards of the action, mixed with rewards from pulling other arms. Existing results on ...
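To make the feedback model concrete, here is a minimal simulation sketch (our own illustration, not the paper's code), assuming each pull's reward is split evenly over a reward interval of d slots and the player only ever sees the per-slot aggregate:

```python
import random

def play_game(num_arms=3, horizon=20, d=4, seed=0):
    # Composite + anonymous feedback: each pull's reward is split evenly
    # over the next d slots, and the player observes only the per-slot
    # sum of all pending reward pieces, never which arm produced them.
    rng = random.Random(seed)
    means = [0.2, 0.5, 0.8]          # assumed Bernoulli arm means
    pending = [0.0] * (horizon + d)  # aggregated reward pieces per slot
    observed = []
    for t in range(horizon):
        arm = rng.randrange(num_arms)          # any policy could go here
        reward = float(rng.random() < means[arm])
        for s in range(d):                     # spread over the interval
            pending[t + s] += reward / d
        observed.append(pending[t])            # anonymous aggregate only
    return observed

obs = play_game()
```

The key difficulty is visible in `observed`: a single slot mixes pieces from up to d different pulls, so per-arm rewards cannot be read off directly.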


Introduction

- The multi-armed bandit (MAB) model (Berry and Fristedt 1985; Sutton and Barto 1998) has found wide applications in Internet services, e.g. (Chen et al. 2018; Chapelle, Manavoglu, and Rosales 2015; Chen, Wang, and Yuan 2013; Jain and Jamieson 2018; Wang and Huang 2018), and has attracted increasing attention.
- The rewards can be independent random variables drawn from unknown distributions, known as the stochastic MAB problem (Lai and Robbins 1985), or chosen arbitrarily by the environment, known as the adversarial MAB problem (Auer et al. 2002).
- In both models, the player's goal is to maximize his expected cumulative reward during the game by choosing arms properly.
- To evaluate the player's performance, the concept of "regret", defined as the expected gap between the player's total reward and the offline optimal reward, is used as the evaluation metric.
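The regret definition can be made concrete with a small sketch (our own illustration, assuming known Bernoulli arm means; `expected_regret` is a hypothetical helper name):

```python
import random

def expected_regret(means, pulls):
    # Regret: expected gap between always playing the best arm and the
    # rewards expected from the arms the player actually pulled.
    best = max(means)
    return sum(best - means[a] for a in pulls)

rng = random.Random(1)
means = [0.2, 0.5, 0.8]
pulls = [rng.randrange(len(means)) for _ in range(300)]  # a uniform player
reg = expected_regret(means, pulls)
```

A player pulling uniformly at random accrues regret linear in the horizon (about 0.3 per slot here); a good bandit policy drives the per-slot regret to zero.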

Highlights

- With a proper round-size growth rate, our adaptive policies always possess theoretical regret guarantees in both the stochastic case and the adversarial case.
- We propose the Adaptive Round-Size UCB (ARS-UCB) algorithm, which requires zero a priori knowledge about the reward interval size.
- Conclusion on experimental results: in the above experiments, we observe that the cumulative regrets of ARS-UCB are always logarithmic in T, which is expected from our theoretical analysis.
- The results are consistent with our analysis, and show that our algorithms outperform state-of-the-art benchmarks.
- Our future research includes deriving a matching regret lower bound for the non-oblivious adversarial case
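The adaptive round-size idea behind ARS-UCB can be sketched roughly as follows. This is our own illustration, not the paper's exact algorithm: the linear round-growth schedule and the confidence width are assumed here. The point is that if each arm is played for a round of growing length, most of the reward spread over an unknown interval falls inside the round, so the round average becomes a usable estimate:

```python
import math
import random

def ars_ucb(pull, num_arms, horizon):
    # Sketch: arms are played in rounds of growing length; the
    # round-average feedback drives a standard UCB index over rounds.
    counts = [0] * num_arms      # rounds spent on each arm
    means = [0.0] * num_arms     # running round-average estimates
    t, r = 0, 0
    pulls = []
    while t < horizon:
        r += 1
        if r <= num_arms:
            arm = r - 1          # initialisation: each arm once
        else:
            arm = max(range(num_arms), key=lambda a:
                      means[a] + math.sqrt(2 * math.log(r) / counts[a]))
        size = min(r, horizon - t)   # round length grows with r (assumed rate)
        avg = sum(pull(arm) for _ in range(size)) / size
        counts[arm] += 1
        means[arm] += (avg - means[arm]) / counts[arm]
        pulls.extend([arm] * size)
        t += size
    return pulls

random.seed(2)
true_means = [0.2, 0.5, 0.8]
pulls = ars_ucb(lambda a: float(random.random() < true_means[a]), 3, 500)
```

Growing rounds trade off estimation accuracy (long rounds absorb the smeared feedback) against adaptivity (short rounds switch arms faster); choosing the growth rate is exactly what the paper's analysis pins down.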

Results

- The results are consistent with the analysis, and show that the algorithms outperform state-of-the-art benchmarks.

Conclusion

- In all the experiments, ARS-UCB significantly outperforms ODAAF, which assumes full knowledge of the delay size.
- The authors consider the MAB problem with composite and anonymous feedback, in both the stochastic and the adversarial settings.
- For the former case, the authors propose the ARS-UCB algorithm, and for the latter case, they design the ARS-EXP3 algorithm.
- How to adapt the framework and obtain tight regret upper bounds for bandit convex optimization (BCO) is another interesting future research problem.


Related work

- Stochastic MAB with delayed feedback was first studied in (Joulani, Gyorgy, and Szepesvari 2013; Agarwal and Duchi 2011; Desautels, Krause, and Burdick 2014). In (Joulani, Gyorgy, and Szepesvari 2013), the authors propose the BOLD framework, in which the player only changes his decision when a feedback arrives; decision making can then proceed as with non-delayed feedback. They show that the regret of BOLD can be upper bounded by O(N(log T + E[d])), where d is the random variable of the delay. (Manegueu et al. 2020) then explored the case where the delay in each time slot is not i.i.d. but depends on the chosen arm; in this setting, they proposed the PatientBandits policy, which achieves a near-optimal regret upper bound.
- In addition to the stochastic case, adversarial MAB with delayed feedback has also attracted attention. This model was first studied in (Weinberger and Ordentlich 2002) under the assumption of full feedback; the paper establishes a regret lower bound of Ω(√((d + 1)T log N)) for this model, where d is a constant feedback delay. The model with bandit feedback is investigated in (Neu et al. 2010, 2014), where the authors used the BOLD framework (Joulani, Gyorgy, and Szepesvari 2013) to obtain a regret upper bound of O(√((d + 1)TN)). Recently, (Zhou, Xu, and Blanchet 2019; Thune, Cesa-Bianchi, and Seldin 2019; Bistritz et al. 2019) further improved MAB with delayed feedback. Since their analytical methods rely on the non-anonymous setting, they are very different and cannot be used for our purpose.
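The BOLD idea of changing decisions only when feedback arrives can be sketched as a simple wrapper around any base bandit algorithm. This is a simplified single-instance sketch (the actual framework runs multiple base instances); `base_next`/`base_update` are hypothetical hooks, and a fixed delay d is assumed:

```python
import random
from collections import deque

def bold(base_next, base_update, horizon, env, d=3):
    # The player keeps playing the current arm until a (delayed)
    # feedback arrives, then feeds it to the base bandit algorithm
    # and queries a fresh decision.
    inbox = deque()              # (arrival_time, arm, reward), FIFO by time
    arm = base_next()
    history = []
    for t in range(horizon):
        while inbox and inbox[0][0] <= t:
            _, a, r = inbox.popleft()
            base_update(a, r)    # feedback finally observed
            arm = base_next()    # the decision changes only now
        reward = env(arm)
        inbox.append((t + d, arm, reward))
        history.append(arm)
    return history

random.seed(0)
counts = [0, 0]
sums = [0.0, 0.0]

def base_next():
    # a tiny greedy base learner: try each arm once, then follow the best mean
    if 0 in counts:
        return counts.index(0)
    return max((0, 1), key=lambda a: sums[a] / counts[a])

def base_update(a, r):
    counts[a] += 1
    sums[a] += r

hist = bold(base_next, base_update, 50,
            lambda a: float(random.random() < [0.3, 0.7][a]))
```

Note that this wrapper relies on knowing which arm each arriving feedback belongs to, which is exactly what the anonymous setting of the present paper takes away.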


References

- Agarwal, A.; and Duchi, J. C. 2011. Distributed delayed stochastic optimization. In Neural Information Processing Systems, 873–881.
- Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3): 235–256.
- Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The Non-Stochastic Multi-Armed Bandit Problem. Siam Journal on Computing 32(1): 48–77.
- Berry, D. A.; and Fristedt, B. 1985. Bandit problems: sequential allocation of experiments (Monographs on statistics and applied probability). Springer.
- Bistritz, I.; Zhou, Z.; Chen, X.; Bambos, N.; and Blanchet, J. 2019. Online exp3 learning in adversarial bandits with delayed feedback. In Neural Information Processing Systems, 11345–11354.
- Cesa-Bianchi, N.; Dekel, O.; and Shamir, O. 2013. Online learning with switching costs and other adaptive adversaries. In Neural Information Processing Systems, 1160–1168.
- Cesa-Bianchi, N.; Gentile, C.; and Mansour, Y. 2018. Nonstochastic bandits with composite anonymous feedback. In Conference On Learning Theory, 750–773.
- Chapelle, O.; Manavoglu, E.; and Rosales, R. 2015. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST) 5(4): 61.
- Chen, K.; Cai, K.; Huang, L.; and Lui, J. C. 2018. Beyond the click-through rate: web link selection with multi-level feedback. In International Joint Conference on Artificial Intelligence, 3308–3314.
- Chen, W.; Wang, Y.; and Yuan, Y. 2013. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, 151–159.
- Dekel, O.; Ding, J.; Koren, T.; and Peres, Y. 2014. Bandits with switching costs: T 2/3 regret. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, 459–467.
- Desautels, T.; Krause, A.; and Burdick, J. W. 2014. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research 15: 3873–3923.
- Garg, S.; and Akash, A. K. 2019. Stochastic bandits with delayed composite anonymous feedback. arXiv preprint arXiv:1910.01161.
- Gittins, J. 1989. Multi-armed bandit allocation indices. Wiley-Interscience series in systems and optimization.
- Hirsch, I. B.; and Brownlee, M. 2005. Should minimal blood glucose variability become the gold standard of glycemic control? Journal of Diabetes and Its Complications 19(3): 178–181.
- Jain, L.; and Jamieson, K. 2018. Firing bandits: Optimizing crowdfunding. In International Conference on Machine Learning, 2211–2219.
- Joulani, P.; Gyorgy, A.; and Szepesvari, C. 2013. Online learning under delayed feedback. In International Conference on Machine Learning, 1453–1461.
- Kaggle. 2015. Coupon Purchase Prediction data. https://www.kaggle.com/c/coupon-purchase-prediction.
- Kaggle. Outbrain Click Prediction data. https://www.kaggle.com/c/outbrain-click-prediction.
- Lai, T. L.; and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1): 4–22.
- Lu, T.; Pal, D.; and Pal, M. 2010. Contextual multi-armed bandits. In International conference on Artificial Intelligence and Statistics, 485–492.
- Manegueu, A. G.; Vernade, C.; Carpentier, A.; and Valko, M. 2020. Stochastic bandits with arm-dependent delays. In International Conference on Machine Learning.
- Neu, G.; Antos, A.; Gyorgy, A.; and Szepesvari, C. 2010. Online Markov decision processes under bandit feedback. In Neural Information Processing Systems, 1804–1812.
- Neu, G.; Gyorgy, A.; Szepesvari, C.; and Antos, A. 2014. Online Markov Decision Processes Under Bandit Feedback. IEEE Transactions on Automatic Control 59(3): 676–691.
- Pike-Burke, C.; Agrawal, S.; Szepesvari, C.; and Grunewalder, S. 2018. Bandits with Delayed, Aggregated Anonymous Feedback. In International Conference on Machine Learning, 4105–4113.
- Slivkins, A. 2014. Contextual bandits with similarity information. The Journal of Machine Learning Research 15(1): 2533–2568.
- Sutton, R. S.; and Barto, A. G. 1998. Reinforcement learning: An introduction, volume 1. MIT press Cambridge.
- Thune, T. S.; Cesa-Bianchi, N.; and Seldin, Y. 2019. Nonstochastic Multi-armed Bandits with Unrestricted Delays. In Neural Information Processing Systems.
- Vernade, C.; Cappe, O.; and Perchet, V. 2017. Stochastic Bandit Models for Delayed Conversions. In Conference on Uncertainty in Artificial Intelligence.
- Wang, S.; and Huang, L. 2018. Multi-armed bandits with compensation. In Neural Information Processing Systems, 5114–5122.
- Weinberger, M. J.; and Ordentlich, E. 2002. On delayed prediction of individual sequences. IEEE Transactions on Information Theory 48(7): 1959–1976.
- Zhou, Z.; Xu, R.; and Blanchet, J. 2019. Learning in generalized linear contextual bandits with stochastic delays. In Neural Information Processing Systems, 5198–5209.
