# Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs

IJCAI 2020, pp. 2936-2942, 2020.

Abstract:

In this paper, we study the problem of stochastic linear bandits with finite action sets. Most existing work assumes the payoffs are bounded or sub-Gaussian, which may be violated in some scenarios such as financial markets. To settle this issue, we analyze the linear bandits with heavy-tailed payoffs, where the payoffs admit finite ...

Introduction

- Bandit online learning is a powerful framework for modeling various important decision-making scenarios with applications ranging from medical trials to advertisement placement to network routing [Bubeck and Cesa-Bianchi, 2012].
- Side information is often available, such as page features in advertisement placement [Abe et al, 2003a], which could guide the decision-making process.
- To address this issue, various algorithms have been developed to exploit the contexts, based on different structures of the payoff functions such as Lipschitz [Kleinberg et al, 2008; Bubeck et al, 2011] or convex [Agarwal et al, 2013; Bubeck et al, 2015].
- He/she selects an arm $a_t$ and receives payoff $r_{t,a_t}$, such that

Highlights

- In the basic stochastic multi-arm bandits (MAB) [Robbins, 1952], a learner repeatedly selects one from K arms to play, and observes a payoff drawn from a fixed but unknown distribution associated with the chosen arm
- Stochastic linear bandits (SLB) have received significant research interest [Auer, 2002; Chu et al, 2011]; in this setting, the expected payoff at each round is assumed to be a linear combination of features in the context vector
- We demonstrate a lower bound for the multi-armed bandits (MAB) with heavy-tailed payoffs first, and extend it to stochastic linear bandits (SLB) by a proper design of the contextual feature vectors $x_{t,a}$ and coefficient vector $\theta_*$
- We demonstrate a lower bound for the multi-arm bandits with heavy-tailed payoffs first, where we adopt the techniques proposed by Auer et al [2002]
- We develop two novel algorithms to settle the heavy-tailed issue in linear contextual bandit with finite arms

Methods

- The authors conduct experiments to evaluate the proposed algorithms.
- The authors adopt MoM and CRT of Medina and Yang [2016], MENU and TOFU of Shao et al [2018] as baselines for comparison.
- Each element of the vector $x_{t,a}$ is sampled from the uniform distribution on $[0, 1]$, and the vector is normalized to a unit vector.
- According to the linear bandit model, the observed payoff is $r_{t,a} = x_{t,a}^\top \theta_* + \eta_t$, where $\eta_t$ is generated from the following two noises
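The synthetic setup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' experiment code: the function names, the sizes `dim` and `num_arms`, and the choice of `theta_star` are hypothetical, and Student's t noise is used only as a stand-in because this excerpt does not list the two noise distributions.

```python
import numpy as np

def make_contexts(rng, num_arms, dim):
    """Sample each entry from Uniform[0, 1], then scale every row to unit 2-norm."""
    x = rng.uniform(0.0, 1.0, size=(num_arms, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def observe_payoff(rng, x_arm, theta_star, df=3.0):
    """Return x^T theta_* + eta for one arm.

    ASSUMPTION: eta is zero-mean Student's t noise with `df` degrees of freedom.
    It is a heavy-tailed placeholder, not the noise used in the paper.
    """
    eta = rng.standard_t(df)
    return float(x_arm @ theta_star) + eta

rng = np.random.default_rng(0)
dim, num_arms = 10, 20                      # hypothetical problem sizes
theta_star = np.ones(dim) / np.sqrt(dim)    # hypothetical coefficient vector
contexts = make_contexts(rng, num_arms, dim)
payoff = observe_payoff(rng, contexts[0], theta_star)
```

Each call to `make_contexts` would correspond to one round's set of feature vectors, and `observe_payoff` to the feedback for the arm that was played.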

Conclusion

**Conclusion and Future Work**

In this paper, the authors develop two novel algorithms to settle the heavy-tailed issue in linear contextual bandits with finite arms. The authors' algorithms only require the existence of bounded $1+\epsilon$ moments of payoffs, and achieve a regret bound which is tighter than that of Shao et al [2018] by an $O(\sqrt{d})$ factor for finite action sets.

- The authors provide a lower bound on the order of $\Omega(d^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}})$.
- The proposed algorithms have been evaluated in numerical experiments, and the empirical results demonstrate their effectiveness in addressing the heavy-tailed problem.
- The authors will investigate closing the gap between the upper bound and the lower bound with respect to the dimension $d$.

Related work

- In this section, we briefly review the related work on bandit learning. The $p$-norm of a vector $x \in \mathbb{R}^d$ is $\|x\|_p = (|x_1|^p + \dots + |x_d|^p)^{1/p}$, and the 2-norm is denoted as $\|\cdot\|$.

2.1 Bandit Learning with Bounded/Sub-Gaussian Payoffs

The celebrated work of Lai and Robbins [1985] derived a lower bound of $\Omega(K \log T)$ for stochastic MAB, and proposed an algorithm which achieves the lower bound asymptotically by making use of upper confidence bound (UCB) policies. Auer [2002] studied the problem of stochastic linear bandits, and developed a basic algorithm named LinRel to solve this problem. However, he failed to provide a sublinear regret bound for LinRel, since the analysis of the algorithm requires all payoffs observed so far to be independent random variables, which may be violated. To resolve this problem, he turned LinRel into a subroutine which assumes independence among the payoffs, and then constructed a master algorithm named SupLinRel to ensure the independence. Theoretical analysis demonstrates that SupLinRel enjoys an $O(\sqrt{dT})$ regret bound, assuming the number of arms is finite. Chu et al [2011] modified LinRel and SupLinRel slightly to BaseLinUCB and SupLinUCB, which enjoy a similar regret bound but lower computational cost and easier theoretical analysis. They also provided an $\Omega(\sqrt{dT})$ lower bound for SLB. Dani et al [2008] considered the setting where the arm set is infinite, and proposed an algorithm named ConfidenceBall2 which enjoys a regret bound of $O(d\sqrt{T})$. Later, Abbasi-yadkori et al [2011] provided a new analysis of ConfidenceBall2, and improved the worst-case bound by a logarithmic factor.
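The upper confidence bound idea for linear payoffs can be illustrated with a short sketch. It is a generic ridge-regression estimate plus an exploration bonus, in the spirit of LinUCB-style base algorithms; it is not SupLinRel or SupLinUCB, and it ignores the independence issue discussed above. The function names and the exploration weight `alpha` are assumptions made for this example.

```python
import numpy as np

def ucb_scores(contexts, A, b, alpha):
    """Upper confidence bound for each arm from a ridge-regression estimate."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                    # regularized least-squares estimate
    means = contexts @ theta_hat             # estimated payoffs x^T theta_hat
    # Confidence width alpha * sqrt(x^T A^{-1} x), computed row by row.
    widths = alpha * np.sqrt(np.einsum("ad,de,ae->a", contexts, A_inv, contexts))
    return means + widths

def update(A, b, x, r):
    """Fold one observed (context, payoff) pair into the running statistics."""
    return A + np.outer(x, x), b + r * x

dim = 3
A, b = np.eye(dim), np.zeros(dim)            # identity regularizer, no data yet
contexts = np.eye(dim)                       # toy arms: the standard basis vectors
scores = ucb_scores(contexts, A, b, alpha=1.0)   # all arms tie at 1.0 before any data
arm = int(np.argmax(scores))
A, b = update(A, b, contexts[arm], r=0.5)
```

After the update, the played arm's confidence width shrinks, so an unplayed arm receives the highest score in the next round. With heavy-tailed payoffs, the plain least-squares estimate used here is exactly the part that robust estimators are meant to replace.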

Reference

- [Abbasi-yadkori et al., 2011] Yasin Abbasi-yadkori, David Pal, and Csaba Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pages 2312–2320. 2011.
- [Abe et al., 2003a] Naoki Abe, Alan W Biermann, and Philip M Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
- [Abe et al., 2003b] Naoki Abe, Alan W. Biermann, and Philip M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
- [Agarwal et al., 2013] A. Agarwal, D. Foster, D. Hsu, S. Kakade, and A. Rakhlin. Stochastic convex optimization with bandit feedback. SIAM Journal on Optimization, 23(1):213–240, 2013.
- [Audibert and Catoni, 2011] Jean-Yves Audibert and Olivier Catoni. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.
- [Auer et al., 2002] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
- [Auer, 2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
- [Azuma, 1967] Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Math. J. (2), 19(3):357–367, 1967.
- [Brownlees et al., 2015] Christian Brownlees, Emilien Joly, and Gabor Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.
- [Bubeck and Cesa-Bianchi, 2012] Sebastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- [Bubeck et al., 2011] Sebastien Bubeck, Gilles Stoltz, and Jia Yuan Yu. Lipschitz bandits without the Lipschitz constant. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, pages 144–158, 2011.
- [Bubeck et al., 2013] S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.
- [Bubeck et al., 2015] Sebastien Bubeck, Ofer Dekel, Tomer Koren, and Yuval Peres. Bandit convex optimization: $\sqrt{T}$ regret in one dimension. In Proceedings of The 28th Conference on Learning Theory, volume 40, pages 266–278, 2015.
- [Catoni, 2012] Olivier Catoni. Challenging the empirical mean and empirical variance: A deviation study. Annales de l’I.H.P. Probabilites et statistiques, 48(4):1148–1185, 2012.
- [Chu et al., 2011] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
- [Cont and Bouchaud, 2000] Rama Cont and Jean-Philippe Bouchaud. Herd behavior and aggregate fluctuations in financial markets. Macroeconomic Dynamics, 4(02):170–196, 2000.
- [Cover and Thomas, 2006] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006.
- [Dani et al., 2008] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pages 355–366, 2008.
- [Foss et al., 2013] Sergey Foss, Dmitry Korshunov, and Stan Zachary. An Introduction to Heavy-Tailed and Subexponential Distributions. Springer, 2013.
- [Golub and Van Loan, 1996] Gene H. Golub and Charles F. Van Loan. Matrix computations, 3rd Edition. Johns Hopkins University Press, 1996.
- [Hsu and Sabato, 2016] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18):1–40, 2016.
- [Kleinberg et al., 2008] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 681–690, 2008.
- [Lai and Robbins, 1985] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- [Lu et al., 2019] Shiyin Lu, Guanghui Wang, Yao Hu, and Lijun Zhang. Optimal algorithms for Lipschitz bandits with heavy-tailed rewards. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 4154–4163, 2019.
- [Medina and Yang, 2016] Andres Munoz Medina and Scott Yang. No-regret algorithms for heavy-tailed linear bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1642–1650, 2016.
- [Robbins, 1952] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
- [Roberts et al., 2015] James A Roberts, Tjeerd W Boonstra, and Michael Breakspear. The heavy tail of the human brain. Current Opinion in Neurobiology, 31:164–172, 2015.
- [Seldin et al., 2011] Yevgeny Seldin, Francois Laviolette, Nicolo Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. CoRR, 2011.
- [Shao et al., 2018] Han Shao, Xiaotian Yu, Irwin King, and Michael R. Lyu. Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs. In Advances in Neural Information Processing Systems 31, pages 8430–8439, 2018.
- [Zhang and Zhou, 2018] Lijun Zhang and Zhi-Hua Zhou. $\ell_1$-regression with heavy-tailed distributions. In Advances in Neural Information Processing Systems 31, pages 1084–1094, 2018.
- [Zhang et al., 2016] Lijun Zhang, Tianbao Yang, Rong Jin, Yichi Xiao, and Zhi-Hua Zhou. Online stochastic linear optimization under one-bit feedback. In Proceedings of the 33rd International Conference on Machine Learning, pages 392–401, 2016.
