# Structured Linear Contextual Bandits: A Sharp and Geometric Smoothed Analysis

ICML, pp. 9026-9035, 2020.

Abstract:

Bandit learning algorithms typically involve the balance of exploration and exploitation. However, in many practical applications, worst-case scenarios needing systematic exploration are seldom encountered. In this work, we consider a smoothed setting for structured linear contextual bandits where the adversarial contexts are perturbed ...

Introduction

- Contextual bandits [22] are a powerful framework for sequential decision-making, with many applications to clinical trials, web search, and content optimization.
- The goal of the algorithm is to select arms to maximize rewards over time, observing only the available contexts and the reward associated with the selected context in each round.
- Such algorithms typically need to balance exploration, making potentially sub-optimal decisions for the sake of information acquisition, and exploitation, selecting decisions that are optimal based on the estimate of θ∗.
- Greedy algorithms, which myopically select the context maximizing the reward under the current parameter estimate θ̂, i.e., choose x_{t,i_t} = argmax_i ⟨x_{t,i}, θ̂⟩, are known to be sub-optimal in the worst case.
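The greedy rule above, picking the context that maximizes reward under the current estimate, can be sketched as a short simulation. The ridge-regularized least-squares update and all constants below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, T, sigma = 5, 10, 2000, 0.5      # dimension, arms, rounds, perturbation scale
theta_star = rng.normal(size=p)
theta_star /= np.linalg.norm(theta_star)

# Ridge-regularized least-squares state (hypothetical choice of estimator).
A = np.eye(p)          # Gram matrix plus lambda * I, with lambda = 1
b = np.zeros(p)

regret = 0.0
for t in range(T):
    # Smoothed contexts: bounded (possibly adversarial) means plus Gaussian noise.
    mu = rng.normal(size=(k, p))
    mu /= np.maximum(np.linalg.norm(mu, axis=1, keepdims=True), 1.0)
    X = mu + sigma * rng.normal(size=(k, p))

    theta_hat = np.linalg.solve(A, b)        # current parameter estimate
    i = int(np.argmax(X @ theta_hat))        # greedy: no explicit exploration
    z = X[i]
    y = z @ theta_star + 0.1 * rng.normal()  # noisy linear reward

    A += np.outer(z, z)                      # least-squares update
    b += y * z
    regret += (X @ theta_star).max() - z @ theta_star

print(round(regret / T, 4))                  # average per-round regret
```

Because the Gaussian perturbations already diversify the selected contexts, the estimate converges without any forced exploration, which is the phenomenon the smoothed analysis formalizes.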

Highlights

- The work of [21, 27] provides a smoothed analysis of the greedy algorithm under the following setting: in each round the contexts x_{t,i}, 1 ≤ i ≤ k, are of the form μ_{t,i} + g_{t,i}, where the μ_{t,i} ∈ R^p are possibly selected adversarially subject to ‖μ_{t,i}‖₂ ≤ 1, and the g_{t,i} ∼ N(0, σ²I_{p×p}) are random Gaussian perturbations independent of the μ_{t,i}.
- The answer is in the result of Lemma 3, where we show that even in the adversarial setting the minimum eigenvalue of the covariance matrix of each row of the design matrix is no worse than in the completely stochastic Gaussian setting.
- While previous work has found it difficult to extend exploration strategies to the structured setting while simultaneously exploiting the structure in the parameter, our analysis shows that a simple greedy algorithm achieves sublinear regret under the smoothed bandits framework.
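The claim attributed to Lemma 3 above, that Gaussian perturbations keep the covariance of the greedily selected contexts well conditioned even for adversarially chosen means, can be checked numerically. The rank-one choice of means below is a hypothetical worst case for illustration, not an example from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, sigma, n = 4, 8, 0.3, 20000

# Adversarial means: all k means identical (a degenerate, rank-one choice).
mu = np.tile(rng.normal(size=p) / np.sqrt(p), (k, 1))
theta_hat = rng.normal(size=p)           # arbitrary current estimate

# Smoothed contexts for n independent rounds, with greedy selection in each.
X = mu[None, :, :] + sigma * rng.normal(size=(n, k, p))
picks = np.argmax(X @ theta_hat, axis=1)
selected = X[np.arange(n), picks]

# Minimum eigenvalue of the empirical covariance of the selected contexts.
lam_min = np.linalg.eigvalsh(np.cov(selected, rowvar=False)).min()
print(lam_min)
```

Even though the means span only one direction, the perturbations keep every eigenvalue of the selected-context covariance bounded away from zero, on the order of σ².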

Results

- The authors' analysis significantly improves on the bounds obtained in [21].

Conclusion

- The authors analyzed the structured linear contextual bandit problem under the smoothed analysis framework.
- The authors' analysis significantly improves on the bounds obtained in [21].
- While previous work has found it difficult to extend exploration strategies to the structured setting while simultaneously exploiting the structure in the parameter, the analysis shows that a simple greedy algorithm achieves sublinear regret under the smoothed bandits framework.


Funding

- The research was supported by NSF grants OAC-1934634, IIS-1908104, IIS-1563950, IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, FAI-1939606, a Google Faculty Research Award, a J.P. Morgan Faculty Award, and a Mozilla research grant.

References

- Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online Least Squares Estimation with Self-Normalized Processes: An Application to Bandit Problems. In Conference on Learning Theory (COLT), 2011.
- Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
- Shipra Agrawal and Navin Goyal. Thompson Sampling for Contextual Bandits with Linear Payoffs. In International Conference on Machine Learning (ICML), 2013.
- Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse Prediction with the k-Support Norm. In Neural Information Processing Systems (NIPS), 2012.
- Arindam Banerjee, Sheng Chen, Farideh Fazayeli, and Vidyashankar Sivakumar. Estimation with Norm Regularization. In Neural Information Processing Systems (NIPS), 2014.
- Arindam Banerjee, Qilong Gu, Vidyashankar Sivakumar, and Zhiwei Steven Wu. Random quadratic forms with dependence: Applications to restricted isometry and beyond. In Advances in Neural Information Processing Systems (NIPS), 2019.
- Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. CoRR arXiv:1704.09011, 2018. Working paper.
- Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
- Alberto Bietti, Alekh Agarwal, and John Langford. Practical evaluation and optimization of contextual bandit algorithms. CoRR arXiv:1802.04064, 2018.
- Sarah Bird, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. Exploring or exploiting? Social and ethical implications of autonomous experimentation. In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.
- Emmanuel J. Candes and Benjamin Recht. Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
- Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The Convex Geometry of Linear Inverse Problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
- Sheng Chen and Arindam Banerjee. Structured Estimation with Atomic Norms: General Bounds and Applications. In Neural Information Processing Systems (NIPS), 2015.
- Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
- Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic Linear Optimization Under Bandit Feedback. In Conference on Learning Theory (COLT), 2008.
- Y. Gordon. Some inequalities for gaussian processes and applications. Israel Journal of Mathematics, 50(4):265–289, 1985.
- Ramon van Handel. Probability in High Dimensions. Technical report, Princeton University, 2014.
- L. Jacob, G. Obozinski, and J. P. Vert. Group Lasso with Overlap and Graph Lasso. In International Conference on Machine Learning (ICML), 2009.
- Adel Javanmard and Hamid Javadi. Dynamic Pricing in High Dimensions. Journal of Machine Learning Research (to appear), 2018.
- Jinzhu Jia and Karl Rohe. Preconditioning the lasso for sign consistency. Electronic Journal of Statistics, 9:1150–1172, 2015.
- Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. CoRR arXiv:1801.04323, 2018.
- John Langford and Tong Zhang. The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits. In Advances in Neural Information Processing Systems (NIPS), 2007.
- Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In International World Wide Web Conference (WWW), 2010.
- Yishay Mansour, Aleksandrs Slivkins, and Zhiwei Steven Wu. Competing bandits: Learning under competition. In Innovations in Theoretical Computer Science (ITCS), 2018.
- S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Reconstruction and subGaussian operators in asymptotic geometric analysis. Geometric and Functional Analysis, 17:1248–1282, 2007.
- Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A Unified Framework for HighDimensional Analysis of M-Estimators with Decomposable Regularizers. Statistical Science, 27(4):538–557, 2012.
- Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, and Zhiwei Steven Wu. The externalities of exploration and how data diversity helps exploitation. In Conference on Learning Theory (COLT), pages 1724–1738, 2018.
- V. Sivakumar, A. Banerjee, and P. Ravikumar. Beyond sub-gaussian measurements: High-dimensional structured estimation with sub-exponential designs. In Advances in Neural Information Processing Systems (NIPS), 2015.
- Vidyashankar Sivakumar and Arindam Banerjee. High-Dimensional Structured Quantile Regression. In International Conference on Machine Learning (ICML), 2017.
- Michel Talagrand. The Generic Chaining. Springer Monographs in Mathematics. Springer Berlin, 2005.
- Michel Talagrand. Upper and Lower Bounds of Stochastic Processes. Springer, 2014.
- Robert Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, 58(1):267–288, 1996.
- Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y Eldar and G. Kutyniok, editors, Compressed Sensing, pages 210–268. Cambridge University Press, Cambridge, nov 2012.
- Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
- Martin Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press (To appear), 2019.
- Ming Yuan and Yi Lin. Model Selection and Estimation in Regression With Grouped Variables. Journal of the Royal Statistical Society, 68(1):49–67, 2006.
Appendix notes (recovered fragments)

- Width is invariant under taking the convex hull.
- Z(e) ∈ R^{T_e×p} denotes the design matrix before the Puffer transformation.
- Protocol: in time step t, a learner chooses one among k contexts {x_{t,1}, ..., x_{t,k}} based on historical data H_{t−1}. Let z_t denote the selected context and g_t the corresponding Gaussian perturbation; in the context of GM3, the centered perturbation g_t − E[g_t] is denoted ξ_t. The learner receives the noisy reward y_t = ⟨z_t, θ∗⟩ + ω_t, where ω_t is an unknown sub-Gaussian noise, and the history is augmented with the new data, i.e., H_t = H_{t−1} ∪ {{x_{t,1}, ..., x_{t,k}}, z_t, y_t}.
- By Lemma 10, ⟨z_t − E[z_t], u⟩ is a c₂σ-sub-Gaussian random variable, i.e., ‖⟨z_t − E[z_t], u⟩‖_{ψ₂} ≤ c₂σ, so the Hoeffding inequality of Lemma 7 applies.
- Estimation error: combining the above steps, by arguments similar to Theorem 4 one obtains estimation error bounds that hold with probability at least 1 − δ exp(−η²w²(A)) − 2δ.
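The sub-Gaussian concentration claim in the fragments above can be illustrated with a quick Monte Carlo check. The constants here are illustrative stand-ins, not the paper's c₂, and the tail bound used is the standard Hoeffding inequality for sub-Gaussian variables:

```python
import numpy as np

rng = np.random.default_rng(2)
p, sigma, n = 6, 0.5, 50000

u = rng.normal(size=p)
u /= np.linalg.norm(u)     # fixed unit direction

# With z_t = mu + sigma * g_t and fixed mean, <z_t - E[z_t], u> ~ N(0, sigma^2),
# hence sigma-sub-Gaussian.
g = sigma * rng.normal(size=(n, p))
proj = g @ u               # centered projections

# Hoeffding-style tail: P(|<z_t - E[z_t], u>| > eps) <= 2 exp(-eps^2 / (2 sigma^2))
eps = 1.0
empirical = np.mean(np.abs(proj) > eps)
bound = 2 * np.exp(-eps ** 2 / (2 * sigma ** 2))
print(empirical, bound)
```

The empirical tail frequency sits well below the sub-Gaussian bound, as the concentration step requires.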
