Structured Linear Contextual Bandits: A Sharp and Geometric Smoothed Analysis

ICML, pp. 9026-9035, 2020.


Abstract:

Bandit learning algorithms typically involve the balance of exploration and exploitation. However, in many practical applications, worst-case scenarios needing systematic exploration are seldom encountered. In this work, we consider a smoothed setting for structured linear contextual bandits where the adversarial contexts are perturbed ...

Introduction
  • Contextual bandits [22] are a powerful framework for sequential decision-making, with many applications to clinical trials, web search, and content optimization.
  • The goal of the algorithm is to select arms to maximize rewards over time, observing only the available contexts and the reward associated with the selected context in each round.
  • Such algorithms typically need to balance exploration, making potentially sub-optimal decisions for the sake of information acquisition, and exploitation, selecting decisions that are optimal based on the current estimate of θ∗.
  • Greedy algorithms, which myopically select the context maximizing the estimated reward under the current parameter estimate θ̂, i.e., choosing x_{t,i_t} = argmax_i ⟨x_{t,i}, θ̂⟩, are known to be sub-optimal in the worst case.
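As a rough illustration of this greedy rule, here is a minimal numpy sketch (an illustration, not the authors' implementation; the dimensions, noise levels, unit-norm random means, and the ridge-regularized least-squares estimator are all illustrative assumptions). Each round draws smoothed contexts and picks the one maximizing the reward predicted by the current estimate, with no explicit exploration:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, T, sigma = 5, 10, 2000, 0.5           # illustrative dimensions and smoothing level

theta_star = rng.standard_normal(p)
theta_star /= np.linalg.norm(theta_star)    # unknown parameter, ||theta*||_2 = 1

V = np.eye(p)                               # ridge-regularized Gram matrix
b = np.zeros(p)                             # running sum of y_t * z_t
regret = 0.0
for t in range(T):
    # Smoothed contexts: unit-norm means (random here, adversarial in the paper)
    # plus independent Gaussian perturbations.
    mu = rng.standard_normal((k, p))
    mu /= np.linalg.norm(mu, axis=1, keepdims=True)
    X = mu + sigma * rng.standard_normal((k, p))

    theta_hat = np.linalg.solve(V, b)       # current least-squares estimate of theta*
    i = int(np.argmax(X @ theta_hat))       # greedy: x_{t,i_t} = argmax_i <x_{t,i}, theta_hat>
    y = X[i] @ theta_star + 0.1 * rng.standard_normal()   # noisy reward

    V += np.outer(X[i], X[i])               # update sufficient statistics
    b += y * X[i]
    regret += np.max(X @ theta_star) - X[i] @ theta_star  # instantaneous regret
```

Under the smoothing, the average per-round regret shrinks as T grows, consistent with the sublinear-regret claim; with adversarially chosen means the same loop applies with μ supplied by the adversary.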
Highlights
  • Contextual bandits [22] are a powerful framework for sequential decision-making, with many applications to clinical trials, web search, and content optimization.
  • The goal of the algorithm is to select arms to maximize rewards over time, observing only the available contexts and the reward associated with the selected context in each round.
  • Greedy algorithms, which myopically select the context maximizing the estimated reward under the current parameter estimate θ̂, i.e., choosing x_{t,i_t} = argmax_i ⟨x_{t,i}, θ̂⟩, are known to be sub-optimal in the worst case.
  • The work of [21, 27] provides a smoothed analysis of the greedy algorithm under the following setting: in each round the contexts x_{t,i}, 1 ≤ i ≤ k, are of the form μ_{t,i} + g_{t,i}, where the μ_{t,i} ∈ R^p are possibly selected adversarially under the constraint ‖μ_{t,i}‖_2 ≤ 1, and the g_{t,i} ∼ N(0, σ²I_{p×p}) are random Gaussian perturbations independent of the μ_{t,i}.
  • The answer is in the result of Lemma 3, where we show that even in the adversarial setting the minimum eigenvalue of the covariance matrix of each row of the design matrix is no worse than in the completely stochastic Gaussian setting.
  • While previous work has found it difficult to extend exploration strategies to the structured setting while simultaneously exploiting the structure in the parameter, our analysis shows that a simple greedy algorithm achieves sublinear regret under the smoothed bandits framework.
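The eigenvalue phenomenon behind this can be checked numerically. In this hedged sketch (illustrative constants, not the paper's proof or code), the adversary gives all k arms the identical mean and the learner selects greedily against a fixed direction, yet the Gaussian perturbation keeps the minimum eigenvalue of the selected context's covariance bounded away from zero, on the order of σ²/log k:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, sigma, n = 8, 20, 0.3, 5000           # illustrative dimensions and sample size

# Adversarial means: all k arms share one unit-norm mean (no diversity from the means).
mu = np.tile(rng.standard_normal(p), (k, 1))
mu /= np.linalg.norm(mu, axis=1, keepdims=True)

theta_hat = rng.standard_normal(p)          # fixed direction the greedy rule scores against
Z = np.empty((n, p))
for t in range(n):
    X = mu + sigma * rng.standard_normal((k, p))    # smoothed contexts
    Z[t] = X[np.argmax(X @ theta_hat)]              # greedily selected row of the design

cov = np.cov(Z, rowvar=False)               # empirical covariance of selected contexts
lam_min = np.linalg.eigvalsh(cov).min()
```

Intuitively, directions orthogonal to θ̂ retain the full σ² perturbation variance, while the selected component along θ̂ behaves like the maximum of k Gaussians, whose variance decays only like 1/log k; hence a σ²/log k-type lower bound survives even against an adversary.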
Results
  • The authors' analysis significantly improves on the bounds obtained in [21].
Conclusion
  • The authors analyzed the structured linear contextual bandit problem under the smoothed analysis framework.
  • The authors' analysis significantly improves on the bounds obtained in [21].
  • While previous work has found it difficult to extend exploration strategies to the structured setting while simultaneously exploiting the structure in the parameter, the analysis shows that a simple greedy algorithm achieves sublinear regret under the smoothed bandits framework.
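To illustrate how the greedy rule can exploit structure in θ∗, here is a hedged numpy sketch (not the authors' algorithm or tuning: the ISTA Lasso solver, the refit schedule, and the regularization λ ∝ √(log p / t) are illustrative assumptions). When θ∗ is sparse, the greedy learner simply swaps least squares for an ℓ1-regularized estimate:

```python
import numpy as np

def lasso_ista(Z, y, lam, iters=200):
    """Minimize (1/2n)||y - Zw||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    n, p = Z.shape
    w = np.zeros(p)
    L = np.linalg.eigvalsh(Z.T @ Z / n).max() + 1e-12       # Lipschitz const. of gradient
    for _ in range(iters):
        w = w - (Z.T @ (Z @ w - y) / n) / L                 # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft-threshold (prox of l1)
    return w

rng = np.random.default_rng(2)
p, k, T, sigma, s = 30, 10, 800, 0.5, 3     # illustrative sizes; theta* is s-sparse
theta_star = np.zeros(p)
theta_star[:s] = 1.0 / np.sqrt(s)

Z_hist, y_hist = [], []
theta_hat = np.zeros(p)
for t in range(T):
    mu = rng.standard_normal((k, p))
    mu /= np.linalg.norm(mu, axis=1, keepdims=True)
    X = mu + sigma * rng.standard_normal((k, p))            # smoothed contexts
    i = int(np.argmax(X @ theta_hat))                       # greedy choice, no exploration
    y = X[i] @ theta_star + 0.1 * rng.standard_normal()     # noisy reward
    Z_hist.append(X[i]); y_hist.append(y)
    if (t + 1) % 100 == 0:                                  # periodically re-fit the Lasso
        theta_hat = lasso_ista(np.array(Z_hist), np.array(y_hist),
                               lam=0.05 * np.sqrt(np.log(p) / (t + 1)))
```

The only change relative to the unstructured greedy learner is the estimator: the smoothing still supplies design diversity, while the ℓ1 penalty exploits the sparsity of θ∗ so that far fewer than p samples per direction are needed.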
Funding
  • The research was supported by NSF grants OAC-1934634, IIS-1908104, IIS-1563950, IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, FAI-1939606, a Google Faculty Research Award, a J.P. Morgan Faculty Award, and a Mozilla research grant.
References
  • Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online Least Squares Estimation with Self-Normalized Processes: An Application to Bandit Problems. In Conference on Learning Theory (COLT), 2011.
  • Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
  • Shipra Agrawal and Navin Goyal. Thompson Sampling for Contextual Bandits with Linear Payoffs. In International Conference on Machine Learning (ICML), 2013.
  • Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse Prediction with the k-Support Norm. In Neural Information Processing Systems (NIPS), 2012.
  • Arindam Banerjee, Sheng Chen, Farideh Fazayeli, and Vidyashankar Sivakumar. Estimation with Norm Regularization. In Neural Information Processing Systems (NIPS), 2014.
  • Arindam Banerjee, Qilong Gu, Vidyashankar Sivakumar, and Zhiwei Steven Wu. Random quadratic forms with dependence: Applications to restricted isometry and beyond. In Neural Information Processing Systems (NIPS), 2019.
  • Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. CoRR arXiv:1704.09011, 2018. Working paper.
  • Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
  • Alberto Bietti, Alekh Agarwal, and John Langford. Practical evaluation and optimization of contextual bandit algorithms. CoRR arXiv:1802.04064, 2018.
  • Sarah Bird, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. Exploring or exploiting? Social and ethical implications of autonomous experimentation. In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.
  • Emmanuel J. Candes and Benjamin Recht. Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
  • Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The Convex Geometry of Linear Inverse Problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
  • Sheng Chen and Arindam Banerjee. Structured Estimation with Atomic Norms: General Bounds and Applications. In Neural Information Processing Systems (NIPS), 2015.
  • Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
  • Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic Linear Optimization Under Bandit Feedback. In Conference on Learning Theory (COLT), 2008.
  • Y. Gordon. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics, 50(4):265–289, 1985.
  • Ramon van Handel. Probability in High Dimensions. Technical report, Princeton University, 2014.
  • L. Jacob, G. Obozinski, and J. P. Vert. Group Lasso with Overlap and Graph Lasso. In International Conference on Machine Learning (ICML), 2009.
  • Adel Javanmard and Hamid Javadi. Dynamic Pricing in High Dimensions. Accepted in JMLR, 2018.
  • Jinzhu Jia and Karl Rohe. Preconditioning the lasso for sign consistency. Electronic Journal of Statistics, 9:1150–1172, 2015.
  • Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. CoRR arXiv:1801.04323, 2018.
  • John Langford and Tong Zhang. The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits. In Neural Information Processing Systems (NIPS), 2007.
  • Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In International World Wide Web Conference (WWW), 2010.
  • Yishay Mansour, Aleksandrs Slivkins, and Zhiwei Steven Wu. Competing bandits: Learning under competition. In Innovations in Theoretical Computer Science (ITCS), 2018.
  • S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geometric and Functional Analysis, 17:1248–1282, 2007.
  • Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers. Statistical Science, 27(4):538–557, 2012.
  • Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, and Zhiwei Steven Wu. The externalities of exploration and how data diversity helps exploitation. In Conference on Learning Theory (COLT), pages 1724–1738, 2018.
  • V. Sivakumar, A. Banerjee, and P. Ravikumar. Beyond sub-Gaussian measurements: High-dimensional structured estimation with sub-exponential designs. In Neural Information Processing Systems (NIPS), 2015.
  • Vidyashankar Sivakumar and Arindam Banerjee. High-Dimensional Structured Quantile Regression. In International Conference on Machine Learning (ICML), 2017.
  • Michel Talagrand. The Generic Chaining. Springer Monographs in Mathematics. Springer, Berlin, 2005.
  • Michel Talagrand. Upper and Lower Bounds of Stochastic Processes. Springer, 2014.
  • Robert Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, 58(1):267–288, 1996.
  • Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, pages 210–268. Cambridge University Press, Cambridge, 2012.
  • Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
  • Martin Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
  • Ming Yuan and Yi Lin. Model Selection and Estimation in Regression With Grouped Variables. Journal of the Royal Statistical Society, 68(1):49–67, 2006.