Efficient Contextual Bandits with Continuous Actions

NeurIPS 2020.

Abstract:

We create a computationally tractable algorithm for contextual bandits with continuous actions having unknown structure. Our reduction-style algorithm composes with most supervised learning representations. We prove that it works in a general sense and verify the new functionality with large-scale experiments.

Introduction
  • In contextual bandit learning [6, 1, 37, 3], an agent repeatedly observes its environment, chooses an action, and receives reward feedback, with the goal of optimizing cumulative reward (a minimal sketch of this interaction loop follows this list).
  • In operating systems, when a computer makes a connection over the network, its packet send rate may be adjusted in response to the current network status [28].
  • All of these may be optimized based on feedback and context.
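The sketch below illustrates this interaction protocol. It is a minimal, hedged example: the environment and learner interfaces (observe, act, reward, update) are illustrative assumptions, not part of the paper.

    # Sketch of the contextual bandit protocol: observe a context, choose an
    # action, receive feedback only for the chosen action, and learn from it.
    def run_contextual_bandit(env, learner, T):
        total_reward = 0.0
        for t in range(T):
            x = env.observe()            # current context (e.g., network status)
            a = learner.act(x)           # chosen action (e.g., packet send rate)
            r = env.reward(x, a)         # bandit feedback: reward of a only
            learner.update(x, a, r)      # update from partial feedback
            total_reward += r            # objective: cumulative reward
        return total_reward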
Highlights
  • In contextual bandit learning [6, 1, 37, 3], an agent repeatedly observes its environment, chooses an action, and receives reward feedback, with the goal of optimizing cumulative reward
  • We propose CATS, a new algorithm for contextual bandits with continuous actions (Algorithm 1)
  • The focus of this paper is on contextual bandit algorithms with computational efficiency guarantees; in Appendix D, we present several extensions of our results to general policy classes
  • We evaluate our approach on six large-scale regression datasets, where regression predictions are treated as continuous actions in A = [0, 1]
  • Contextual bandit learning for continuous actions with unknown structure is quite tractable via the CATS algorithm, as we have shown theoretically and empirically
  • Our study of efficient contextual bandits with continuous actions can be applied to a wide range of applications, such as precision medicine, personalized recommendations, data center optimization, operating systems, networking, etc
Methods
  • The authors evaluate the approach on six large-scale regression datasets, where regression predictions are treated as continuous actions in A = [0, 1].
  • To simulate contextual bandit learning, the authors first perform scaling and offsetting to ensure the yt's are in [0, 1].
  • Every regression example (xt, yt) is converted to a bandit example (xt, ℓt), where ℓt(a) = |a − yt| is the absolute loss induced by yt (see the simulation sketch after this list).
  • When action at is taken, the algorithm receives bandit feedback ℓt(at), as opposed to the usual label yt.
  • The authors include a synthetic dataset ds, created by a linear regression model with additive Gaussian noise.
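A minimal sketch of this simulation, under the assumptions just listed (targets scaled to [0, 1], absolute loss, bandit feedback only at the chosen action). The function names are illustrative, not from the paper.

    def make_bandit_example(x, y, y_min, y_max):
        """Scale/offset the regression target into [0, 1] and build the loss."""
        y01 = (y - y_min) / (y_max - y_min)
        loss = lambda a: abs(a - y01)        # absolute loss induced by y_t
        return x, loss

    def bandit_round(policy, x, loss):
        """One simulated round: only the loss of the chosen action is revealed."""
        a_t = policy(x)                      # continuous action in A = [0, 1]
        return a_t, loss(a_t)                # bandit feedback, not the label y_t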
Conclusion
  • The smoothing approach has several appealing properties. The authors look for a good interval of actions, which is possible even when the best single action is impossible to find (a small illustration of this smoothing follows this list).
  • The approach is principled, leading to specific, interpretable guarantees. Contextual bandit learning for continuous actions with unknown structure is quite tractable via the CATS algorithm, as the authors have shown theoretically and empirically.
  • The authors' study of efficient contextual bandits with continuous actions can be applied to a wide range of applications, such as precision medicine, personalized recommendations, data center optimization, operating systems, networking, etc
  • Many of these applications have potential for significant positive impact to society, but these methods can cause unintended harms, for example by creating filter bubble effects when deployed in recommendation engines.
  • The authors are certainly mindful of these issues, and encourage practitioners to consider these consequences when deploying interactive learning systems
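To make the "good interval of actions" idea above concrete, here is a small, hedged illustration of the smoothed objective (cf. the smoothing approach of [35]): instead of the loss at a single action a, one averages the loss over [a − h, a + h] ∩ [0, 1]. The numerical grid approximation below is ours, not the paper's.

    import numpy as np

    def smoothed_loss(loss_fn, a, h, grid=1001):
        """Average loss over [a-h, a+h] ∩ [0, 1], approximated on a uniform grid."""
        lo, hi = max(a - h, 0.0), min(a + h, 1.0)
        pts = np.linspace(lo, hi, grid)
        return float(np.mean(loss_fn(pts)))

    # Example with the absolute loss around y_t = 0.4: the smoothed loss at the
    # optimum is small but nonzero, reflecting that a whole interval is played.
    loss_fn = lambda a: np.abs(a - 0.4)
    print(smoothed_loss(loss_fn, a=0.4, h=0.1))   # ≈ 0.05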
Related work
  • Contextual bandits are quite well-understood for small, discrete action spaces, with rich theoretical results and successful deployments in practice. To handle large or infinite action spaces, most prior work either makes strong parametric assumptions such as linearity, or posits some continuity assumptions such as Lipschitzness. More background can be found in [16, 47, 38].

    Bandits with Lipschitz assumptions were introduced in [5], and optimally solved in the worst case by [31]. [32, 33, 17, 46] achieve optimal data-dependent regret bounds, while several papers relax global smoothness assumptions with various local definitions [7, 32, 33, 17, 45, 41, 25]. This literature mainly focuses on the non-contextual version, except for [46, 34, 18, 52] (which only consider a fixed policy set Π). As argued in [35], the smoothing-based approach is productive in these settings, and extends far beyond, e.g., to instances where the global optimum is at a discontinuity.
Reference
  • Naoki Abe, Alan W Biermann, and Philip M Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
  • Alekh Agarwal, Sarah Bird, Markus Cozowicz, Luong Hoang, John Langford, Stephen Lee, Jiaji Li, Dan Melamed, Gal Oshri, Oswaldo Ribas, Siddhartha Sen, and Alex Slivkins. Making contextual decisions with low technical debt. arxiv:1606.03966, 2017.
  • Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014.
  • Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, 2017.
  • Rajeev Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 1995.
  • Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
  • Peter Auer, Ronald Ortner, and Csaba Szepesvari. Improved rates for the stochastic continuumarmed bandit problem. In Conference on Learning Theory, 2007.
  • Peter L Bartlett, Varsha Dani, Thomas Hayes, Sham Kakade, Alexander Rakhlin, and Ambuj Tewari. High-probability regret bounds for bandit online linear optimization. In Conference on Learning Theory, 2008.
  • Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 129–138, 2009.
  • Alina Beygelzimer, John Langford, and Pradeep Ravikumar. Error-correcting tournaments. In International Conference on Algorithmic Learning Theory, pages 247–262, 2009.
  • Alina Beygelzimer, John Langford, and Bianca Zadrozny. Weighted one-against-all. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI), 2005.
  • Guy Blanc, Jane Lange, and Li-Yang Tan. Top-down induction of decision trees: rigorous guarantees and inherent limitations. arXiv preprint arXiv:1911.07375, 2019.
  • Avrim Blum. Rank-r decision trees are a subclass of r-decision lists. Information Processing Letters, 42(4):183–185, 1992.
  • Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT, pages 203–208, 1999.
  • Alon Brutzkus, Amit Daniely, and Eran Malach. On the optimality of trees generated by id3. arXiv preprint arXiv:1907.05444, 2019.
  • Sebastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012.
  • Sebastien Bubeck, Remi Munos, Gilles Stoltz, and Csaba Szepesvari. X-armed bandits. Journal of Machine Learning Research, 2011.
  • Nicolo Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and Sebastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. In Conference on Learning Theory, 2017.
  • Guanhua Chen, Donglin Zeng, and Michael R Kosorok. Personalized dose finding using outcome weighted learning. Journal of the American Statistical Association, 111(516):1509–1521, 2016.
  • Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Uncertainty in Artificial Intelligence, 2011.
  • Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104, 2011.
  • Andrzej Ehrenfeucht and David Haussler. Learning decision trees from random examples. Information and Computation, 82(3):231–246, 1989.
  • David A Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.
  • Parikshit Gopalan, Adam Tauman Kalai, and Adam R Klivans. Agnostically learning decision trees. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 527–536, 2008.
  • Jean-Bastien Grill, Michal Valko, and Remi Munos. Black-box optimization of noisy functions with unknown smoothness. In Advances in Neural Information Processing Systems, 2015.
  • Thomas Hancock, Tao Jiang, Ming Li, and John Tromp. Lower bounds on learning decision lists and trees. Information and Computation, 126(2):114–122, 1996.
  • Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
  • Nathan Jay, Noga Rotman, Brighten Godfrey, Michael Schapira, and Aviv Tamar. A deep reinforcement learning perspective on internet congestion control. In International Conference on Machine Learning, pages 3050–3059, 2019.
  • Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pages 1243–1251, 2018.
  • TE Klein, RB Altman, Niclas Eriksson, BF Gage, SE Kimmel, MT Lee, NA Limdi, D Page, DM Roden, MJ Wagner, et al. Estimation of the warfarin dose with clinical and pharmacogenetic data. New England Journal of Medicine, 360(8):753–764, 2009.
  • Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, 2004.
  • Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Symposium on Theory of Computing, 2008.
  • Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and experts in metric spaces. Journal of the ACM, 2019. To appear. Merged and revised version of conference papers in ACM STOC 2008 and ACM-SIAM SODA 2010. Also available at http://arxiv.org/abs/1312.1277.
  • Andreas Krause and Cheng S. Ong. Contextual gaussian process bandit optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2447–2455. Curran Associates, Inc., 2011.
  • Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: smoothing, zooming, and adapting. In Conference on Learning Theory, 2019.
  • Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the fourier spectrum. SIAM Journal on Computing, 22(6):1331–1348, 1993.
  • John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems, 2007.
  • Tor Lattimore and Csaba Szepesvari. Bandit algorithms. preprint, 2018.
  • Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control. In Advances in Neural Information Processing Systems, pages 3814–3823, 2018.
  • Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
  • Stanislav Minsker. Estimation of extreme values and associated level sets of a regression function via selective sampling. In Conference on Learning Theory, 2013.
  • Francesco Orabona and David Pal. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 577–585, 2016.
  • Alexander Rakhlin and Karthik Sridharan. Bistro: An efficient relaxation-based method for contextual bandits. In ICML, pages 1977–1985, 2016.
  • Ronald L Rivest. Learning decision lists. Machine learning, 2(3):229–246, 1987.
  • Aleksandrs Slivkins. Multi-armed bandits on implicit metric spaces. In Advances in Neural Information Processing Systems, 2011.
  • Aleksandrs Slivkins. Contextual bandits with similarity information. The Journal of Machine Learning Research, 2014.
  • Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286, 2019.
  • Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
  • Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert Schapire. Efficient algorithms for adversarial contextual learning. In International Conference on Machine Learning, pages 2159– 2168, 2016.
  • Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517.
  • Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 1995.
  • Tianyu Wang, Weicheng Ye, Dawei Geng, and Cynthia Rudin. Towards practical lipschitz stochastic bandits. arXiv preprint arXiv:1901.09277, 2019.
  • 2. Instead of finding a policy π that approximately minimizes Vt(πh) for a fixed h, the algorithm first finds an approximate minimizer of Vt(πh) for every h ∈ H (namely πt,h), and selects πt from the set {πt,h : h ∈ H} using a structural risk minimization [51] procedure (line 9). Specifically, the choice of ht+1 ensures that the expected loss of πt+1,ht+1 is competitive with those of the πh's, for all π in Π and all h in H.
  • 2. The above uniform-over-h smoothed regret rate in terms of h and T, i.e. O(T^(2/3)/h^(1/3)), is unimprovable in general, and is therefore Pareto optimal. This can be seen from the following result from [35, Theorem 11]: there exists a continuous-action CB problem with action space [0, 1] and constants c, T0 > 0 such that for any algorithm and any T ≥ T0, there exist two bandwidths h1
  • 2. Given a finite set of policies Π, with probability 1 − δ, for all π in Π, the deviation Vt(πh) − V(πh) is bounded by a term scaling with ln |Π| + ln t, where Vt(πh) is the smoothed inverse-propensity (IPS) estimate built from terms of the form ℓ(at) · 1(|a − at| ≤ h) / (vol([a − h, a + h] ∩ [0, 1]) · Pt(at)). (A small numerical sketch of this estimator appears at the end of this list.)
  • Here vol(·) denotes the Lebesgue measure. Therefore, if, say, at is in [0, h], the induced IPS cost function ct can take many possible positive values for a in the region [0, h], depending on the value of vol([a − h, a + h] ∩ [0, 1]). It turns out that enforcing the piecewise-constant structure of the cost vector (as is done by restricting the CSMC vectors to only consider entries a in AK ∩ [h, 1 − h]) is vital to achieving O(log K) per-example training time.
  • 2. If αd.id < v.id < βd.id, then for all a ∈ range(Tv), c(a) = c∗.
  • 3. If v.cost is available, it must equal c(Tv(x)); in addition, Return cost(v, αd, βd) returns c(Tv(x)) correctly.
  • 1. If v ≠ αd and v ≠ βd, then from the first two items we have just shown, we can decide the value of cv(Tv(x)) directly by comparison with the id's of α and β, which is consistent with the implementation of Return cost; also note that in this case, v.cost gets assigned to Return cost(v, αd, βd), which is also cv(Tv(x)).
  • 2. Otherwise, v = αd or v = βd. In this case, Return cost returns the stored cost of v, i.e. v.cost. It suffices to show that αd.cost (resp. βd.cost) is indeed c(Tαd(x)) (resp. c(Tβd(x))), which we show by induction. Base case: when d = D, αD.cost = α.cost (resp. βD.cost = β.cost) is directly calculated in line 2 of Algorithm 10, and is indeed c(label(α)) = cv(α) (resp. c(label(β)) = cv(β)).
  • 2. Generate the ε-greedy action distribution, take an action, and create (xt, ct) implicitly by representing ct as (amin, amax, c∗): these steps take O(1) time, as they are based on manipulations of a piecewise-constant density with only 3 pieces.
  • 3. Online train tree(T, (xt, ct)): this takes O(D) = O(log K) time, because at each of the D levels, there are at most 2 nodes to be updated, and for every such node, Return cost takes O(1) time to retrieve the costs of both subtrees.
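As referenced above, the following is a small, hedged sketch of the smoothed ε-greedy action density (piecewise constant with at most 3 pieces) and the corresponding smoothed inverse-propensity (IPS) loss estimate described in the fragments above. The function names and this Python rendering are ours; the exact constants and notation in the paper's estimator may differ.

    import numpy as np

    def interval_vol(center, h):
        """Length of [center-h, center+h] ∩ [0, 1] (Lebesgue measure)."""
        return min(center + h, 1.0) - max(center - h, 0.0)

    def smoothed_density(a, a_hat, h, eps):
        """ε-greedy smoothed density P_t(a): Uniform[0, 1] with prob. ε, else
        Uniform on [a_hat-h, a_hat+h] ∩ [0, 1]; piecewise constant, ≤ 3 pieces."""
        p = eps * 1.0                                  # exploration component
        if abs(a - a_hat) <= h:                        # exploitation component
            p += (1.0 - eps) / interval_vol(a_hat, h)
        return p

    def sample_action(a_hat, h, eps, rng):
        """Draw an action from the ε-greedy smoothed distribution."""
        if rng.random() < eps:
            return rng.uniform(0.0, 1.0)
        return rng.uniform(max(a_hat - h, 0.0), min(a_hat + h, 1.0))

    def ips_loss_estimate(a_pi, a_t, loss_at, h, p_t):
        """Smoothed IPS estimate: loss(a_t) · 1(|a_pi - a_t| ≤ h)
        / (vol([a_pi-h, a_pi+h] ∩ [0, 1]) · P_t(a_t))."""
        if abs(a_pi - a_t) > h:
            return 0.0
        return loss_at / (interval_vol(a_pi, h) * p_t)

    # Usage: log one round, then estimate the loss of a candidate action a_pi.
    rng = np.random.default_rng(0)
    a_hat, h, eps = 0.7, 0.1, 0.05
    a_t = sample_action(a_hat, h, eps, rng)
    p_t = smoothed_density(a_t, a_hat, h, eps)
    print(ips_loss_estimate(a_pi=0.65, a_t=a_t, loss_at=0.3, h=h, p_t=p_t))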