# Adversarial Blocking Bandits

NIPS 2020

Abstract

We consider a general adversarial multi-armed blocking bandit setting where each played arm can be blocked (unavailable) for some time periods and the reward per arm is given at each time period adversarially without obeying any distribution. The setting models scenarios of allocating scarce limited supplies (e.g., arms) where the supplie...

Introduction

- This paper investigates the blocking bandit model, where pulling an arm blocks that arm for a deterministic number of rounds.
- Apart from the adversarial blocking bandit setting, the authors investigate two additional versions of the model: (i) the offline MAXREWARD problem, where all rewards and blocking durations are known in advance; and (ii) the online version of MAXREWARD, in which the rewards and blocking durations of the arms at each time step are revealed before an arm is chosen.
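The blocking constraint can be made concrete with a small sketch. It evaluates a given pull sequence for the offline MAXREWARD setting, where all rewards and blocking durations are known. The function name `total_reward`, the table layout `rewards[t][k]` and `durations[t][k]`, and the convention that an arm pulled at round t with duration d is next available at round t + d are assumptions made for illustration, not definitions taken from the paper.

```python
from typing import Optional, Sequence


def total_reward(
    rewards: Sequence[Sequence[float]],
    durations: Sequence[Sequence[int]],
    pulls: Sequence[Optional[int]],
) -> float:
    """Sum the rewards of a pull sequence, enforcing the blocking constraint.

    rewards[t][k]   -- reward of arm k at round t (known in the offline problem)
    durations[t][k] -- blocking duration of arm k if it is pulled at round t
    pulls[t]        -- index of the arm pulled at round t, or None for no pull

    Assumed convention: an arm pulled at round t with duration d is next
    available at round t + d (so d = 1 means the arm is never blocked).
    """
    num_arms = len(rewards[0])
    next_free = [0] * num_arms  # first round at which each arm is available
    total = 0.0
    for t, arm in enumerate(pulls):
        if arm is None:
            continue
        if t < next_free[arm]:
            raise ValueError(f"arm {arm} is still blocked at round {t}")
        total += rewards[t][arm]
        next_free[arm] = t + durations[t][arm]
    return total


if __name__ == "__main__":
    rewards = [[1.0, 0.5]] * 3
    durations = [[2, 1]] * 3
    # Arm 0 is blocked at round 1, so the sequence alternates between arms.
    print(total_reward(rewards, durations, [0, 1, 0]))
```

The offline MAXREWARD problem then amounts to searching for the feasible pull sequence that maximizes this total.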

Highlights

- In this paper we propose the adversarial blocking bandit setting, where both the sequence of rewards and the blocking durations per arm can be arbitrary
- The Repeating Greedy Algorithm (RGA) requires knowledge of the path variation budget B_T in order to set its parameter ∆_T properly. For the case of unknown B_T, we propose META-RGA, a meta-bandit algorithm that uses Exp3 to choose among instances of RGA, each with its ∆_T tuned for a different variation budget
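The meta-bandit idea can be illustrated with a generic Exp3 learner whose arms stand for candidate variation budgets. This is a minimal sketch of standard Exp3 [Auer et al, 2002], not the authors' META-RGA: the class name `Exp3`, the exploration rate `gamma`, and the assumption that each meta-round yields a reward rescaled to [0, 1] are all illustrative choices, and the RGA instances themselves are not shown.

```python
import math
import random
from typing import List, Optional


class Exp3:
    """Standard Exp3 over a finite set of arms, with rewards in [0, 1]."""

    def __init__(self, num_arms: int, gamma: float,
                 rng: Optional[random.Random] = None) -> None:
        if not 0.0 < gamma <= 1.0:
            raise ValueError("gamma must lie in (0, 1]")
        self.num_arms = num_arms
        self.gamma = gamma
        self.weights = [1.0] * num_arms
        self.rng = rng if rng is not None else random.Random()

    def probabilities(self) -> List[float]:
        """Mix the normalized weights with uniform exploration."""
        total = sum(self.weights)
        return [
            (1.0 - self.gamma) * w / total + self.gamma / self.num_arms
            for w in self.weights
        ]

    def select(self) -> int:
        """Sample an arm index according to the current probabilities."""
        probs = self.probabilities()
        return self.rng.choices(range(self.num_arms), weights=probs)[0]

    def update(self, arm: int, reward: float) -> None:
        """Importance-weighted exponential update for the arm that was played."""
        probs = self.probabilities()
        estimate = reward / probs[arm]
        self.weights[arm] *= math.exp(self.gamma * estimate / self.num_arms)
        # Rescale so the largest weight is 1; probabilities are unchanged.
        largest = max(self.weights)
        self.weights = [w / largest for w in self.weights]


if __name__ == "__main__":
    # Hypothetical use: arm i stands for an RGA instance tuned for budget i.
    meta = Exp3(num_arms=3, gamma=0.1, rng=random.Random(0))
    chosen = meta.select()
    meta.update(chosen, reward=0.7)  # normalized reward of the chosen instance
    print(meta.probabilities())
```

In such a scheme, each call to `select` would pick which tuned instance to run for the next stretch of rounds, and the reward collected there, rescaled to [0, 1], would be passed to `update`.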

Results

- In the optimization setting where the mean rewards and blocking durations are known, Basu et al [2019] consider a simpler version of the MAXREWARD problem for their setting, show that the problem is as hard as PINWHEEL Scheduling on dense instances [Jacobs and Longo, 2014], and prove that a simple greedy algorithm achieves an approximation ratio of (1 − 1/e − O(1/T)), where T is the total number of time periods.
- In the bandit setting, they provide lower and upper regret bounds that depend on the number of arms, the mean rewards, and log(T).
- A very recent work [Basu et al, 2020] extends the stochastic blocking bandit to a contextual setting where a context is sampled according to a distribution in each time period and the reward per arm is drawn from a distribution whose mean depends on the pulled arm and the given context.
- Similar to the work of Basu et al [2019], Basu et al [2020] derive an online algorithm with an approximation ratio that depends on the maximum blocking duration and provide upper and lower α-regret bounds of O(log T) and Ω(log T), respectively.
- The authors show that Greedy-BAA provides an approximation guarantee to the offline MAXREWARD problem that depends on the blocking durations and the variation budget.
- MAXREWARD problem with path variation budget B_T = 0 and homogeneous blocking durations per arm.
- The authors compare the performance of a policy with respect to the dynamic oracle algorithm that returns the offline optimal solution of MAXREWARD. The authors define the α-regret under a policy π ∈ P as the worst-case difference between an α-optimal sequence of actions and the expected performance under policy π.
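The verbal definition of the α-regret can be written as a formula. The notation below is assumed for illustration rather than copied from the paper: r(·) denotes the cumulative reward over T rounds, π* is the offline optimal solution of MAXREWARD returned by the dynamic oracle, and the supremum ranges over the admissible reward sequences X and blocking-duration sequences D.

```latex
R^{\alpha}_{\pi}(T) \;=\; \sup_{(X,\, D)} \Big\{ \alpha \, r(\pi^{*}) \;-\; \mathbb{E}\big[ r(\pi) \big] \Big\}
```

With α = 1 this reduces to the usual regret against the dynamic oracle.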

Conclusion

- The authors show that if either the variation budget or the maximum blocking duration is large, the lower bound of the α-regret is Θ(T).
- The authors discuss a potential lower bound for the α-regret of the adversarial blocking bandit problem in the case of B_T ∈ o(KT) and D ∈ O(1).

Related work

- Stochastic Blocking Bandits. The most relevant work to our setting is the stochastic blocking bandit model. As mentioned before, Basu et al [2019] introduce and study this model, where the reward in each time period is drawn from a stochastic distribution with mean μ_k for each arm k and the blocking duration of each arm k is fixed across all time periods (i.e., D^t_k = D_k for all t and k). In the optimization setting where the mean rewards and blocking durations are known, they consider a simpler version of the MAXREWARD problem for their setting, show that the problem is as hard as PINWHEEL Scheduling on dense instances [Jacobs and Longo, 2014], and prove that a simple greedy algorithm (see Algorithm 1) achieves an approximation ratio of (1 − 1/e − O(1/T)), where T is the total number of time periods. In the bandit setting, they provide lower and upper regret bounds that depend on the number of arms, the mean rewards, and log(T). A very recent work [Basu et al, 2020] extends the stochastic blocking bandit to a contextual setting where a context is sampled according to a distribution in each time period and the reward per arm is drawn from a distribution whose mean depends on the pulled arm and the given context. Similar to the work of Basu et al [2019], Basu et al [2020] derive an online algorithm with an approximation ratio that depends on the maximum blocking duration and provide upper and lower α-regret bounds of O(log T) and Ω(log T), respectively. However, the results from these models cannot be directly applied to the adversarial setting due to the differences between the stochastic and adversarial reward generation schemes.
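A minimal sketch of a greedy rule for the stochastic setting with known mean rewards and fixed blocking durations follows. It assumes the greedy algorithm pulls, in every round, the available arm with the highest mean reward; the referenced Algorithm 1 is not reproduced in this summary, so that reading, the function name `greedy_schedule`, and the convention that an arm pulled at round t with duration d is next available at round t + d are assumptions.

```python
from typing import List, Optional, Sequence


def greedy_schedule(
    means: Sequence[float],
    durations: Sequence[int],
    horizon: int,
) -> List[Optional[int]]:
    """Pull the available arm with the highest mean reward in every round.

    means[k]     -- known mean reward of arm k
    durations[k] -- fixed blocking duration D_k of arm k
    Returns the arm pulled in each round, or None when every arm is blocked.
    """
    next_free = [0] * len(means)  # first round at which each arm is available
    schedule: List[Optional[int]] = []
    for t in range(horizon):
        available = [k for k in range(len(means)) if next_free[k] <= t]
        if not available:
            schedule.append(None)
            continue
        best = max(available, key=lambda k: means[k])
        schedule.append(best)
        next_free[best] = t + durations[best]  # assumed blocking convention
    return schedule


if __name__ == "__main__":
    print(greedy_schedule(means=[1.0, 0.5], durations=[2, 1], horizon=4))
    # → [0, 1, 0, 1]
```

In the example, arm 0 has the higher mean but is blocked every other round, so the schedule falls back to arm 1 in between.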

Funding

- Nicholas Bishop was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Partnership grant.
- Debmalya Mandal was supported by a Columbia Data Science Institute Post-Doctoral Fellowship.

Reference

- Shipra Agrawal and Nikhil R Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 989–1006, 2014.
- Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
- Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pages 138–158, 2019.
- Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207–216. IEEE, 2013.
- Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. Resourceful contextual bandits. In Conference on Learning Theory, pages 1109–1134, 2014.
- Soumya Basu, Rajat Sen, Sujay Sanghavi, and Sanjay Shakkottai. Blocking bandits. In Advances in Neural Information Processing Systems 32, pages 4784–4793, 2019.
- Soumya Basu, Orestis Papadigenopoulos, Constantine Caramanis, and Sanjay Shakkottai. Contextual blocking bandits. arXiv, abs/2003.03426, 2020.
- Marco Bender, Clemens Thielen, and Stephan Westphal. Online interval scheduling with a bounded number of failures. Journal of Scheduling, 20(5):443–457, 2017.
- Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with nonstationary rewards. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 199–207, 2014.
- Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, and Eli Upfal. Mortal multi-armed bandits. In Advances in neural information processing systems, pages 273–280, 2009.
- Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed bandit with budget constraint and variable costs. In Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
- Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979.
- Chien-Ju Ho and Jennifer Wortman Vaughan. Online task assignment in crowdsourcing markets. In Twenty-sixth AAAI conference on artificial intelligence, 2012.
- N. Immorlica, K. A. Sankararaman, R. Schapire, and A. Slivkins. Adversarial bandits with knapsacks. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 202–219, 2019.
- Tobias Jacobs and Salvatore Longo. A new perspective on the windows scheduling problem. arXiv, abs/1410.7237, 2014.
- Satyen Kale, Chansoo Lee, and David Pal. Hardness of online sleeping combinatorial optimization problems. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2181–2189. Curran Associates, Inc., 2016.
- A Karthik, Arpan Mukhopadhyay, and Ravi R Mazumdar. Choosing among heterogeneous server clouds. Queueing Systems, 85(1-2):1–29, 2017.
- Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2–3):245–272, 2010.
- Antoon WJ Kolen, Jan Karel Lenstra, Christos H Papadimitriou, and Frits CR Spieksma. Interval scheduling: A survey. Naval Research Logistics (NRL), 54(5):530–543, 2007.
- Mikhail Y Kovalyov, CT Ng, and TC Edwin Cheng. Fixed interval scheduling: Models, applications, computational complexity and algorithms. European journal of operational research, 178(2):331–342, 2007.
- Hiroyuki Miyazawa and Thomas Erlebach. An improved randomized on-line algorithm for a weighted interval selection problem. Journal of Scheduling, 7(4):293–311, 2004.
- Gergely Neu and Gábor Bartók. Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. The Journal of Machine Learning Research, 17(1):5355–5375, 2016.
- Gergely Neu and Michal Valko. Online combinatorial optimization with stochastic decision sets and adversarial losses. In Advances in Neural Information Processing Systems, pages 2780–2788, 2014.
- Anshuka Rangi, Massimo Franceschetti, and Long Tran-Thanh. Unifying the stochastic and the adversarial bandits with knapsack. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 3311–3317, 2019.
- Long Tran-Thanh, Archie Chapman, Enrique Munoz de Cote, Alex Rogers, and Nicholas R Jennings. Epsilon–first policies for budget–limited multi-armed bandits. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
- Long Tran-Thanh, Archie Chapman, Alex Rogers, and Nicholas R Jennings. Knapsack based optimal policies for budget–limited multi–armed bandits. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
- Long Tran-Thanh, Sebastian Stein, Alex Rogers, and Nicholas R Jennings. Efficient crowdsourcing of unknown experts using bounded multi-armed bandits. Artificial Intelligence, 214:89–111, 2014.
- Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pages 1263–1291, 2018.
- Ge Yu and Sheldon H Jacobson. Approximation algorithms for scheduling c-benevolent jobs on weighted machines. IISE Transactions, 52(4):432–443, 2020.
