# Combinatorial Pure Exploration for Dueling Bandit

ICML, pp. 1531-1541, 2020.

Abstract:

In this paper, we study combinatorial pure exploration for dueling bandits (CPE-DB): we have multiple candidates for multiple positions as modeled by a bipartite graph, and in each round we sample a duel of two candidates on one position and observe who wins in the duel, with the goal of finding the best candidate-position matching with...

Introduction

- Multi-Armed Bandit (MAB) (Lai & Robbins, 1985; Thompson, 1933; Auer et al., 2002; Agrawal & Goyal, 2012) is a classic model that characterizes the exploration-exploitation tradeoff in online learning.
- The pure exploration task (Even-Dar et al., 2006; Chen & Li, 2016; Sabato, 2019) is an important variant of the MAB problem, where the objective is to identify the best arm with high confidence, using as few samples as possible.
- In the combinatorial version of the task, the learner plays an arm and observes the random reward, with the objective of identifying the best combinatorial subset of arms. Gabillon et al. (2016) and Chen et al. (2017) follow this setting and further improve the sample complexity.

Highlights

- Multi-Armed Bandit (MAB) (Lai & Robbins, 1985; Thompson, 1933; Auer et al., 2002; Agrawal & Goyal, 2012) is a classic model that characterizes the exploration-exploitation tradeoff in online learning.
- For the Borda winner metric, we reduce combinatorial pure exploration for dueling bandit to the original combinatorial pure exploration for multi-armed bandit problem, and design algorithms CLUCB-Borda-PAC and CLUCB-Borda-Exact with polynomial running time per round.
- We provide their sample complexity upper bounds and a problem-dependent lower bound for combinatorial pure exploration for dueling bandit with the Borda winner.
- We present a reduction of combinatorial pure exploration for dueling bandit for the Borda winner to the conventional combinatorial pure exploration for multi-armed bandit (Chen et al., 2014) problem.
- We first introduce the efficient pure exploration part assuming there exists "an oracle" that performs like a black box, and we show the correctness and the sample complexity of CAR-Cond given the oracle.
- We provide sample complexity upper and lower bounds for these algorithms.
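
The Borda reduction behind CLUCB-Borda can be sketched in a few lines (an illustrative simulation with made-up numbers, not the authors' code): a candidate's Borda score is its average probability of winning a duel against a uniformly random opponent, so each duel against a random opponent yields an unbiased Bernoulli sample of that score, turning every candidate into an ordinary MAB arm:

```python
import random

# Hypothetical 4-candidate preference matrix: P[i][j] = Pr(candidate i beats j).
P = [
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.6, 0.7],
    [0.3, 0.4, 0.5, 0.6],
    [0.2, 0.3, 0.4, 0.5],
]
K = len(P)

def borda_score(i):
    """True Borda score: average probability of beating a random other candidate."""
    return sum(P[i][j] for j in range(K) if j != i) / (K - 1)

def sample_borda(i, rng):
    """Duel candidate i against a uniformly random opponent; returns 0 or 1.
    The outcome is Bernoulli with mean exactly borda_score(i)."""
    j = rng.choice([x for x in range(K) if x != i])
    return 1 if rng.random() < P[i][j] else 0

rng = random.Random(0)
n = 20000
estimates = [sum(sample_borda(i, rng) for _ in range(n)) / n for i in range(K)]
print([round(borda_score(i), 3) for i in range(K)])  # true Borda scores
print([round(e, 3) for e in estimates])              # empirical estimates
```

CLUCB-Borda feeds such Bernoulli samples into a standard CPE-MAB routine, which is what makes the reduction work.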

Conclusion

**Conclusion and Future Work**

In this paper, the authors formulate the combinatorial pure exploration for dueling bandit (CPE-DB) problem.

- For the Borda winner, the authors first reduce the problem to CPE-MAB and propose efficient PAC and exact algorithms.
- For a subclass of problems, the upper bound of the exact algorithm matches the lower bound up to logarithmic factors.
- For the Condorcet winner, the authors first design an FPTAS for a properly extended offline problem, and employ this FPTAS to design a novel online algorithm, CAR-Cond.
- CAR-Cond is the first algorithm with polynomial running time per round for identifying the Condorcet winner in CPE-DB.
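
To make the two winner notions concrete, a toy example (hypothetical numbers, not from the paper) shows that the Condorcet winner, which must beat every rival with probability above 1/2, can differ from the Borda winner, which maximizes the average win probability:

```python
# Hypothetical 3-candidate preference matrix: P[i][j] = Pr(candidate i beats j).
P = [
    [0.50, 0.51, 0.51],  # candidate 0 narrowly beats everyone
    [0.49, 0.50, 0.90],  # candidate 1 loses to 0 but crushes 2
    [0.49, 0.10, 0.50],
]

def condorcet_winner(P):
    """Index of the candidate beating every rival with probability > 1/2, or None."""
    K = len(P)
    for i in range(K):
        if all(P[i][j] > 0.5 for j in range(K) if j != i):
            return i
    return None

def borda_winner(P):
    """Index of the candidate with the highest average win probability."""
    K = len(P)
    return max(range(K), key=lambda i: sum(P[i][j] for j in range(K) if j != i))

print(condorcet_winner(P))  # 0
print(borda_winner(P))      # 1, since (0.49 + 0.90) / 2 > (0.51 + 0.51) / 2
```

In CPE-DB the same contrast appears at the candidate-position matching level, which is why the two metrics call for different algorithmic techniques.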


Related work

- Combinatorial pure exploration The combinatorial pure exploration for multi-armed bandit (CPE-MAB) problem was first formulated by Chen et al. (2014) and generalizes the multi-armed bandit pure exploration task to general combinatorial structures. Gabillon et al. (2016) follow the setting of Chen et al. (2014) and propose algorithms with improved sample complexity but a loss of computational efficiency. Chen et al. (2017) further design algorithms for this problem with tighter sample complexity and pseudo-polynomial running time. Wu et al. (2015) study another combinatorial pure exploration case in which, given a graph, at each time step a learner samples a path with the objective of identifying the optimal edge.
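
For intuition, the CLUCB algorithm of Chen et al. (2014) can be sketched for the simplest structure, identifying the best k of K arms (a simplified, illustrative Python sketch, not the authors' implementation): maintain confidence radii around the empirical means, re-run the oracle on means perturbed against its current answer, and stop once no within-radius perturbation changes that answer:

```python
import math
import random

rng = random.Random(1)
means = [0.9, 0.8, 0.4, 0.3, 0.2]  # hidden Bernoulli arm means (illustrative)
K, k = len(means), 2               # goal: identify the best subset of size k

def oracle(w):
    """Maximization oracle for the top-k structure: the k heaviest arms."""
    return frozenset(sorted(range(K), key=lambda i: -w[i])[:k])

def pull(i):
    return 1 if rng.random() < means[i] else 0

n = [1] * K                        # number of pulls per arm
s = [pull(i) for i in range(K)]    # reward sums after one initial pull each
t = K
while True:
    t += 1
    mu = [s[i] / n[i] for i in range(K)]
    rad = [math.sqrt(2 * math.log(t) / n[i]) for i in range(K)]
    M = oracle(mu)
    # Perturb: penalize arms inside M, boost arms outside M by their radius.
    adj = [mu[i] - rad[i] if i in M else mu[i] + rad[i] for i in range(K)]
    Mt = oracle(adj)
    if Mt == M:
        break                      # no plausible perturbation changes the answer
    # Otherwise pull the most uncertain arm on which the two solutions disagree.
    p = max(M ^ Mt, key=lambda i: rad[i])
    s[p] += pull(p)
    n[p] += 1

print(sorted(M))                   # identified subset
```

The later algorithms of Gabillon et al. (2016) and Chen et al. (2017) keep this pull-until-the-oracle-is-stable skeleton while tightening the confidence analysis.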

Dueling bandit The dueling bandit problem (Yue et al., 2012; Ramamohan et al., 2016; Sui et al., 2018), first proposed by Yue et al. (2012), is an important variation of the multi-armed bandit setting. According to the assumptions on preference structures and the definitions of the optimal arm (winner), previous methods can be categorized into methods for the Condorcet winner (Komiyama et al., 2015; Xu et al., 2019), methods for the Borda winner (Jamieson et al., 2015; Xu et al., 2019), methods for the Copeland winner (Wu & Liu, 2016; Agrawal & Chaporkar, 2019), etc. Recently, Saha & Gopalan (2019) proposed a variant of combinatorial bandits with relative feedback. In their setting, a learner plays a subset of arms (each with an unknown positive value) in each time step and observes ranking feedback, and the goal is to minimize the cumulative regret. Their model is therefore quite different from ours.

Funding

- The work of Yihan Du and Longbo Huang is supported in part by the National Natural Science Foundation of China Grant 61672316, the Zhongguancun Haihua Institute for Frontier Information Technology and the Turing AI Institute of Nanjing

Reference

- Agrawal, N. and Chaporkar, P. KLUCB approach to Copeland bandits. arXiv preprint arXiv:1902.02778, 2019.
- Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39–1, 2012.
- Alwin, D. F. and Krosnick, J. A. The measurement of values in surveys: A comparison of ratings and rankings. Public Opinion Quarterly, 49(4):535–552, 1985.
- Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- Ben-Akiva, M., Bradley, M., Morikawa, T., Benjamin, J., Novak, T., Oppewal, H., and Rao, V. Combining revealed and stated preferences data. Marketing Letters, 5(4):335–349, 1994.
- Black, D. On the rationale of group decision-making. Journal of Political Economy, 56(1):23–34, 1948.
- Bubeck, S., Wang, T., and Viswanathan, N. Multiple identifications in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning, pp. 258–265, 2013.
- Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
- Chen, B. and Frazier, P. I. Dueling bandits with weak regret. In Proceedings of the 34th International Conference on Machine Learning, pp. 731–739. JMLR.org, 2017.
- Chen, L. and Li, J. On the optimal sample complexity for best arm identification. arXiv preprint arXiv:1511.03774, 2015.
- Chen, L. and Li, J. Open problem: Best arm identification: Almost instance-wise optimality and the gap entropy conjecture. In Conference on Learning Theory, pp. 1643–1646, 2016.
- Chen, L., Gupta, A., Li, J., Qiao, M., and Wang, R. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pp. 482–534, 2017.
- Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pp. 379–387, 2014.
- Copeland, A. H. A reasonable social welfare function. Technical report, mimeo, University of Michigan, 1951.
- Emerson, P. The original borda count and partial voting. Social Choice and Welfare, 40(2):353–358, 2013.
- Emerson, P. From Majority Rule to Inclusive Politics. Springer, 2016.
- Even-Dar, E., Mannor, S., and Mansour, Y. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
- Gabillon, V., Lazaric, A., Ghavamzadeh, M., Ortner, R., and Bartlett, P. Improved learning complexity in combinatorial pure exploration bandits. In Artificial Intelligence and Statistics, pp. 1004–1012, 2016.
- Gehrlein, W. V. The condorcet criterion and committee selection. Mathematical Social Sciences, 10(3):199–209, 1985.
- Graepel, T. and Herbrich, R. Ranking and matchmaking. Game Developer Magazine, 25:34, 2006.
- Jamieson, K., Katariya, S., Deshpande, A., and Nowak, R. Sparse dueling bandits. In Artificial Intelligence and Statistics, pp. 416–424, 2015.
- Jerrum, M., Sinclair, A., and Vigoda, E. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. Journal of the ACM (JACM), 51(4):671–697, 2004.
- Joachims, T., Granka, L., Pan, B., Hembrooke, H., and Gay, G. Accurately interpreting clickthrough data as implicit feedback. In ACM SIGIR Forum, volume 51, pp. 4–11. ACM, New York, NY, USA, 2017.
- Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. Pac subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning, volume 12, pp. 655–662, 2012.
- Karnin, Z. S. Verification based solution for structured mab problems. In Advances in Neural Information Processing Systems, pp. 145–153, 2016.
- Kaufmann, E., Cappe, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
- Komiyama, J., Honda, J., Kashima, H., and Nakagawa, H. Regret lower bound and optimal algorithm in dueling bandit problem. In Conference on Learning Theory, pp. 1141–1154, 2015.
- Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- McLean, I. The borda and condorcet principles: three medieval applications. Social Choice and Welfare, 7(2):99–108, 1990.
- Radlinski, F., Kurup, M., and Joachims, T. How does clickthrough data reflect retrieval quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 43–52, 2008.
- Ramamohan, S. Y., Rajkumar, A., and Agarwal, S. Dueling bandits: Beyond condorcet winners to general tournament solutions. In Advances in Neural Information Processing Systems, pp. 1253–1261, 2016.
- Saari, D. G. and Merlin, V. R. The copeland method. Economic Theory, 8(1):51–76, 1996.
- Sabato, S. Epsilon-best-arm identification in pay-per-reward multi-armed bandits. In Advances in Neural Information Processing Systems, pp. 2876–2886, 2019.
- Saha, A. and Gopalan, A. Combinatorial bandits with relative feedback. In Advances in Neural Information Processing Systems, pp. 983–993, 2019.
- Saip, H. B. and Lucchesi, C. L. Matching algorithms for bipartite graph. Relatorio Tecnico, 700(03), 1993.
- Sui, Y., Zoghi, M., Hofmann, K., and Yue, Y. Advancements in dueling bandits. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 5502–5510, 2018.
- Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- Wu, H. and Liu, X. Double thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pp. 649–657, 2016.
- Wu, Y., Gyorgy, A., and Szepesvari, C. On identifying good options under combinatorially structured feedback in finite noisy environments. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1283–1291, 2015.
- Xu, L., Honda, J., and Sugiyama, M. Dueling bandits with qualitative feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5549– 5556, 2019.
- Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
- Zoghi, M., Whiteson, S., Munos, R., and De Rijke, M. Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning, pp. II–10, 2014.
