Combinatorial Pure Exploration for Dueling Bandit

ICML 2020, pp. 1531–1541.


Abstract:

In this paper, we study combinatorial pure exploration for dueling bandits (CPE-DB): we have multiple candidates for multiple positions as modeled by a bipartite graph, and in each round we sample a duel of two candidates on one position and observe who wins in the duel, with the goal of finding the best candidate-position matching with...
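
The sampling model in the abstract is concrete enough to simulate. Below is a minimal sketch of a CPE-DB environment, assuming Bernoulli duel outcomes; the class and method names are illustrative, not from the paper's code.

```python
import numpy as np

class CPEDBEnvironment:
    """Toy CPE-DB environment: candidates duel on positions.

    p[pos, i, j] is the (unknown to the learner) probability that
    candidate i beats candidate j when they duel on position pos.
    """

    def __init__(self, p, rng=None):
        self.p = np.asarray(p)                 # shape: (positions, cands, cands)
        self.rng = rng or np.random.default_rng()

    def duel(self, pos, i, j):
        """Sample one duel on position pos; return True iff i beats j."""
        return bool(self.rng.random() < self.p[pos, i, j])

# Example: 2 positions, 3 candidates. The learner repeatedly calls
# env.duel(...) and must output the best candidate-position matching.
p = np.full((2, 3, 3), 0.5)
p[0, 0, 1] = p[0, 0, 2] = 0.8                  # candidate 0 is strong on position 0
env = CPEDBEnvironment(p)
print(env.duel(0, 0, 1))
```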

Introduction
  • Multi-Armed Bandit (MAB) (Lai & Robbins, 1985; Thompson, 1933; Auer et al., 2002; Agrawal & Goyal, 2012) is a classic model that characterizes the exploration-exploitation tradeoff in online learning.
  • The pure exploration task (Even-Dar et al., 2006; Chen & Li, 2016; Sabato, 2019) is an important variant of the MAB problem, where the objective is to identify the best arm with high confidence using as few samples as possible.
  • In combinatorial pure exploration for multi-armed bandit (CPE-MAB) (Chen et al., 2014), the learner plays an arm and observes the random reward, with the objective of identifying the best combinatorial subset of arms; a minimal sketch of this setting follows the list.
  • Gabillon et al. (2016) and Chen et al. (2017) follow this setting and further improve the sample complexity.
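
The CPE-MAB setting above admits a compact fixed-confidence algorithm. The sketch below follows my reading of CLUCB (Chen et al., 2014): maintain empirical means and confidence radii, consult a maximization oracle twice per round, and sample the most uncertain arm on which the two answers disagree. The `pull`/`oracle` interface and the constant inside the confidence radius are assumptions, not the paper's exact pseudocode.

```python
import math
import numpy as np

def clucb(pull, oracle, n, delta, R=0.5, max_rounds=100_000):
    """Hedged sketch of CLUCB for CPE-MAB (Chen et al., 2014).

    pull(e)   -- sample a reward of arm e (assumed R-sub-Gaussian).
    oracle(w) -- boolean mask of the maximum-weight feasible subset for w.
    """
    T = np.zeros(n)                        # pulls per arm
    w = np.zeros(n)                        # empirical means
    for e in range(n):                     # initialization: pull each arm once
        w[e], T[e] = pull(e), 1
    for t in range(n + 1, max_rounds + 1):
        rad = R * np.sqrt(2 * math.log(4 * n * t**3 / delta) / T)
        M = oracle(w)                      # empirically best subset
        w_adj = np.where(M, w - rad, w + rad)   # pessimistic inside M, optimistic outside
        M_adj = oracle(w_adj)              # best subset under adjusted weights
        if w_adj[M_adj].sum() <= w_adj[M].sum():
            return M                       # adjusted weights cannot beat M: stop
        diff = M != M_adj                  # symmetric difference of the two subsets
        e = np.flatnonzero(diff)[np.argmax(rad[diff])]  # most uncertain disagreeing arm
        w[e] = (w[e] * T[e] + pull(e)) / (T[e] + 1)     # running-mean update
        T[e] += 1
    return oracle(w)

# Example: identify the top-2 of 5 Bernoulli arms (a top-k CPE-MAB instance).
rng = np.random.default_rng(0)
means = np.array([0.9, 0.8, 0.5, 0.3, 0.1])
pull = lambda e: float(rng.random() < means[e])
top2 = lambda w: np.isin(np.arange(5), np.argsort(w)[-2:])
print(np.flatnonzero(clucb(pull, top2, n=5, delta=0.05)))
```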
Highlights
  • Multi-Armed Bandit (MAB) (Lai & Robbins, 1985; Thompson, 1933; Auer et al., 2002; Agrawal & Goyal, 2012) is a classic model that characterizes the exploration-exploitation tradeoff in online learning.
  • For the Borda winner metric, we reduce CPE-DB to the original combinatorial pure exploration for multi-armed bandit (CPE-MAB) problem, and design algorithms CLUCB-Borda-PAC and CLUCB-Borda-Exact with polynomial running time per round (a toy version of this reduction is sketched after this list).
  • We provide their sample complexity upper bounds and a problem-dependent lower bound for CPE-DB with the Borda winner.
  • We present a reduction of CPE-DB with the Borda winner to the conventional CPE-MAB (Chen et al., 2014) problem.
  • We first introduce the efficient pure exploration part assuming access to an oracle that behaves as a black box, and we show the correctness and the sample complexity of CAR-Cond given this oracle.
  • We provide sample complexity upper and lower bounds for these algorithms.
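
The Borda reduction mentioned above can be illustrated in a few lines. The Borda score of candidate i on a position is B(i) = (1/(n-1)) * sum over j != i of P(i beats j), so dueling i against a uniformly random opponent yields a Bernoulli sample with mean exactly B(i), turning duels into ordinary arm pulls. The sketch below reuses the hypothetical CPEDBEnvironment and clucb from the earlier sketches; it conveys the shape of the reduction, not the paper's CLUCB-Borda pseudocode.

```python
import numpy as np

def make_borda_pull(env, pos, n_candidates, rng=None):
    """Wrap duels on one position as a CPE-MAB arm pull with mean B(i).

    B(i) = (1/(n-1)) * sum_{j != i} P(i beats j on position pos).
    env is the hypothetical CPEDBEnvironment from the abstract's sketch.
    """
    rng = rng or np.random.default_rng()
    def pull(i):
        # Uniformly random opponent j != i; the Bernoulli duel outcome
        # then has mean exactly the Borda score B(i).
        j = int(rng.choice([c for c in range(n_candidates) if c != i]))
        return float(env.duel(pos, i, j))
    return pull
```

Feeding such pulls, one per candidate-position pair, into a CPE-MAB routine like the clucb sketch above (with an oracle for maximum-weight bipartite matching) gives the flavor of the reduction behind CLUCB-Borda-PAC and CLUCB-Borda-Exact.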
Conclusion
  • In this paper, the authors formulate the combinatorial pure exploration for dueling bandit (CPE-DB) problem.
  • For the Borda winner, the authors first reduce the problem to CPE-MAB and propose efficient PAC and exact algorithms.
  • For a subclass of problems, the upper bound of the exact algorithm matches the lower bound up to logarithmic factors.
  • For the Condorcet winner, the authors first design an FPTAS for a properly extended offline problem and employ this FPTAS to design a novel online algorithm, CAR-Cond.
  • CAR-Cond is the first algorithm with polynomial running time per round for identifying the Condorcet winner in CPE-DB.
Related work
  • Combinatorial pure exploration. The combinatorial pure exploration for multi-armed bandit (CPE-MAB) problem was first formulated by Chen et al. (2014) and generalizes the multi-armed bandit pure exploration task to general combinatorial structures. Gabillon et al. (2016) follow the setting of Chen et al. (2014) and propose algorithms with improved sample complexity but a loss of computational efficiency. Chen et al. (2017) further design algorithms for this problem that have tighter sample complexity and pseudo-polynomial running time. Wu et al. (2015) study another combinatorial pure exploration setting in which, given a graph, the learner samples a path at each time step with the objective of identifying the optimal edge.

    Dueling bandit. The dueling bandit problem, first proposed by Yue et al. (2012) (see also Ramamohan et al., 2016; Sui et al., 2018), is an important variation of the multi-armed bandit setting. According to the assumptions on preference structures and the definitions of the optimal arm (winner), previous methods can be categorized into methods for the Condorcet winner (Komiyama et al., 2015; Xu et al., 2019), methods for the Borda winner (Jamieson et al., 2015; Xu et al., 2019), methods for the Copeland winner (Wu & Liu, 2016; Agrawal & Chaporkar, 2019), etc. Recently, Saha & Gopalan (2019) proposed a variant of combinatorial bandits with relative feedback: a learner plays a subset of arms (each with an unknown positive value) in a time step and observes ranking feedback, with the goal of minimizing cumulative regret. Their model is therefore quite different from ours.
Funding
  • The work of Yihan Du and Longbo Huang is supported in part by the National Natural Science Foundation of China Grant 61672316, the Zhongguancun Haihua Institute for Frontier Information Technology, and the Turing AI Institute of Nanjing.
References
  • Agrawal, N. and Chaporkar, P. KLUCB approach to Copeland bandits. arXiv preprint arXiv:1902.02778, 2019.
  • Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39.1–39.26, 2012.
  • Alwin, D. F. and Krosnick, J. A. The measurement of values in surveys: A comparison of ratings and rankings. Public Opinion Quarterly, 49(4):535–552, 1985.
  • Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
  • Ben-Akiva, M., Bradley, M., Morikawa, T., Benjamin, J., Novak, T., Oppewal, H., and Rao, V. Combining revealed and stated preferences data. Marketing Letters, 5(4):335–349, 1994.
  • Black, D. On the rationale of group decision-making. Journal of Political Economy, 56(1):23–34, 1948.
  • Bubeck, S., Wang, T., and Viswanathan, N. Multiple identifications in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning, pp. 258–265, 2013.
  • Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
  • Chen, B. and Frazier, P. I. Dueling bandits with weak regret. In Proceedings of the 34th International Conference on Machine Learning, pp. 731–739. JMLR.org, 2017.
  • Chen, L. and Li, J. On the optimal sample complexity for best arm identification. arXiv preprint arXiv:1511.03774, 2015.
  • Chen, L. and Li, J. Open problem: Best arm identification: Almost instance-wise optimality and the gap entropy conjecture. In Conference on Learning Theory, pp. 1643–1646, 2016.
  • Chen, L., Gupta, A., Li, J., Qiao, M., and Wang, R. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pp. 482–534, 2017.
  • Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pp. 379–387, 2014.
  • Copeland, A. H. A reasonable social welfare function. Technical report, University of Michigan, 1951.
  • Emerson, P. The original Borda count and partial voting. Social Choice and Welfare, 40(2):353–358, 2013.
  • Emerson, P. From Majority Rule to Inclusive Politics. Springer, 2016.
  • Even-Dar, E., Mannor, S., and Mansour, Y. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
  • Gabillon, V., Lazaric, A., Ghavamzadeh, M., Ortner, R., and Bartlett, P. Improved learning complexity in combinatorial pure exploration bandits. In Artificial Intelligence and Statistics, pp. 1004–1012, 2016.
  • Gehrlein, W. V. The Condorcet criterion and committee selection. Mathematical Social Sciences, 10(3):199–209, 1985.
  • Graepel, T. and Herbrich, R. Ranking and matchmaking. Game Developer Magazine, 25:34, 2006.
  • Jamieson, K., Katariya, S., Deshpande, A., and Nowak, R. Sparse dueling bandits. In Artificial Intelligence and Statistics, pp. 416–424, 2015.
  • Jerrum, M., Sinclair, A., and Vigoda, E. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. Journal of the ACM, 51(4):671–697, 2004.
  • Joachims, T., Granka, L., Pan, B., Hembrooke, H., and Gay, G. Accurately interpreting clickthrough data as implicit feedback. In ACM SIGIR Forum, volume 51, pp. 4–11. ACM, New York, NY, USA, 2017.
  • Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning, volume 12, pp. 655–662, 2012.
  • Karnin, Z. S. Verification based solution for structured MAB problems. In Advances in Neural Information Processing Systems, pp. 145–153, 2016.
  • Kaufmann, E., Cappé, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
  • Komiyama, J., Honda, J., Kashima, H., and Nakagawa, H. Regret lower bound and optimal algorithm in dueling bandit problem. In Conference on Learning Theory, pp. 1141–1154, 2015.
  • Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • McLean, I. The Borda and Condorcet principles: three medieval applications. Social Choice and Welfare, 7(2):99–108, 1990.
  • Radlinski, F., Kurup, M., and Joachims, T. How does clickthrough data reflect retrieval quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 43–52, 2008.
  • Ramamohan, S. Y., Rajkumar, A., and Agarwal, S. Dueling bandits: Beyond Condorcet winners to general tournament solutions. In Advances in Neural Information Processing Systems, pp. 1253–1261, 2016.
  • Saari, D. G. and Merlin, V. R. The Copeland method. Economic Theory, 8(1):51–76, 1996.
  • Sabato, S. Epsilon-best-arm identification in pay-per-reward multi-armed bandits. In Advances in Neural Information Processing Systems, pp. 2876–2886, 2019.
  • Saha, A. and Gopalan, A. Combinatorial bandits with relative feedback. In Advances in Neural Information Processing Systems, pp. 983–993, 2019.
  • Saip, H. B. and Lucchesi, C. L. Matching algorithms for bipartite graph. Relatorio Tecnico, 700(03), 1993.
  • Sui, Y., Zoghi, M., Hofmann, K., and Yue, Y. Advancements in dueling bandits. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 5502–5510, 2018.
  • Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • Wu, H. and Liu, X. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pp. 649–657, 2016.
  • Wu, Y., Gyorgy, A., and Szepesvari, C. On identifying good options under combinatorially structured feedback in finite noisy environments. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1283–1291, 2015.
  • Xu, L., Honda, J., and Sugiyama, M. Dueling bandits with qualitative feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5549–5556, 2019.
  • Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
  • Zoghi, M., Whiteson, S., Munos, R., and De Rijke, M. Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning, pp. 10–18, 2014.