Choice Bandits

NIPS 2020(2020)

引用 14|浏览115
暂无评分
摘要
There has been much interest in recent years in the problem of dueling bandits, where on each round the learner plays a pair of arms and receives as feedback the outcome of a relative pairwise comparison between them. Here we study a natural generalization, that we term choice bandits , where the learner plays a set of up to k ≥ 2 arms, and receives limited relative feedback in the form of a single multiway choice among the pulled arms, drawn from an underlying multiway choice model. We study choice bandits under a very general class of choice models that is characterized by the existence of a unique 'best' arm (which we term generalized Condorcet winner), and includes as special cases the well-studied multinomial logit (MNL) and multinomial probit (MNP) choice models, and more generally, the class of random utility models with i.i.d. noise (IID-RUMs). We propose an algorithm for choice bandits, termed Winner Beats All (WBA), with a distribution dependent O (log T ) regret bound under all these choice models. The challenge in our setting is that the decision space is Θ( n k ), which is large for even moderate k. Our algorithm addresses this challenge by extracting just O ( n 2 ) statistics from multiway choices and exploiting the existence of a unique 'best' arm to find arms that are competitive to this arm in order to construct sets with low regret. Since these statistics are extracted from the same choice observations, one needs a careful martingale analysis in order to show that these statistics are concentrated. We complement our upper bound result with a lower bound result, which shows that our upper bound is order-wise optimal. Our experiments demonstrate that for the special case of k = 2, our algorithm is competitive with previous dueling bandit algorithms, and for the more general case k > 2, outperforms the recently proposed MaxMinUCB algorithm designed for the MNL model.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要