Multi-Armed Bandits In Multi-Agent Networks

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract
This paper addresses the multi-armed bandit problem in a multi-player framework. Players explore a finite set of arms with stochastic rewards, and the reward distribution of each arm is player-dependent. The goal is to find the best global arm, i.e., the one with the largest expected reward when averaged over players. To achieve this goal, we develop a distributed variant of the well-known UCB1 algorithm. Confined to a network structure, players exchange information locally to estimate the global rewards, while using a confidence bound that depends on the network characteristics. At each round, each player then votes for an arm, and the majority vote is played as the network action. The whole network collects the reward of the network action, with the aim of maximizing global welfare. The performance of the algorithm is measured via the notion of network regret. We prove that the regret scales logarithmically with the time horizon and inversely with the spectral gap of the network. Our algorithm is optimal in the sense that in a complete network it scales down the regret of its single-player counterpart by the network size. We present numerical experiments that verify our theoretical results.
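To make the abstract's description concrete, the following is a minimal sketch of a distributed UCB1-style bandit with consensus averaging and majority voting. It is not the paper's exact algorithm: the mixing matrix W, the ring topology, the incremental update, and the exploration constant (a stand-in for the paper's network-dependent confidence bound) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, n_arms, horizon = 4, 3, 2000

# Player-dependent reward means; the "global" best arm maximizes the
# column average, mirroring the setup in the abstract.
means = rng.uniform(0, 1, size=(n_players, n_arms))

# Ring network with a doubly stochastic mixing matrix W (assumed topology).
W = np.zeros((n_players, n_players))
for i in range(n_players):
    W[i, i] = 0.5
    W[i, (i - 1) % n_players] = 0.25
    W[i, (i + 1) % n_players] = 0.25

est = np.zeros((n_players, n_arms))  # per-player estimates of global means
counts = np.zeros(n_arms)            # network-level pull counts
total_reward = 0.0

for t in range(1, horizon + 1):
    if t <= n_arms:                  # initialization: pull each arm once
        arm = t - 1
    else:
        # Each player votes for the arm maximizing its UCB index; the
        # sqrt(2 log t / n) term is the classical UCB1 bonus, used here
        # in place of the paper's network-dependent confidence bound.
        ucb = est + np.sqrt(2.0 * np.log(t) / counts)
        votes = np.argmax(ucb, axis=1)
        arm = np.bincount(votes, minlength=n_arms).argmax()  # majority vote

    rewards = rng.binomial(1, means[:, arm])  # player-dependent rewards
    counts[arm] += 1
    # Local update toward each player's own observation...
    est[:, arm] += (rewards - est[:, arm]) / counts[arm]
    # ...followed by one consensus (gossip) step over the network, so each
    # player's estimate drifts toward the global average reward.
    est = W @ est
    total_reward += rewards.mean()

best_global = means.mean(axis=0).max()
print(f"network regret ≈ {horizon * best_global - total_reward:.1f}")
```

Under this setup, the regret printed at the end grows roughly logarithmically in the horizon, consistent with the bound stated in the abstract; a denser network (larger spectral gap) speeds up the consensus step and lowers the regret.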
Keywords
Sequential decision-making, multi-armed bandits, multi-agent networks, distributed learning