Stochastic Multi-armed Bandits in Constant Space
INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 84(2018)
摘要
We consider the stochastic bandit problem in the sublinear space setting, where one cannot record the win-loss record for all $K$ arms. We give an algorithm using $O(1)$ words of space with regret \[ \sum_{i=1}^{K}\frac{1}{\Delta_i}\log \frac{\Delta_i}{\Delta}\log T \] where $\Delta_i$ is the gap between the best arm and arm $i$ and $\Delta$ is the gap between the best and the second-best arms. If the rewards are bounded away from $0$ and $1$, this is within an $O(\log 1/\Delta)$ factor of the optimum regret possible without space constraints.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络