Multi-Armed Bandits with Non-Stationary Rewards
arXiv (Cornell University) (2017)
Abstract
The multi-armed bandit problem where the rewards are realizations of general non-stationary stochastic processes is a challenging setting which has not been previously tackled in the bandit literature in its full generality. We present the first theoretical analysis of this problem by deriving guarantees for both the path-dependent dynamic pseudo-regret and the standard pseudo-regret that, remarkably, are both logarithmic in the number of rounds under certain natural conditions. We describe several UCB-type algorithms based on the notion of weighted discrepancy, a key measure of non-stationarity for stochastic processes. We show that discrepancy provides a unified framework for the analysis of non-stationary rewards. Our experiments demonstrate a significant improvement in practice compared to standard benchmarks.
Keywords
multi-armed, non-stationary
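Illustrative sketch
The abstract refers to UCB-type algorithms for non-stationary rewards but does not spell out the discrepancy-based index. The sketch below is therefore not the paper's method: it is a generic sliding-window UCB, a common baseline for drifting rewards, shown only to make the setting concrete. The class name, the window length, the exploration constant c, and the two-arm Bernoulli environment whose means swap halfway through are all assumptions made for illustration.

```python
import math
import random
from collections import deque


class SlidingWindowUCB:
    """UCB index computed from the most recent `window` pulls only.

    Illustrative baseline; NOT the weighted-discrepancy algorithm of the paper.
    """

    def __init__(self, n_arms, window, c=2.0):
        self.n_arms = n_arms
        self.c = c  # exploration constant (assumed value)
        # Oldest (arm, reward) pairs fall out automatically once full.
        self.history = deque(maxlen=window)

    def select_arm(self):
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Play any arm with no sample inside the current window.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm
        horizon = len(self.history)

        def index(arm):
            mean = sums[arm] / counts[arm]
            bonus = math.sqrt(self.c * math.log(horizon) / counts[arm])
            return mean + bonus

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        self.history.append((arm, reward))


def simulate(rounds=2000, window=200, seed=0):
    """Two Bernoulli arms whose means swap at the midpoint (assumed toy setup).

    Returns how often each arm was pulled after the swap.
    """
    rng = random.Random(seed)
    policy = SlidingWindowUCB(n_arms=2, window=window)
    pulls_after_swap = [0, 0]
    for t in range(rounds):
        means = (0.8, 0.2) if t < rounds // 2 else (0.2, 0.8)
        arm = policy.select_arm()
        reward = 1.0 if rng.random() < means[arm] else 0.0
        policy.update(arm, reward)
        if t >= rounds // 2:
            pulls_after_swap[arm] += 1
    return pulls_after_swap


if __name__ == "__main__":
    print(simulate())
```

Forgetting old samples is the simplest way to track a changing mean; the paper instead describes indices built on weighted discrepancy, which this sketch does not attempt to reproduce.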