The max K-armed bandit: a new model of exploration applied to search heuristic selection

AAAI(2005)

引用 101|浏览401
暂无评分
摘要
The multiarmed bandit is often used as an analogy for the tradeoff between exploration and exploitation in search problems. The classic problem involves allocating trials to the arms of a multiarmed slot machine to maximize the expected sum of rewards. We pose a new variation of the multiarmed bandit--the Max K-Armed Bandit--in which trials must be allocated among the arms to maximize the expected best single sample reward of the series of trials. Motivation for the Max K-Armed Bandit is the allocation of restarts among a set of multistart stochastic search algorithms. We present an analysis of this Max K-Armed Bandit showing under certain assumptions that the optimal strategy allocates trials to the observed best arm at a rate increasing double exponentially relative to the other arms. This motivates an exploration strategy that follows a Boltzmann distribution with an exponentially decaying temperature parameter. We compare this exploration policy to policies that allocate trials to the observed best arm at rates faster (and slower) than double exponentially. The results confirm, for two scheduling domains, that the double exponential increase in the rate of allocations to the observed best heuristic outperfonns the other approaches.
更多
查看译文
关键词
exploration policy,heuristic selection,multiarmed slot machine,max k-armed bandit,new model,multiarmed bandit,observed best arm,exponentially decaying temperature parameter,exploration strategy,observed best heuristic,double exponentially,double exponential increase,boltzmann distribution
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要