An epsilon-Greedy Multiarmed Bandit Approach to Markov Decision Processes

Isa Muqattash, Jiaqiao Hu

Stats (2023)

Abstract
We present REGA, a new adaptive-sampling-based algorithm for the control of finite-horizon Markov decision processes (MDPs) with very large state spaces and small action spaces. We apply a variant of the ε-greedy multiarmed bandit algorithm to each stage of the MDP in a recursive manner, thus computing an estimate of the "reward-to-go" value at each stage of the MDP. We provide a finite-time analysis of REGA. In particular, we provide a bound on the probability that the approximation error exceeds a given threshold, where the bound is given in terms of the number of samples collected at each stage of the MDP. We empirically compare REGA against another sampling-based algorithm called RASA by running simulations on the SysAdmin benchmark problem with 2^10 states. The results show that REGA and RASA achieve similar performance. Moreover, REGA and RASA empirically outperform an implementation based on the "original" ε-greedy algorithm that commonly appears in the literature.
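The abstract builds on the standard ε-greedy multiarmed bandit rule, which REGA applies in a modified, recursive form. As context, a minimal sketch of the standard (non-recursive) ε-greedy bandit with incremental mean estimates is shown below; all names and parameters here are illustrative, and this is not the REGA algorithm itself.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick an arm: explore uniformly with probability epsilon,
    otherwise exploit the arm with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def run_bandit(arm_rewards, n_pulls=1000, epsilon=0.1, seed=0):
    """Run epsilon-greedy over a list of reward-sampling callables
    (or constants), returning estimated values and pull counts."""
    random.seed(seed)
    n_arms = len(arm_rewards)
    q = [0.0] * n_arms      # running mean reward per arm
    counts = [0] * n_arms   # pulls per arm
    for _ in range(n_pulls):
        a = epsilon_greedy(q, epsilon)
        r = arm_rewards[a]() if callable(arm_rewards[a]) else arm_rewards[a]
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]  # incremental mean update
    return q, counts
```

In the recursive setting the paper describes, the "reward" of pulling an action-arm at one MDP stage would itself be estimated by running a bandit at the next stage, yielding the stage-wise reward-to-go estimates.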
Keywords
multiarmed bandits,epsilon-greedy method,Markov decision process (MDP),sampling,optimization under uncertainties