CS/CNS/EE 253: Advanced Topics in Machine Learning
Topic: Dealing with Partial Feedback #1

(2013)

Abstract
We consider the multi-armed bandit problem: each arm $i$ has an associated reward function $r_i(t)$, which is unknown. In each round $t$, an arm $i$ is chosen and the reward $r_i(t) \in (0, 1)$ is gained. Only $r_i(t)$ is revealed to the algorithm at the end of round $t$, where $i$ is the arm chosen in that round; it is kept ignorant of $r_j(t)$ for all other arms $j \neq i$. The goal is to find an algorithm specifying how to choose an arm in each round so as to maximize the total reward over all rounds.

We began our study of this model with an assumption of stochastic rewards, as opposed to the harder adversarial-rewards case. Thus we assume there is an underlying distribution $\mathcal{R}_i$ for each arm $i$, and each $r_i(t)$ is drawn from $\mathcal{R}_i$ independently of all other rewards (both of arm $i$ during rounds other than $t$, and of other arms during round $t$). Note that we assume the rewards are bounded; specifically, $r_i(t) \in (0, 1)$ for all $i$ and $t$.

We first explored the $\epsilon_t$-greedy algorithm, in which with probability $\epsilon_t$ an arm is chosen uniformly at random, and with probability $1 - \epsilon_t$ the arm with the highest observed average reward is chosen. For the right choice of $\epsilon_t$, this algorithm has expected regret logarithmic in $T$.
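Below is a minimal sketch of the $\epsilon_t$-greedy strategy just described. The notes only say "for the right choice of $\epsilon_t$"; the decaying schedule $\epsilon_t = \min(1, cK/t)$ used here is one standard choice (and is the kind of schedule that yields logarithmic regret), and the function name, parameter `c`, and the example arms are illustrative assumptions, not from the notes.

```python
import random

def epsilon_t_greedy(arms, T, c=5.0):
    """Sketch of the epsilon_t-greedy strategy for stochastic bandits.

    `arms` is a list of callables, each returning a stochastic reward
    in (0, 1) when pulled.  `c` tunes the assumed exploration schedule
    epsilon_t = min(1, c * K / t).
    """
    K = len(arms)
    counts = [0] * K    # number of pulls of each arm
    means = [0.0] * K   # observed average reward of each arm
    total = 0.0
    for t in range(1, T + 1):
        eps = min(1.0, c * K / t)  # decaying exploration probability
        if random.random() < eps:
            i = random.randrange(K)                    # explore: uniform arm
        else:
            i = max(range(K), key=lambda j: means[j])  # exploit: best mean
        r = arms[i]()            # only r_i(t) for the chosen arm is observed
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # incremental mean update
        total += r
    return total

# Example: two arms whose rewards are drawn i.i.d. from Beta
# distributions, so each reward lies in (0, 1) as the model assumes.
arms = [lambda: random.betavariate(2, 5), lambda: random.betavariate(5, 2)]
print(epsilon_t_greedy(arms, T=10_000))
```

The incremental mean update avoids storing past rewards: after the $n$-th pull of arm $i$, `means[i]` equals the average of the $n$ observed rewards, which is all the exploit step needs.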