Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits

ICML (2018)

Abstract
Regret bounds in online learning compare the player's performance to L*, the optimal performance in hindsight with a fixed strategy. Typically such bounds scale with the square root of the time horizon T. The more refined concept of a first-order regret bound replaces this with a scaling of √L*, which may be much smaller than √T. It is well known that minor variants of standard algorithms satisfy first-order regret bounds in the full-information and multi-armed bandit settings. In a COLT 2017 open problem (Agarwal et al., 2017), Agarwal, Krishnamurthy, Langford, Luo, and Schapire raised the issue that existing techniques do not seem sufficient to obtain first-order regret bounds for the contextual bandit problem. In the present paper, we resolve this open problem by presenting a new strategy based on augmenting the policy space.
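For reference, the quantities named in the abstract can be written out as follows. This is a minimal sketch using standard contextual-bandit notation (rounds t = 1, ..., T, contexts x_t, chosen actions a_t, losses in [0, 1], policy class Π); the paper itself may use different symbols.

```latex
% Standard definitions (assumed notation, not quoted from the paper):
% L* is the cumulative loss of the best fixed policy in hindsight,
% and R_T is the player's regret against that benchmark.
\[
  L^{*} \;=\; \min_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t\bigl(\pi(x_t)\bigr),
  \qquad
  R_T \;=\; \sum_{t=1}^{T} \ell_t(a_t) \;-\; L^{*}.
\]
```

A √T-type guarantee holds in the worst case regardless of how good the best policy is, whereas a first-order bound replaces the √T factor with √L*, so the guarantee tightens whenever the benchmark L* is small.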