Settling Constant Regrets in Linear Markov Decision Processes
arXiv (2024)

Abstract
We study constant regret guarantees in reinforcement learning (RL). Our
objective is to design an algorithm that incurs only finite regret over
infinitely many episodes with high probability. We introduce an algorithm,
Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs) where
both the transition kernel and the reward function can be approximated by some
linear function up to misspecification level ζ. At the core of
Cert-LSVI-UCB is an innovative certified estimator, which facilitates a
fine-grained concentration analysis for multi-phase value-targeted regression,
enabling us to establish an instance-dependent regret bound that is constant
w.r.t. the number of episodes. Specifically, we demonstrate that for an MDP
characterized by a minimal suboptimality gap Δ, Cert-LSVI-UCB has a
cumulative regret of 𝒪̃(d^3H^5/Δ) with high
probability, provided that the misspecification level ζ is below
𝒪̃(Δ / (√(d)H^2)). Remarkably, this regret bound
remains constant relative to the number of episodes K. To the best of our
knowledge, Cert-LSVI-UCB is the first algorithm to achieve a constant,
instance-dependent, high-probability regret bound in RL with linear function
approximation for infinite runs without relying on prior distribution
assumptions. This not only highlights the robustness of Cert-LSVI-UCB to model
misspecification but also introduces novel algorithmic designs and analytical
techniques of independent interest.
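For reference, the guarantee stated in the abstract can be written compactly as follows (a restatement of the claim above, not a new result; interpreting d as the feature dimension and H as the episode horizon, as is standard for linear MDPs):

```latex
% Restatement of the abstract's regret guarantee for Cert-LSVI-UCB.
% \tilde{\mathcal{O}} hides logarithmic factors; exact constants are in the paper.
\[
  \mathrm{Regret}(K)
  \;=\;
  \tilde{\mathcal{O}}\!\left(\frac{d^{3} H^{5}}{\Delta}\right)
  \quad\text{with high probability,}
  \qquad\text{whenever}\quad
  \zeta \;\le\; \tilde{\mathcal{O}}\!\left(\frac{\Delta}{\sqrt{d}\,H^{2}}\right),
\]
% The bound does not grow with the number of episodes K, i.e. it is constant in K.
```

Note that the right-hand side involves only the dimension d, the horizon H, and the suboptimality gap Δ, which is what makes the bound instance-dependent and independent of K.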