Towards Minimax Optimal Reward-free Reinforcement Learning in Linear MDPs

ICLR 2023(2023)

引用 1|浏览33
暂无评分
摘要
We study reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, an agent first interacts with the environment without accessing the reward function in the exploration phase. In the subsequent planning phase, it is given a reward function and asked to output an $\epsilon$-optimal policy. We propose a novel algorithm LSVI-RFE under the linear MDP setting, where the transition probability and reward functions are linear in a feature mapping. We prove an $\widetilde{O}(H^{4} d^{2}/\epsilon^2)$ sample complexity upper bound for LSVI-RFE, where $H$ is the episode length and $d$ is the feature dimension. We also establish a sample complexity lower bound of $\Omega(H^{3} d^{2}/\epsilon^2)$. To the best of our knowledge, LSVI-RFE is the first computationally efficient algorithm that achieves the minimax optimal sample complexity in linear MDP settings up to an $H$ and logarithmic factors. Our LSVI-RFE algorithm is based on a novel variance-aware exploration mechanism to avoid overly-conservative exploration in prior works. Our sharp bound relies on the decoupling of UCB bonuses during two phases, and a Bernstein-type self-normalized bound, which remove the extra dependency of sample complexity on $H$ and $d$, respectively.
更多
查看译文
关键词
Reinforce Learning,Reward-Free Exploration,Linear MDPs,Learning Theory
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要