Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
NeurIPS 2023
Abstract
While numerous works have focused on devising efficient algorithms for
reinforcement learning (RL) with uniformly bounded rewards, it remains an open
question whether sample- or time-efficient algorithms exist for RL with large
state-action spaces when the rewards are heavy-tailed, i.e., admit only finite
$(1+\epsilon)$-th moments for some $\epsilon \in (0,1]$. In this work, we
address the challenge of such rewards in RL with linear function approximation.
We first design an algorithm, Heavy-OFUL, for heavy-tailed linear bandits,
achieving an instance-dependent $T$-round regret of
$\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$,
the first of its kind. Here, $d$ is the feature dimension, and
$\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at
the $t$-th round. We further show that the above bound is minimax optimal when
applied to the worst-case instances in stochastic and deterministic linear
bandits. We then extend this algorithm to the RL setting with linear function
approximation. Our algorithm, termed Heavy-LSVI-UCB, achieves the first
computationally efficient, instance-dependent $K$-episode regret of
$\tilde{O}\big(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K}\big)$.
Here, $H$ is the length of an episode, and $\mathcal{U}^*$, $\mathcal{V}^*$ are
instance-dependent quantities scaling with the central moments of the reward
and value functions, respectively. We also provide a matching minimax lower
bound of $\Omega\big(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K}\big)$ to
demonstrate the optimality of our algorithm in the worst
case. Our result is achieved via a novel robust self-normalized concentration
inequality that may be of independent interest in handling heavy-tailed noise
in general online regression problems.
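To make the heavy-tailed condition above concrete, here is a minimal sketch (our illustration, not from the paper; the distribution and every parameter are assumptions for exposition) of a reward whose $(1+\epsilon)$-th central moment is finite while its variance is infinite:

```python
import numpy as np

# Illustrative only: a heavy-tailed reward with finite (1+eps)-th central
# moment but infinite variance. A Pareto-II (Lomax) tail with index
# alpha = 1.6 has E|x|^p < inf iff p < 1.6, so taking eps = 0.5 puts us
# exactly in the regime the abstract describes. All names and parameters
# here are our own choices, not from the paper.
rng = np.random.default_rng(0)
eps = 0.5      # rewards have a finite (1 + eps)-th central moment
alpha = 1.6    # Pareto tail index; must exceed 1 + eps

def sample_rewards(n):
    """Base reward of 1.0 plus zero-mean symmetrized Pareto noise."""
    signs = rng.choice([-1.0, 1.0], size=n)
    noise = signs * rng.pareto(alpha, size=n)  # E[noise] = 0 by symmetry
    return 1.0 + noise

r = sample_rewards(1_000_000)
# The (1+eps)-th empirical central moment stabilizes as n grows ...
print("empirical (1+eps)-th central moment:", np.mean(np.abs(r - 1.0) ** (1 + eps)))
# ... while the empirical variance keeps drifting upward (no 2nd moment).
print("empirical variance (does not converge):", r.var())
```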
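The closing claim about heavy-tailed noise in online regression can likewise be illustrated. The sketch below is again ours: the Huber loss, the ridge regularizer, and all parameters are assumptions for exposition, not the paper's estimator or its self-normalized concentration analysis. It shows why a bounded loss gradient helps when residuals are heavy-tailed:

```python
import numpy as np

# A minimal sketch of robust regression under heavy-tailed noise: replace
# the squared loss with the Huber loss, whose clipped gradient caps the
# influence of any single extreme residual. lam, tau, and the gradient
# schedule are arbitrary choices for exposition.

def huber_grad(r, tau):
    """Gradient of the Huber loss: linear for |r| <= tau, clipped beyond."""
    return np.clip(r, -tau, tau)

def robust_fit(X, y, lam=1.0, tau=1.0, lr=0.1, n_iter=500):
    """Ridge-regularized Huber regression via plain gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        resid = y - X @ theta
        grad = -X.T @ huber_grad(resid, tau) / len(y) + lam * theta / len(y)
        theta -= lr * grad
    return theta

# Heavy-tailed noise: Student-t with df = 1.5 has finite (1+eps)-th
# moments only for eps < 0.5 and infinite variance, so ordinary least
# squares is dragged around by outliers while the Huber fit stays close
# to the true parameter.
rng = np.random.default_rng(1)
d, n = 5, 2000
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_star + rng.standard_t(df=1.5, size=n)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
hub = robust_fit(X, y)
print("OLS error:  ", np.linalg.norm(ols - theta_star))
print("Huber error:", np.linalg.norm(hub - theta_star))
```

Because the gradient is clipped at $\tau$, one extreme observation moves the estimate by at most a bounded amount, which is the intuition behind obtaining concentration from only finite $(1+\epsilon)$-th moments.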