Off-Policy Evaluation in Markov Decision Processes under Weak Distributional Overlap
CoRR(2024)
摘要
Doubly robust methods hold considerable promise for off-policy evaluation in
Markov decision processes (MDPs) under sequential ignorability: They have been
shown to converge as 1/√(T) with the horizon T, to be statistically
efficient in large samples, and to allow for modular implementation where
preliminary estimation tasks can be executed using standard reinforcement
learning techniques. Existing results, however, make heavy use of a strong
distributional overlap assumption whereby the stationary distributions of the
target policy and the data-collection policy are within a bounded factor of
each other – and this assumption is typically only credible when the state
space of the MDP is bounded. In this paper, we re-visit the task of off-policy
evaluation in MDPs under a weaker notion of distributional overlap, and
introduce a class of truncated doubly robust (TDR) estimators which we find to
perform well in this setting. When the distribution ratio of the target and
data-collection policies is square-integrable (but not necessarily bounded),
our approach recovers the large-sample behavior previously established under
strong distributional overlap. When this ratio is not square-integrable, TDR is
still consistent but with a slower-than-1/√(T); furthermore, this rate of
convergence is minimax over a class of MDPs defined only using mixing
conditions. We validate our approach numerically and find that, in our
experiments, appropriate truncation plays a major role in enabling accurate
off-policy evaluation when strong distributional overlap does not hold.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要