Distributional Off-policy Evaluation with Bellman Residual Minimization
CoRR (2024)
Abstract
We consider the problem of distributional off-policy evaluation, which serves
as the foundation of many distributional reinforcement learning (DRL)
algorithms. In contrast to most existing works, which rely on supremum-extended
statistical distances such as the supremum-Wasserstein distance, we study the
expectation-extended statistical distance for quantifying distributional
Bellman residuals and show that it upper bounds the expected error of
estimating the return distribution. Based on this appealing property, by
extending the framework of Bellman residual minimization to DRL, we propose a
method called Energy Bellman Residual Minimizer (EBRM) to estimate the return
distribution. We establish a finite-sample error bound for the EBRM estimator
under the realizability assumption. Furthermore, we introduce a variant of our
method based on a multi-step bootstrapping procedure. By selecting an
appropriate step level, this variant of EBRM attains a tighter error bound
than single-step EBRM under some non-realizability settings. Finally, we
demonstrate the superior performance of our method through simulation studies,
in comparison with several existing methods.
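The abstract does not give implementation details, but the core quantity the method's name points to, an energy-distance Bellman residual, can be illustrated. The sketch below is a minimal, hypothetical illustration and not the authors' EBRM implementation: it estimates the energy distance between samples from a current return-distribution estimate and samples from the one-step distributional Bellman target R + γ·G(S′, A′). The function names and sampling setup are assumptions made for illustration only.

```python
import numpy as np

def energy_distance(x, y):
    """Sample-based energy distance between two 1-D samples.

    Uses the standard identity E(P, Q) = 2*E|X - Y| - E|X - X'| - E|Y - Y'|,
    with X, X' ~ P and Y, Y' ~ Q drawn independently.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cross = np.abs(x[:, None] - y[None, :]).mean()      # E|X - Y|
    within_x = np.abs(x[:, None] - x[None, :]).mean()   # E|X - X'|
    within_y = np.abs(y[:, None] - y[None, :]).mean()   # E|Y - Y'|
    return 2.0 * cross - within_x - within_y

def bellman_residual_loss(return_samples, rewards, next_return_samples,
                          gamma=0.99):
    """Hypothetical energy-distance Bellman residual (not the paper's EBRM).

    return_samples:      (n, m) samples from the current estimate G(S, A)
    rewards:             (n,)   observed one-step rewards
    next_return_samples: (n, m) samples from the estimate at (S', A')
    """
    # One-step distributional Bellman target: R + gamma * G(S', A').
    target = rewards[:, None] + gamma * next_return_samples
    return energy_distance(return_samples.ravel(), target.ravel())
```

Minimizing such a loss over a function class is one plausible reading of "Bellman residual minimization under an expectation-extended distance"; the paper's actual estimator, objective, and multi-step bootstrapping variant are specified in the full text.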