DualDICE: Efficient Estimation of Off-Policy Stationary Distribution Corrections

Ofir Nachum, Yinlam Chow, Bo Dai, Lihong Li (2019)

Abstract
In many real-world reinforcement learning domains, access to the environment is limited to a fixed dataset rather than direct (online) interaction with the environment. When using this data for either evaluation or training of a new target policy, accurate estimates of stationary distribution ratios (correction terms that quantify the likelihood that the target policy will experience a given state-action pair, normalized by the probability with which that pair appears in the dataset) can improve accuracy and performance. In this work, we derive and study an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to collect the dataset. Furthermore, our algorithm eschews any use of importance weights, thus avoiding the potential optimization instabilities endemic to previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation, and we find that it yields significant accuracy improvements over competing techniques.
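
As a rough illustration of where these correction terms are used, the sketch below applies estimated ratios w(s, a) ≈ d^π(s, a) / d^D(s, a) to off-policy policy evaluation by reweighting rewards in the fixed dataset. This is a minimal Python sketch, not the paper's implementation; ope_estimate and ratio_fn are hypothetical names, with ratio_fn standing in for the output of a trained estimator such as DualDICE.

import numpy as np

def ope_estimate(states, actions, rewards, ratio_fn):
    """Ratio-weighted off-policy estimate of the target policy's average reward.

    ratio_fn(s, a) is assumed to return an estimated correction
    w(s, a) = d^pi(s, a) / d^D(s, a), e.g. from a trained DualDICE model.
    """
    weights = np.array([ratio_fn(s, a) for s, a in zip(states, actions)])
    rewards = np.asarray(rewards, dtype=float)
    # Self-normalizing by the total weight is a common variance-reduction
    # choice; dividing by len(rewards) gives the plain weighted average.
    return float(np.sum(weights * rewards) / np.sum(weights))

# Hypothetical usage with a toy dataset and a dummy ratio model:
states = [0, 1, 0]
actions = [1, 0, 1]
rewards = [1.0, 0.0, 0.5]
print(ope_estimate(states, actions, rewards, ratio_fn=lambda s, a: 1.0))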