DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
NeurIPS, pp. 2315-2325, 2019.
In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset rather than direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios, i.e. correction terms which qu...
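To make the notion of a "discounted stationary distribution correction" concrete, here is a minimal sketch on a tiny tabular MDP. It is only an illustration under stated assumptions: the transitions `P`, rewards `R`, initial distribution `MU0`, target policy `PI` and dataset distribution `D_DATA` are hypothetical values, and the ratio is computed exactly from the known model. It uses the standard definitions d^π(s,a) = (1 - γ) Σ_t γ^t Pr(s_t = s, a_t = a) and w(s,a) = d^π(s,a) / d^D(s,a), and checks that reweighting dataset rewards by w recovers the normalized policy value (1 - γ) V^π(μ0). This is not the DualDICE estimator, which aims to learn w from the fixed dataset without knowing the behavior policy.

```python
import random

# All model numbers below are hypothetical values chosen only for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# P[s][a][s2]: probability of moving to s2 after taking action a in state s
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.05, 0.95]],
]
# R[s][a]: deterministic reward for taking action a in state s
R = [[0.0, 1.0], [0.5, 2.0]]
# MU0[s]: initial state distribution
MU0 = [1.0, 0.0]
# PI[s][a]: target policy whose value we want to estimate
PI = [[0.3, 0.7], [0.6, 0.4]]
# D_DATA[s][a]: (s, a) distribution of the fixed dataset; must cover every pair
D_DATA = [[0.25, 0.25], [0.25, 0.25]]


def discounted_stationary(pi, iters=2000):
    """d^pi(s, a) = (1 - gamma) * sum_t gamma^t Pr(s_t = s, a_t = a).

    Solved by iterating the flow equation
    d(s', a') = pi(a'|s') * [(1 - gamma) mu0(s') + gamma * sum_{s,a} P(s'|s,a) d(s,a)].
    """
    d = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(iters):
        inflow = [
            (1 - GAMMA) * MU0[s2]
            + GAMMA * sum(d[s][a] * P[s][a][s2] for s in STATES for a in ACTIONS)
            for s2 in STATES
        ]
        d = [[inflow[s2] * pi[s2][a2] for a2 in ACTIONS] for s2 in STATES]
    return d


def policy_value(pi, iters=2000):
    """Discounted value V^pi(mu0), via iterative policy evaluation on Q."""
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(iters):
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        q = [
            [R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES) for a in ACTIONS]
            for s in STATES
        ]
    v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
    return sum(MU0[s] * v[s] for s in STATES)


d_pi = discounted_stationary(PI)

# Correction ratio w(s, a) = d^pi(s, a) / d^D(s, a)
w = [[d_pi[s][a] / D_DATA[s][a] for a in ACTIONS] for s in STATES]

# Reweighting by w turns an expectation under d^D into one under d^pi:
# E_{d^D}[w * r] = E_{d^pi}[r] = (1 - gamma) * V^pi(mu0)
estimate = sum(D_DATA[s][a] * w[s][a] * R[s][a] for s in STATES for a in ACTIONS)
true_value = (1 - GAMMA) * policy_value(PI)

# The same quantity estimated from a sampled "fixed dataset" of (s, a, r) tuples
rng = random.Random(0)
pairs = [(s, a) for s in STATES for a in ACTIONS]
weights = [D_DATA[s][a] for s, a in pairs]
dataset = [(s, a, R[s][a]) for s, a in rng.choices(pairs, weights=weights, k=100_000)]
sample_estimate = sum(w[s][a] * r for s, a, r in dataset) / len(dataset)

print(f"exact reweighted: {estimate:.4f}  true: {true_value:.4f}  sampled: {sample_estimate:.4f}")
```

The exact model is used here only so the ratio can be checked against ground truth. In the setting the abstract describes, P and the behavior policy are unknown, so w must instead be estimated from the dataset itself, which is the problem DualDICE addresses.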