Minimax Value Interval for Off-Policy Evaluation and Policy Optimization

NeurIPS (2020)

Abstract
We study minimax methods for off-policy evaluation (OPE) using value functions and marginalized importance weights. Although these methods hold the promise of overcoming the exponential variance of traditional importance sampling, several key problems remain: (1) They require function approximation and are generally biased. For the sake of trustworthy OPE, is there any way to quantify the biases? (2) They are split into two styles (“weight-learning” vs. “value-learning”). Can we unify them? In this paper we answer both questions positively. By slightly altering the derivation of previous methods (one from each style), we unify them into a single value interval that comes with a special type of double robustness: when either the value-function or the importance-weight class is well specified, the interval is valid, and its length quantifies the misspecification of the other class. Our interval also provides a unified view of, and new insights into, some recent methods, and we further explore the implications of our results for exploration and exploitation in off-policy policy optimization with insufficient data coverage.
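For orientation, the following is a minimal sketch of how an interval of this kind is typically formed in the weight/value minimax literature, written in standard notation that the abstract does not spell out (value-function class \(\mathcal{Q}\), importance-weight class \(\mathcal{W}\), behavior data distribution \(\mu\), initial-state distribution \(d_0\), discount \(\gamma\), and \(q(s',\pi)\) denoting \(\mathbb{E}_{a' \sim \pi(\cdot\,|\,s')}[q(s',a')]\)); the paper's exact estimators, normalizations, and finite-sample treatment may differ.

% Population-level Lagrangian coupling a candidate value function q in Q
% with a candidate importance weight w in W:
\[
L(q, w) \;=\; \mathbb{E}_{s_0 \sim d_0}\!\big[q(s_0, \pi)\big]
\;+\; \mathbb{E}_{(s,a,r,s') \sim \mu}\!\Big[\, w(s,a)\,\big(r + \gamma\, q(s', \pi) - q(s, a)\big) \Big].
\]
% If q = Q^\pi, the second term vanishes for every w, so L(Q^\pi, w) = J(\pi);
% with appropriately normalized weights, the true marginalized importance
% weight w_{\pi/\mu} likewise makes L(q, w_{\pi/\mu}) recover J(\pi) for every q.
% Hence the two opposing minimax values
\[
\min_{q \in \mathcal{Q}} \max_{w \in \mathcal{W}} L(q, w)
\qquad \text{and} \qquad
\max_{q \in \mathcal{Q}} \min_{w \in \mathcal{W}} L(q, w)
\]
% bracket J(\pi) whenever either \mathcal{Q} or \mathcal{W} is well specified,
% which is the "double robustness" described above; the gap between the two
% endpoints reflects the misspecification of the other class.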