Using Options for Long-Horizon Off-Policy Evaluation.

arXiv: Artificial Intelligence (2017)

Abstract
Evaluating a policy by deploying it in the real world can be risky and costly. Off-policy evaluation (OPE) algorithms use historical data collected from running a previous policy to evaluate a new policy, which provides a means for evaluating a policy without requiring it to ever be deployed. Importance sampling is a popular OPE method because it is robust to partial observability and works with continuous states and actions. However, we show that the amount of historical data required by importance sampling can scale exponentially with the horizon of the problem: the number of sequential decisions that are made. We propose using policies over temporally extended actions, called options, to address this long-horizon problem. We show theoretically and experimentally that combining importance sampling with options-based policies can significantly improve performance for long-horizon problems.
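To make the horizon issue concrete, the sketch below shows the ordinary (per-trajectory) importance sampling estimator that the abstract refers to, written as a minimal, self-contained Python function. The names here (`is_estimate`, `pi_e`, `pi_b`) are illustrative assumptions for this sketch, not the authors' implementation; the point is only that the importance weight is a product with one likelihood ratio per decision, which is the source of the exponential-in-horizon data requirement.

```python
import numpy as np

# Minimal sketch of ordinary (per-trajectory) importance sampling for
# off-policy evaluation. `pi_e(s, a)` and `pi_b(s, a)` are assumed to
# return the probability of action `a` in state `s` under the evaluation
# and behavior policies, respectively (illustrative names only).

def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Estimate the evaluation policy's expected return from behavior data.

    trajectories: list of trajectories, each a list of (s, a, r) tuples
        collected by running the behavior policy.
    """
    estimates = []
    for traj in trajectories:
        weight = 1.0   # product of per-step likelihood ratios
        ret = 0.0      # discounted return of this trajectory
        for t, (s, a, r) in enumerate(traj):
            # One ratio per decision: with horizon T the weight is a
            # product of T terms, which is why its variance (and hence
            # the data required) can grow exponentially with the horizon.
            weight *= pi_e(s, a) / pi_b(s, a)
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# If the policies instead select temporally extended options, a length-T
# trajectory contains far fewer decision points (roughly T / k for average
# option duration k), so the importance weight has far fewer factors.
```

This illustrates the paper's motivation rather than its full method: replacing primitive actions with options shortens the product of likelihood ratios, which is what makes importance sampling viable for long-horizon problems.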