Finite-Sample Regret Bound for Distributionally Robust Offline Tabular Reinforcement Learning.

AISTATS (2021)

Abstract
While reinforcement learning has witnessed tremendous success recently in a wide range of domains, robustness, or the lack thereof, remains an important issue that has not been fully explored. In this paper, we provide a distributionally robust formulation of offline policy learning in tabular RL that aims to learn, from historical data collected by some other behavior policy, a policy that is robust to a future environment that may deviate from the training environment. We first develop a novel policy evaluation scheme that accurately estimates the robust value of any given policy (i.e., how well it performs in a perturbed environment) and establish its finite-sample estimation error. Building on this, we then develop a novel and minimax-optimal distributionally robust learning algorithm that achieves $O_P(1/\sqrt{n})$ regret, meaning that with high probability, the policy learned using n training data points will be $O(1/\sqrt{n})$-close to the optimal distributionally robust policy. Finally, our simulation results demonstrate the superiority of our distributionally robust approach compared to non-robust RL algorithms.
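The abstract does not specify the uncertainty set or the exact estimator, but the following minimal sketch illustrates the general idea of robust policy evaluation in a tabular MDP: the robust value of a fixed policy is computed by value iteration on a robust Bellman operator, where the next-state expectation is taken as the worst case over a KL-divergence ball around the empirical transition kernel. The KL choice, the radius delta, and all function names here are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def kl_worst_case_expectation(p0, v, delta):
    """Smallest expectation of v over all distributions p with KL(p || p0) <= delta.

    Uses the standard dual form
        sup_{beta > 0}  -beta * log E_{p0}[exp(-v / beta)] - beta * delta,
    solved here by a bounded 1-D search over beta (illustrative choice).
    """
    p0 = np.asarray(p0, dtype=float)
    v = np.asarray(v, dtype=float)
    support = p0 > 0                 # finite KL(p || p0) forces p << p0
    p0, v = p0[support], v[support]

    def neg_dual(beta):
        # log E_{p0}[exp(-v / beta)] in log-sum-exp form for numerical stability
        z = -v / beta
        m = z.max()
        log_mgf = m + np.log(np.dot(p0, np.exp(z - m)))
        return -(-beta * log_mgf - beta * delta)

    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e6), method="bounded")
    return -res.fun


def robust_policy_evaluation(P_hat, R, pi, delta, gamma=0.95, tol=1e-8, max_iter=10_000):
    """Robust value of a fixed policy pi in a tabular MDP (hypothetical sketch).

    Value iteration on a robust Bellman operator, with a KL ball of radius
    delta around each row P_hat[s, a, :] of the empirical transition kernel.
      P_hat: (S, A, S) empirical transition probabilities
      R:     (S, A)    rewards
      pi:    (S, A)    policy, rows summing to 1
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                worst = kl_worst_case_expectation(P_hat[s, a], V, delta)
                Q[s, a] = R[s, a] + gamma * worst
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new
```

Under this sketch, the learning step would select the policy whose robust value (as computed above) is largest; the paper's contribution is to show that, with a suitable estimator, this can be done with $O_P(1/\sqrt{n})$ regret relative to the optimal distributionally robust policy.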