Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach

arXiv (2022)

Abstract
We propose a unified framework to study policy evaluation (PE) and the associated temporal-difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean-square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a "martingale loss function", whose solution is proved to be the best approximation of the true value function in the mean-square sense. This method interprets the classical gradient Monte Carlo algorithm. The second method is based on a system of equations called the "martingale orthogonality conditions" with test functions. Solving these equations in different ways recovers various classical TD algorithms, such as TD(λ), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero, and we provide the convergence rate. We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications.
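
To make the martingale characterization concrete, the following is a sketch in notation of my own labeling rather than a quotation from the paper: β is the discount rate, r the running reward, h the terminal reward, J the value function of the fixed policy, J^θ a parametrized approximation, and ξ a test process. The abstract confirms these ingredients but not these exact displays.

Martingale characterization of PE:
  M_t := e^{-\beta t} J(t, X_t) + \int_0^t e^{-\beta s} r(s, X_s)\, ds  is a martingale on [0, T].

Martingale loss (its minimizer is the best mean-square approximation of the true value function):
  ML(\theta) := \tfrac{1}{2}\, \mathbb{E}\!\left[ \int_0^T \left( e^{-\beta T} h(X_T) + \int_t^T e^{-\beta s} r(s, X_s)\, ds - e^{-\beta t} J^\theta(t, X_t) \right)^{2} dt \right].

Martingale orthogonality conditions (M^θ is M with J replaced by J^θ):
  \mathbb{E}\!\left[ \int_0^T \xi_t \, dM_t^\theta \right] = 0  for all admissible test processes ξ.

In this reading, the test process plays the role that the eligibility or gradient term plays in discrete-time TD, and solving the system by stochastic approximation versus least squares is what distinguishes, for instance, TD(λ)-style updates from LSTD.

A minimal Monte Carlo sketch of minimizing a time-discretized martingale loss with a linear-in-features value approximation is given below; the dynamics, reward, features, and hyperparameters are illustrative stand-ins, not taken from the paper.

import numpy as np

# Sketch: gradient descent on a discretized martingale loss for policy evaluation.
# Dynamics, reward, features, and hyperparameters are illustrative assumptions.

rng = np.random.default_rng(0)

T, dt, beta = 1.0, 0.01, 0.1             # horizon, time step, discount rate
n_steps = int(T / dt)

def features(t, x):
    # polynomial features in (t, x); an assumed parametrization J^theta = phi(t, x) . theta
    return np.array([1.0, t, x, x * x])

def simulate_path():
    # Ornstein-Uhlenbeck-type state under a fixed (implicit) policy
    xs = np.empty(n_steps + 1)
    xs[0] = rng.normal()
    for k in range(n_steps):
        xs[k + 1] = xs[k] - 0.5 * xs[k] * dt + 0.3 * np.sqrt(dt) * rng.normal()
    return xs

def reward(x):
    return -x * x                         # running reward; terminal reward taken as 0

theta = np.zeros(4)
lr = 0.05

for episode in range(2000):
    xs = simulate_path()
    ts = np.arange(n_steps + 1) * dt
    disc = np.exp(-beta * ts)
    # discounted running reward per interval, then realized payoff-to-go from each time t
    running = disc[:-1] * reward(xs[:-1]) * dt
    payoff_to_go = np.concatenate([np.cumsum(running[::-1])[::-1], [0.0]])
    grad = np.zeros_like(theta)
    for k in range(n_steps):
        phi = features(ts[k], xs[k])
        resid = payoff_to_go[k] - disc[k] * (phi @ theta)
        grad += -resid * disc[k] * phi * dt   # d/dtheta of 0.5 * resid^2, integrated over t
    theta -= lr * grad / n_steps

print("fitted value-function weights:", theta)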
Keywords
martingale approach, policy, continuous time