Block policy mirror descent

arXiv (2023)

Abstract
In this paper, we present a new policy gradient (PG) method, namely, the block policy mirror descent (BPMD) method, for solving a class of regularized reinforcement learning (RL) problems with (strongly) convex regularizers. Compared to traditional PG methods with a batch update rule, which visit and update the policy at every state, the BPMD method has cheap per-iteration computation via a partial update rule that performs the policy update on a single sampled state. Despite the nonconvex nature of the problem and the partial update rule, we provide a unified analysis for several sampling schemes and show that BPMD achieves fast linear convergence to global optimality. In particular, uniform sampling leads to a worst-case total computational complexity comparable to that of batch PG methods. A necessary and sufficient condition for convergence with on-policy sampling is also identified. With a hybrid sampling scheme, we further show that BPMD enjoys potential instance-dependent acceleration, leading to improved dependence on the state space and consequently outperforming batch PG methods. We then extend BPMD methods to the stochastic setting by utilizing stochastic first-order information constructed from samples. With a generative model, $\tilde{\mathcal{O}}(|S||A|/\epsilon)$ (resp., $\tilde{\mathcal{O}}(|S||A|/\epsilon^2)$) sample complexities are established for strongly convex (resp., non-strongly convex) regularizers, where $\epsilon$ denotes the target accuracy. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning, which provides a new perspective on solving large-scale RL problems.
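To make the partial update rule concrete, below is a minimal sketch of the BPMD idea under simplifying assumptions: a small tabular MDP with a known model, exact policy evaluation, no regularizer, and uniform state sampling. The function names (`random_mdp`, `policy_eval`, `bpmd`) and the random MDP are illustrative, not the paper's implementation.

```python
# Minimal BPMD-style sketch: at each iteration, only the policy at ONE sampled
# state is refreshed with a KL-proximal (mirror descent / multiplicative-weights)
# step, instead of a batch update over all states. Illustrative only.
import numpy as np

def random_mdp(n_states, n_actions, gamma=0.9, seed=0):
    """Generate a random tabular MDP (P[s, a, s'], r[s, a]) for illustration."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    r = rng.uniform(size=(n_states, n_actions))
    return P, r, gamma

def policy_eval(P, r, gamma, pi):
    """Exact evaluation: V = (I - gamma * P_pi)^{-1} r_pi, then Q(s, a)."""
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sax->sx", pi, P)      # state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, r)        # expected one-step reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum("sax,x->sa", P, V)
    return Q

def bpmd(P, r, gamma, n_iters=2000, stepsize=1.0, seed=0):
    """Block (partial) policy mirror descent with uniform state sampling."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = r.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform initial policy
    for _ in range(n_iters):
        s = rng.integers(n_states)              # sample a single state uniformly
        Q = policy_eval(P, r, gamma, pi)        # first-order information
        logits = np.log(pi[s]) + stepsize * Q[s]  # KL-proximal step at state s only
        logits -= logits.max()                  # numerical stability
        pi[s] = np.exp(logits) / np.exp(logits).sum()
    return pi

if __name__ == "__main__":
    P, r, gamma = random_mdp(n_states=10, n_actions=4)
    pi = bpmd(P, r, gamma)
    print("greedy actions of learned policy:", pi.argmax(axis=1))
```

Each iteration touches a single row of the policy table, which is the source of the cheap per-iteration cost discussed in the abstract; a batch PG method would instead recompute and update all `n_states` rows per iteration.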
Keywords
Markov decision process, reinforcement learning, policy gradient, mirror descent, block coordinate descent, iteration and sample complexity