Handling Cost and Constraints with Off-Policy Deep Reinforcement Learning
CoRR (2023)
Abstract
By reusing data throughout training, off-policy deep reinforcement learning
algorithms offer improved sample efficiency relative to on-policy approaches.
For continuous action spaces, the most popular methods for off-policy learning
include policy improvement steps where a learned state-action ($Q$) value
function is maximized over selected batches of data. These updates are often
paired with regularization to combat associated overestimation of $Q$ values.
With an eye toward safety, we revisit this strategy in environments with
"mixed-sign" reward functions; that is, with reward functions that include
independent positive (incentive) and negative (cost) terms. This setting is
common in real-world applications, and may be addressed with or without
constraints on the cost terms. We find the combination of function
approximation and a term that maximizes $Q$ in the policy update to be
problematic in such environments, because systematic errors in value estimation
impact the contributions from the competing terms asymmetrically. This results
in overemphasis of either incentives or costs and may severely limit learning.
We explore two remedies to this issue. First, consistent with prior work, we
find that periodic resetting of $Q$ and policy networks can be used to reduce
value estimation error and improve learning in this setting. Second, we
formulate novel off-policy actor-critic methods for both unconstrained and
constrained learning that do not explicitly maximize $Q$ in the policy update.
We find that this second approach, when applied to continuous action spaces
with mixed-sign rewards, consistently and significantly outperforms
state-of-the-art methods augmented by resetting. We further find that our
approach produces agents that are both competitive with popular methods overall
and more reliably competent on frequently studied control problems that do not
have mixed-sign rewards.
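
The overestimation that the abstract associates with maximizing a learned $Q$ function can be illustrated with a small, self-contained sketch. It is not the paper's method or experimental setup: the function names, the Gaussian noise model, and all parameter values are assumptions chosen for illustration. Every action has a true value of zero, yet taking the maximum over noisy estimates yields a positive value on average.

```python
import random
import statistics


def max_of_noisy_estimates(true_values, noise_std, rng):
    """Return the maximum over noisy estimates of the true action values."""
    return max(v + rng.gauss(0.0, noise_std) for v in true_values)


def estimate_maximization_bias(num_actions=10, noise_std=1.0, trials=5000, seed=0):
    """Average gap between the max of noisy estimates and the true max.

    Assumption for illustration: estimation error is zero-mean Gaussian
    noise, independent across actions. Real Q-function errors need not be.
    """
    rng = random.Random(seed)
    true_values = [0.0] * num_actions  # every action is truly worth 0
    maxima = [
        max_of_noisy_estimates(true_values, noise_std, rng)
        for _ in range(trials)
    ]
    return statistics.mean(maxima) - max(true_values)


if __name__ == "__main__":
    bias = estimate_maximization_bias()
    print(f"average overestimation with 10 actions: {bias:.2f}")
```

With a single action there is nothing to maximize over, so the same function returns a gap close to zero. The bias comes from the maximization step itself, not from the noise alone.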