DROP: Conservative Model-based Optimization for Offline Reinforcement Learning

ICLR 2023

Abstract
In this work, we decouple the iterative (bi-level) offline RL optimization from the offline training phase, forming a non-iterative bi-level learning paradigm that avoids iterative error propagation across the two levels. Specifically, this non-iterative paradigm allows us to conduct the inner-level optimization during training (i.e., employing policy/value regularization), while performing the outer-level optimization during testing (i.e., conducting policy inference). Naturally, such a paradigm raises three core questions that are not fully answered by prior non-iterative methods: (Q1) What information should we transfer from the inner level to the outer level? (Q2) What should we pay attention to when using the transferred information in the outer-level optimization? (Q3) What are the benefits of concurrently conducting the outer-level optimization during testing? Motivated by model-based optimization, we propose DROP, which answers all three questions. Concretely, in the inner level, DROP decomposes the offline data into multiple subsets and learns a score model (Q1). To enable safe exploitation of the score model in the outer level, we explicitly learn a behavior embedding and introduce a conservative regularization (Q2). During testing, we show that DROP permits deployment adaptation, enabling adaptive inference across states (Q3). Empirically, we evaluate DROP on various benchmarks, showing that DROP achieves comparable or better performance than prior offline RL methods.
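To make the outer-level (test-time) step described in the abstract more concrete, the minimal PyTorch sketch below illustrates one plausible form of conservative, per-state inference: candidate embeddings are scored by a learned score model and penalized by their distance to the behavior embedding, and the best candidate conditions the policy. All names and signatures here (score_model, behavior_encoder, policy, beta, embed_dim) are hypothetical placeholders assumed for illustration, not the paper's actual interface.

```python
import torch

def adaptive_inference(state, score_model, behavior_encoder, policy,
                       num_candidates=64, beta=1.0, embed_dim=16):
    """Outer-level (test-time) optimization sketch: choose the embedding that
    maximizes the learned score while staying close to the behavior embedding
    (a conservative regularization), then act with the conditioned policy.

    Assumed (hypothetical) components:
      score_model(states, zs)  -> per-candidate scalar scores
      behavior_encoder(state)  -> behavior embedding of shape (embed_dim,)
      policy(state, z)         -> action conditioned on state and embedding z
    """
    with torch.no_grad():
        z_behavior = behavior_encoder(state)                       # (embed_dim,)
        # Sample candidate embeddings around the behavior embedding.
        candidates = z_behavior + 0.1 * torch.randn(num_candidates, embed_dim)
        # Score every candidate at the current state.
        scores = score_model(state.expand(num_candidates, -1), candidates)
        # Conservative penalty: discourage drifting far from the behavior embedding.
        penalty = beta * (candidates - z_behavior).pow(2).sum(dim=-1)
        best = torch.argmax(scores - penalty)
        return policy(state, candidates[best])
```

Because this selection is rerun at every state encountered during deployment, it reflects the adaptive, state-wise inference (Q3) the abstract refers to, under the stated assumptions about the learned components.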