Nearly Optimal Reward-Free Reinforcement Learning

International Conference on Machine Learning (ICML), Vol. 139, 2021

Abstract
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases: in the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal; in the planning phase, the agent must return a near-optimal policy for arbitrary reward functions. We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment for at most O(S^2 A / epsilon^2 * polylog(SAH / epsilon)) episodes in the exploration phase and is guaranteed to output a near-optimal policy for arbitrary reward functions in the planning phase, where S is the size of the state space, A is the size of the action space, H is the planning horizon, and epsilon is the target accuracy relative to the total reward. Notably, our sample complexity scales only logarithmically with H, in contrast to all existing results, which scale polynomially with H. Furthermore, this bound matches the minimax lower bound Omega(S^2 A / epsilon^2) up to logarithmic factors. Our results rely on three new techniques: 1) a new sufficient condition for the dataset to allow planning for an epsilon-suboptimal policy; 2) a new way to plan efficiently under the proposed condition using soft-truncated planning; 3) constructing an extended MDP to maximize the truncated cumulative rewards efficiently.
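
To make the two-phase framework concrete, here is a minimal sketch of a reward-free pipeline for a tabular episodic MDP. This is not the paper's SSTP algorithm: the exploration policy below is a uniform-random placeholder and the planning phase is plain value iteration on the empirical model; the environment interface (env, num_states, num_actions, horizon) is an illustrative assumption.

import numpy as np

def explore(env, num_states, num_actions, horizon, num_episodes):
    """Exploration phase: collect transition counts without using any reward signal."""
    counts = np.zeros((num_states, num_actions, num_states))
    for _ in range(num_episodes):
        s = env.reset()
        for _ in range(horizon):
            a = np.random.randint(num_actions)   # placeholder exploration policy, not SSTP
            s_next = env.step(a)                 # only the next state is observed, no reward
            counts[s, a, s_next] += 1
            s = s_next
    return counts

def plan(counts, reward, horizon):
    """Planning phase: given any reward function r(s, a) of shape (S, A),
    run value iteration on the empirical transition model."""
    S, A, _ = counts.shape
    totals = counts.sum(axis=2, keepdims=True)
    p_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
    V = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for h in reversed(range(horizon)):
        Q = reward + p_hat @ V                   # Bellman backup, shape (S, A)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

The same collected dataset (counts) can be reused in plan() for many different reward functions, which is the point of the reward-free setting.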