XCS with Combined Reward Method (XCSCR) for Policy Search in Multistep Problems
2019 IEEE Congress on Evolutionary Computation (CEC)
Abstract
A reward mechanism is critical for a Reinforcement Learning agent to learn action policies from rewards. The reward mechanism establishes a policy by estimating the contribution of each constituent of the policy to a reward. Traditionally, rewards from an environment fall into two categories: long-term rewards, which guide the policy-learning process, and short-term rewards, which drive optimisation. However, long-term positive rewards are scarce during the initial learning phase of multistep problems, so existing reward mechanisms lack sufficient stimulus to learn policies effectively. This paper proposes XCSCR, an Accuracy-based Learning Classifier System (XCS) algorithm with a combined reward (CR) method, to guide the search for globally optimal policies in multistep maze problems. XCSCR discriminates between long-term and short-term rewards through four novel reward-assignment mechanisms: 1) A short-term reward mechanism encourages the RL agent to explore when searching for policies based on short-term rewards. 2) An imprinting mechanism mitigates the negative impact of indiscriminate rewards between exploration and exploitation. 3) A learning-rate switching mechanism emphasises the impact of long-term positive rewards during the policy search. 4) A learning step-threshold mechanism creates optimisation pressure on policies. Experiments were conducted in three maze environments, as this allowed the effects of XCSCR on policies to be interpreted easily. Results show that XCSCR learns the optimal path-finding policies more quickly and more often than previous XCS algorithms. XCSCR's improvements to the policy search will facilitate real-world applications, e.g. robotic applications.
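The abstract does not give the exact update equations, but the general idea of combining short-term and long-term rewards with a learning-rate switch can be sketched as follows. This is a hypothetical illustration, not the paper's algorithm: the function names, the per-step versus goal reward split, and the `beta_default`/`beta_boost` values are all assumptions, and the prediction update is the standard Widrow-Hoff rule used in XCS.

```python
# Hypothetical sketch of a combined-reward update (not the paper's exact method).
# Assumptions: a small short-term reward arrives every step, the long-term reward
# arrives only when the goal is reached, and the learning rate switches to a
# higher value when a positive long-term reward is observed.

def combined_reward(short_term: float, long_term: float, goal_reached: bool) -> float:
    """Select the reward signal for this step (illustrative only)."""
    return long_term if goal_reached else short_term

def update_prediction(p: float, reward: float, goal_reached: bool = False,
                      beta_default: float = 0.2, beta_boost: float = 0.8) -> float:
    """Widrow-Hoff prediction update with a learning-rate switch:
    a positive long-term reward is weighted more heavily."""
    beta = beta_boost if goal_reached and reward > 0 else beta_default
    return p + beta * (reward - p)
```

For example, a classifier with prediction 0 that receives a goal reward of 1000 would jump to 800 under the boosted rate, versus 200 under the default rate, which is one plausible way to make scarce long-term positive rewards dominate the early search.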
Keywords
reward mechanism, scarce/sparse reward, Reinforcement Learning, multistep, maze problems, XCS, XCSCR