Efficient policy detecting and reusing for non-stationarity in Markov games

AUTONOMOUS AGENTS AND MULTI-AGENT SYSTEMS(2020)

Abstract
One challenging problem in multiagent systems is cooperating or competing with non-stationary agents that change their behavior over time. An agent in such a non-stationary environment is usually expected to quickly detect the other agents’ policies during online interaction and then adapt its own policy accordingly. This article studies efficient policy detection and reuse techniques for playing against non-stationary agents in cooperative or competitive Markov games. We propose a new deep Bayesian policy reuse algorithm, called DPN-BPR+, which extends the recent BPR+ algorithm with a neural network as the value-function approximator. To detect policies accurately, we propose a rectified belief model that exploits an opponent model to infer the other agents’ policy from both reward signals and their behavior. Instead of directly storing individual policies as BPR+ does, we introduce a distilled policy network that serves as the policy library, and use policy distillation to achieve efficient online policy learning and reuse. DPN-BPR+ inherits all the advantages of BPR+. In experiments, we evaluate DPN-BPR+ in terms of detection accuracy, cumulative reward and speed of convergence in four complex Markov games with raw visual inputs, including two cooperative games and two competitive games. Empirical results show that our proposed DPN-BPR+ approach outperforms existing algorithms in all these Markov games.
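As a rough illustration of the detection step described above, the sketch below shows a Bayes-rule belief update over a library of known opponent policies that combines a reward-based likelihood with an opponent-model likelihood. This is a minimal sketch under assumptions, not the paper's exact formulation: the function names, the mixing weight, and the uniform reset on zero evidence are all hypothetical choices for illustration.

```python
# Minimal sketch (not the authors' implementation) of a BPR+-style
# Bayesian belief update for opponent-policy detection. The names
# `reward_likelihood` and `opponent_model_likelihood` are hypothetical
# placeholders for the performance model and the opponent model.

import numpy as np

def update_belief(belief, reward_likelihood, opponent_model_likelihood, mix=0.5):
    """Update the belief over the other agent's known policies.

    belief                    : prior probability of each known opponent policy
    reward_likelihood         : P(observed reward | opponent policy)
    opponent_model_likelihood : P(observed opponent actions | opponent policy),
                                used to "rectify" the reward-only signal
    mix                       : weight between the two evidence sources (assumption)
    """
    evidence = mix * reward_likelihood + (1.0 - mix) * opponent_model_likelihood
    posterior = belief * evidence
    total = posterior.sum()
    if total == 0.0:
        # No known policy explains the evidence; fall back to a uniform belief.
        return np.full_like(belief, 1.0 / len(belief))
    return posterior / total


if __name__ == "__main__":
    prior = np.array([0.5, 0.3, 0.2])        # belief over 3 known opponent policies
    reward_lik = np.array([0.1, 0.7, 0.2])   # from the learned performance model
    action_lik = np.array([0.2, 0.6, 0.2])   # from the opponent model
    print(update_belief(prior, reward_lik, action_lik))
```

In this sketch, the agent would select (or distill into) the response policy associated with the most probable opponent policy after each update; the actual weighting of the two evidence sources in DPN-BPR+ may differ from the simple convex combination used here.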
Keywords
Non-stationary agents, Deep reinforcement learning, Opponent modeling, Bayesian policy reuse