Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Journal of Machine Learning Research (2023)

Abstract
Model-based reinforcement learning (RL), which finds an optimal policy after establishing an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and planning phases and avoids the non-stationarity problem that arises when all agents improve their policies simultaneously. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, we aim to address this fundamental question about its sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde{O}\big(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2}\big)$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the $\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount factor, and $S$, $A$, $B$ denote the state space and the action spaces of the two agents. We further show that this sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, i.e., it queries state transition samples without knowledge of the reward, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, where the sample complexity lower bound is $\tilde{\Omega}\big(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2}\big)$, so this model-based approach is near-optimal with only a gap in the $|A|,|B|$ dependence. Our results not only illustrate the sample efficiency of this basic model-based MARL approach, but also elaborate on the fundamental tradeoff between its power (easily handling the reward-agnostic case) and its limitation (being less adaptive and suboptimal in $|A|,|B|$), which particularly arises in the multi-agent context.
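To make the model-based recipe concrete, the sketch below illustrates the generic two-phase approach the abstract describes: query the generative model a fixed number of times per state-action pair to build an empirical transition kernel, then plan on that empirical model with Shapley-style value iteration, solving a zero-sum matrix game at each state. This is a minimal illustration, not the paper's exact algorithm or analysis; the `sampler` callback, the array shapes, and the known reward tensor `R` are hypothetical assumptions for the example (in the reward-agnostic setting, rewards would be attached only after the transition samples are collected).

```python
# Minimal sketch of model-based planning in a two-player zero-sum discounted
# Markov game with a generative model. Assumed (hypothetical) interfaces:
#   sampler(s, a, b, rng) -> next-state index drawn from P(. | s, a, b)
#   R[s, a, b]            -> reward to the max player (shape (S, A, B))
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Minimax value of the zero-sum matrix game Q (rows = max player)."""
    A, B = Q.shape
    # Variables: row mixed strategy x (length A) and the game value v.
    c = np.zeros(A + 1); c[-1] = -1.0               # maximize v  <=>  minimize -v
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])       # v - x^T Q[:, b] <= 0 for all b
    b_ub = np.zeros(B)
    A_eq = np.zeros((1, A + 1)); A_eq[0, :A] = 1.0  # x sums to 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * A + [(None, None)], method="highs")
    return res.x[-1]

def estimate_model(sampler, S, A, B, n_samples, rng):
    """Empirical transition kernel from n_samples generative-model calls per (s, a, b)."""
    P_hat = np.zeros((S, A, B, S))
    for s in range(S):
        for a in range(A):
            for b in range(B):
                for _ in range(n_samples):
                    P_hat[s, a, b, sampler(s, a, b, rng)] += 1.0
    return P_hat / n_samples

def plan_nash_value(P_hat, R, gamma, iters=200):
    """Shapley-style value iteration on the empirical model; returns the NE value estimate."""
    S, A, B, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P_hat @ V                   # Q[s, a, b], Bellman backup
        V = np.array([matrix_game_value(Q[s]) for s in range(S)])
    return V
```

The decoupling the abstract emphasizes is visible here: `estimate_model` is the only place samples are used, and `plan_nash_value` is pure planning on the fixed empirical model, so non-stationarity from simultaneously improving agents never enters the learning phase.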
Keywords
Multi-Agent RL, Zero-Sum Markov Games, Near-Optimal Sample Complexity