Approximate Exploitability: Learning a Best Response

European Conference on Artificial Intelligence (2021)

Abstract
A standard metric used to measure the approximate optimality of policies in imperfect information games is exploitability, i.e., the performance of a policy against its worst-case opponent. However, exploitability is intractable to compute in large games, as it requires a full traversal of the game tree to calculate a best response to the given policy. We introduce a new metric, approximate exploitability, which computes an analogous quantity using an approximate best response; the approximate best response is obtained with search and reinforcement learning. This is a generalization of local best response, a domain-specific evaluation metric used in poker. We provide empirical results for a specific instance of the method, demonstrating that it converges to exploitability in both the tabular and function approximation settings for small games. In large games, our method learns to exploit both strong and weak agents, including an AlphaZero agent.
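
For context, exploitability is commonly formulated as the average gain each player could obtain by deviating to an exact best response; the approximate variant described in the abstract substitutes a best response learned via search and reinforcement learning, so the resulting estimate lower-bounds the true value. The following is a minimal sketch in standard notation, not taken from the paper: the symbols u_i (player i's expected utility), \pi_{-i} (the other players' policies), and \hat{\pi}^{BR}_i (the learned approximate best response) are assumptions made here for illustration.

\[
\mathrm{Expl}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n}\Bigl(\max_{\pi_i'} u_i(\pi_i', \pi_{-i}) - u_i(\pi)\Bigr),
\qquad
\widehat{\mathrm{Expl}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n}\Bigl(u_i(\hat{\pi}^{\mathrm{BR}}_i, \pi_{-i}) - u_i(\pi)\Bigr) \;\le\; \mathrm{Expl}(\pi).
\]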
Keywords
Machine Learning: Reinforcement Learning, Agent-based and Multi-agent Systems: Multi-agent Learning, Agent-based and Multi-agent Systems: Noncooperative Games