Human-Level Performance in No-Press Diplomacy via Equilibrium Search

Jonathan Gray
Anton Bakhtin

ICLR 2021.

We present an agent that approximates a one-step equilibrium in no-press Diplomacy using no-regret learning and show that it exceeds human-level performance.

Abstract:

Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for t…

Introduction
  • A primary goal for AI research is to develop agents that can act optimally in real-world multi-agent interactions.
  • Real-world games, such as business negotiations, politics, and traffic navigation, involve a far more complex mixture of cooperation and competition.
  • In such settings, the theoretical grounding for the techniques used in previous AI breakthroughs falls apart.
  • Similar to the Pluribus poker agent (Brown & Sandholm, 2019b), the search technique uses external regret matching to compute an approximate equilibrium (a minimal sketch of this procedure appears after this list).
  • The recent focus on "Deep" multi-agent reinforcement learning (MARL) has led to learning rules from game theory, such as fictitious play and regret minimization, being adapted to deep reinforcement learning (Heinrich & Silver, 2016; Brown et al., 2019), as well as work on game-theoretic challenges of mixed cooperative/competitive settings such as social dilemmas and multiple equilibria in the MARL setting (Leibo et al., 2017; Lerer & Peysakhovich, 2017; 2019).
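The external regret matching referenced above can be illustrated with a small sketch. The snippet below is a generic Monte Carlo regret-matching loop (Hart & Mas-Colell, 2000) for an n-player one-shot game over small candidate-action sets; the utility functions, action counts, and iteration budget are illustrative assumptions, not the paper's implementation, which operates on sampled Diplomacy order sets and rollout-based values.

```python
import numpy as np

def regret_matching(utility_fns, num_actions, iters=1000, seed=0):
    """Monte Carlo external regret matching for an n-player one-shot game.

    utility_fns[p](joint) -> payoff for player p given `joint`, a tuple of
    one action index per player.  Returns each player's average strategy,
    which approximates an equilibrium of the one-step game.
    """
    rng = np.random.default_rng(seed)
    n = len(num_actions)
    regrets = [np.zeros(k) for k in num_actions]
    strategy_sums = [np.zeros(k) for k in num_actions]

    def current_strategy(p):
        pos = np.maximum(regrets[p], 0.0)
        total = pos.sum()
        return pos / total if total > 0 else np.full(num_actions[p], 1.0 / num_actions[p])

    for _ in range(iters):
        strategies = [current_strategy(p) for p in range(n)]
        joint = tuple(int(rng.choice(num_actions[p], p=strategies[p])) for p in range(n))
        for p in range(n):
            # Regret of each action = counterfactual payoff (holding the other
            # players' sampled actions fixed) minus the realized payoff.
            realized = utility_fns[p](joint)
            for a in range(num_actions[p]):
                alt = joint[:p] + (a,) + joint[p + 1:]
                regrets[p][a] += utility_fns[p](alt) - realized
            strategy_sums[p] += strategies[p]

    return [s / s.sum() for s in strategy_sums]

# Sanity check on rock-paper-scissors: both average strategies approach uniform.
payoff = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
fns = [lambda j: payoff[j[0], j[1]], lambda j: -payoff[j[0], j[1]]]
print(regret_matching(fns, num_actions=[3, 3], iters=5000))
```

Roughly speaking, in the paper's setting each of the seven powers runs such an update over a small set of candidate order sets sampled from the blueprint policy, and the averaged strategies define the policy played for the current turn.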
Highlights
  • A primary goal for AI research is to develop agents that can act optimally in real-world multi-agent interactions.
  • The recent focus on "Deep" multi-agent reinforcement learning (MARL) has led to learning rules from game theory, such as fictitious play and regret minimization, being adapted to deep reinforcement learning (Heinrich & Silver, 2016; Brown et al., 2019), as well as work on game-theoretic challenges of mixed cooperative/competitive settings such as social dilemmas and multiple equilibria in the MARL setting (Leibo et al., 2017; Lerer & Peysakhovich, 2017; 2019).
  • While external regret minimization has been behind previous AI breakthroughs in purely competitive games such as poker, it was never previously shown to be successful in a complex game involving cooperation.
  • The success of external regret matching (ERM) in no-press Diplomacy suggests that its use is not limited to purely adversarial games.
  • Combining search with reinforcement learning has led to tremendous success in perfect-information games (Silver et al., 2018) and more recently in two-player zero-sum imperfect-information games as well (Brown et al., 2020).
  • We show that our agent soundly defeats previous agents, that our agent is far less exploitable than previous agents, that an expert human cannot exploit our agent even in repeated play, and, most importantly, that our agent achieves a score of 25.6% when playing anonymously with humans on a popular Diplomacy website, compared to an average human score of 14.3%.
  • It remains to be seen whether similar search techniques can be developed for variants of Diplomacy that allow for coordination between agents.
Results
  • Using the techniques described in Section 3, the authors developed an agent they call SearchBot.
  • The authors evaluate SearchBot in two ways: the first evaluates its head-to-head performance against the population of human players on a popular Diplomacy website, as well as against prior AI agents.
  • The second measures the exploitability of SearchBot.
  • The authors had SearchBot anonymously play no-press Diplomacy games on the popular Diplomacy website webdiplomacy.net.
  • Since there are 7 players in each game, average human performance is a score of 1/7 ≈ 14.3%.
  • SearchBot scored 25.6% ± 4.8%, where ± denotes one standard error (a minimal sketch of this calculation appears after this list). The agent's performance is shown in Table 2 and a detailed breakdown is presented in Table 5 in Appendix F.
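For clarity, the "± 4.8%" is one standard error of the mean per-game SoS score. A minimal way to compute both numbers from a list of per-game scores (the scores below are made up for illustration, not the paper's data):

```python
import numpy as np

# Hypothetical per-game SoS scores in [0, 1]; the real figures come from the
# agent's anonymous games on webdiplomacy.net.
sos_scores = np.array([1.0, 0.0, 0.31, 0.0, 0.55, 0.0, 0.12])

mean = sos_scores.mean()
std_error = sos_scores.std(ddof=1) / np.sqrt(len(sos_scores))
print(f"average SoS: {mean:.1%} ± {std_error:.1%}")
```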
Conclusion
  • No-press Diplomacy is a complex game involving both cooperation and competition that poses major theoretical and practical challenges for past AI techniques.
  • Developing search techniques that scale more effectively with the depth of the game tree may lead to substantial improvements in performance.
  • Another direction is combining the search technique with reinforcement learning.
  • Combining search with reinforcement learning has led to tremendous success in perfect-information games (Silver et al., 2018) and more recently in two-player zero-sum imperfect-information games as well (Brown et al., 2020).
  • It remains to be seen whether similar search techniques can be developed for variants of Diplomacy that allow for coordination between agents.
Tables
  • Table 1: Effect of model and training data changes on supervised model quality. We measure policy accuracy as well as the average SoS score achieved by each agent against six copies of the original DipNet model. We measure the SoS scores in two settings: with all 7 agents sampling orders at a temperature of either 0.5 or 0.1.
  • Table 2: Average SoS score of our agent in anonymous games against humans on webdiplomacy.net. Average human performance is 14.3%. Score in the case of draws was determined by the rules of the joined game. The ± shows one standard error.
  • Table 3: Comparison of average sum-of-squares scores for our agent (SearchBot) in 1v6 games with DipNet agents from Paquette et al. (2019), as well as our own blueprint imitation learning agent. All agents other than SearchBot sample orders at a temperature of 0.1 (temperature sampling is illustrated in the sketch after this list).
  • Table 4: Average SoS score of one expert human playing against six bots under repeated play. A score less than 14.3% means the human is unable to exploit the bot. Five games were played for each power for each agent, for a total of 35 games per agent. For each power, the human first played all games against DipNet, then the blueprint model described in Section 3.1, and then finally SearchBot.
  • Table 5: Average SoS score of our agent in anonymous games against humans on webdiplomacy.net. Average human performance is 14.3%. Score in the case of draws was determined by the rules of the joined game. The ± shows one standard error. Average human performance was calculated based on SoS scoring of historical games on webdiplomacy.net.
  • Table 6: Average SoS score of one expert human playing against six bots under repeated play. A score less than 14.3% suggests the human is unable to exploit the bot. Five games were played for each power for each agent, for a total of 35 games per agent. For each power, the human first played all games against DipNet, then the blueprint model described in Section 3.1, and then finally SearchBot.
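Several tables refer to agents "sampling orders at a temperature" of 0.5 or 0.1. This is standard temperature-scaled sampling from the policy's distribution over candidate orders; the sketch below is a generic illustration with made-up logits, not the paper's code.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature).

    Lower temperatures concentrate probability mass on the highest-scoring
    orders; as temperature approaches 0, sampling approaches argmax.
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
example_logits = [2.1, 1.7, 0.3]  # hypothetical scores for three candidate orders
print(sample_with_temperature(example_logits, temperature=0.1, rng=rng))
```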
Additional Findings
  • We do not observe improved performance for rolling out farther than 3 or 4 movement phases (a generic sketch of this rollout-based value estimate appears after this list).
  • SearchBot achieves its highest 1v6 score when matched against its own blueprint, since it is most accurately able to approximate the behavior of that agent. It outperforms all three baseline agents by a large margin, and none of the three baselines is able to achieve a score of more than 1% against our search agent.
  • To control for the uneven mix of powers the bot was assigned, we also report the bot's performance when each of the 7 powers is weighted equally. Its score in this case increases to 27.0% ± 5.3%.
  • A score below 14.3% means the humans are losing on average.
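The first bullet above concerns how candidate actions are valued during search: a joint action is played and the blueprint policy then controls all powers for a bounded number of movement phases before the reached position is scored. The sketch below is only a schematic of such a Monte Carlo rollout estimator; `env`, `blueprint`, and `score_fn` are interfaces invented for this illustration, not the paper's published API.

```python
def rollout_value(env, blueprint, score_fn, player, joint_action,
                  max_movement_phases=3, num_rollouts=4):
    """Schematic Monte Carlo estimate of `player`'s value for `joint_action`.

    Assumed interfaces (invented for this sketch, not the paper's codebase):
      env.clone() / env.step(joint) / env.done / env.is_movement_phase / env.state
      blueprint.sample_joint_action(state) -> one order set per power
      score_fn(state, player) -> proxy value of the reached position,
                                 e.g. its sum-of-squares share of centers
    """
    total = 0.0
    for _ in range(num_rollouts):
        sim = env.clone()
        sim.step(joint_action)
        phases = 0
        # Let the blueprint policy play for everyone for a few movement phases.
        while not sim.done and phases < max_movement_phases:
            sim.step(blueprint.sample_joint_action(sim.state))
            if sim.is_movement_phase:
                phases += 1
        total += score_fn(sim.state, player)
    return total / num_rollouts
```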
Scoring
A game may end in a draw on any turn if all remaining players agree. Draws are a common outcome among experienced players because players will often coordinate to prevent any individual from reaching 18 centers. The two most common scoring systems for draws are draw-size scoring (DSS), in which all surviving players equally split a win, and sum-of-squares scoring (SoS), in which player i receives a score of C_i^2 / Σ_j C_j^2, where C_j is the number of supply centers controlled by player j (a worked example follows below).
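A minimal worked example of the two scoring rules just described (the center counts are made up):

```python
def final_scores(centers, draw_survivors=None):
    """Sum-of-squares (SoS) and draw-size (DSS) scores for one finished game.

    `centers` maps each power to its final supply-center count.  Under SoS,
    power i receives centers[i]**2 / sum_j centers[j]**2; under DSS, the
    surviving powers in a draw split the win equally.
    """
    total_sq = sum(c * c for c in centers.values())
    sos = {p: (c * c) / total_sq for p, c in centers.items()}
    survivors = draw_survivors or [p for p, c in centers.items() if c > 0]
    dss = {p: (1.0 / len(survivors) if p in survivors else 0.0) for p in centers}
    return sos, dss

# Hypothetical three-way draw: France holds 14 centers, England 12, Turkey 8.
example = {"France": 14, "England": 12, "Turkey": 8, "Russia": 0,
           "Germany": 0, "Italy": 0, "Austria": 0}
sos, dss = final_scores(example)
print(f"France: SoS {sos['France']:.1%}, DSS {dss['France']:.1%}")  # ~48.5% vs ~33.3%
```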

References
  • Thomas Anthony. Ghost-ratings, 2020. URL https://sites.google.com/view/webdipinfo/ghost-ratings.
  • Thomas Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas C Hudson, Nicolas Porcel, Marc Lanctot, Julien Pérolat, Richard Everett, et al. Learning to play no-press diplomacy with best response policy iteration. arXiv preprint arXiv:2006.04635, 2020.
  • Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • David Blackwell et al. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
  • Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, pp. eaao1733, 2017.
  • Noam Brown and Tuomas Sandholm. Solving imperfect-information games via discounted regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 1829–1836, 2019a.
  • Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, pp. eaay2400, 2019b.
  • Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. In International Conference on Machine Learning, pp. 793–802, 2019.
  • Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. Combining deep reinforcement learning and search for imperfect-information games. arXiv preprint arXiv:2007.13544, 2020.
  • Murray Campbell, A Joseph Hoane Jr, and Feng-hsiung Hsu. Deep Blue. Artificial Intelligence, 134(1-2):57–83, 2002.
  • Arpad E Elo. The rating of chessplayers, past and present. Arco Pub., 1978.
  • Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
  • André Ferreira, Henrique Lopes Cardoso, and Luis Paulo Reis. DipBlue: A diplomacy agent with strategic and trust reasoning. In ICAART International Conference on Agents and Artificial Intelligence, 2015.
  • Brandon Fogel. To whom tribute is due: The next step in scoring systems, 2020. URL http://windycityweasels.org/wp-content/uploads/2020/04/2020-03-To-Whom-Tribute-Is-Due-The-Next-Step-in-Scoring-Systems.
  • Amy Greenwald, Keith Hall, and Roberto Serrano. Correlated Q-learning. In ICML, volume 20, pp. 242, 2003.
  • James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
  • Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
  • Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121, 2016.
  • Ralf Herbrich, Tom Minka, and Thore Graepel. TrueSkill™: A Bayesian skill rating system. In Advances in Neural Information Processing Systems, pp. 569–576, 2007.
  • Ronald A Howard. Dynamic programming and Markov processes. 1960.
  • Junling Hu and Michael P Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.
  • Stefan J Johansson and Fredrik Håård. Tactical coordination in no-press diplomacy. In International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 423–430, 2005.
  • Sarit Kraus and Daniel Lehmann. Diplomat, an agent in a multi agent environment: An overview. In IEEE International Performance Computing and Communications Conference, pp. 434–435. IEEE Computer Society, 1988.
  • Sarit Kraus and Daniel Lehmann. Designing and building a negotiating automated agent. Computational Intelligence, 11(1):132–171, 1995.
  • Sarit Kraus, Eithan Ephrati, and Daniel Lehmann. Negotiation in a non-cooperative environment. Journal of Experimental & Theoretical Artificial Intelligence, 3(4):255–281, 1994.
  • Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems, pp. 1078–1086, 2009.
  • Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037, 2017.
  • Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068, 2017.
  • Adam Lerer and Alexander Peysakhovich. Learning existing social conventions via observationally augmented self-play. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 107–114. ACM, 2019.
  • Adam Lerer, Hengyuan Hu, Jakob Foerster, and Noam Brown. Improving policies via search in cooperative partially observable games. In AAAI Conference on Artificial Intelligence, 2020.
  • Michael L Littman. Friend-or-foe Q-learning in general-sum games. In ICML, volume 1, pp. 322–328, 2001.
  • Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
  • Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
  • John Nash. Non-cooperative games. Annals of Mathematics, pp. 286–295, 1951.
  • Philip Paquette, Yuchen Lu, Seton Steven Bocco, Max Smith, O-G Satya, Jonathan K Kummerfeld, Joelle Pineau, Satinder Singh, and Aaron C Courville. No-press diplomacy: Modeling multi-agent gameplay. In Advances in Neural Information Processing Systems, pp. 4474–4485, 2019.
  • Yoav Shoham, Rob Powers, and Trond Grenager. Multi-agent reinforcement learning: A critical survey. Web manuscript, 2, 2003.
  • David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
  • David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
  • Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, pp. 2989–2997, 2015.
  • Gerald Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
  • Jason van Hal. Diplomacy AI - Albert, 2013. URL https://sites.google.com/site/diplomacyai/.
  • Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pp. 1729–1736, 2008.