Learning to Play Sequential Games versus Unknown Opponents

NeurIPS 2020.


Abstract:

We consider a repeated sequential game between a learner, who plays first, and an opponent who responds to the chosen action. We seek to design strategies for the learner to successfully interact with the opponent. While most previous approaches consider known opponent models, we focus on the setting in which the opponent's model is unknown. […]

Introduction
  • Several important real-world problems involve sequential interactions between two parties.
  • These problems can often be modeled as two-player games, where the first player chooses a strategy and the second player responds to it.
  • An additional challenge for the learner in such repeated games lies in facing a potentially different type of opponent at every game round.
  • The learner can even face an adversarially chosen sequence of opponent/attacker types [3].
Highlights
  • Several important real-world problems involve sequential interactions between two parties
  • We have considered the problem of learning to play repeated sequential games versus unknown opponents
  • We have proposed an online algorithm for the learner, when facing adversarial opponents, that attains sublinear regret guarantees by imposing kernel-based regularity assumptions on the opponents’ response function
  • We have shown that our approach can be specialized to repeated Stackelberg games and demonstrated its applicability in experiments from traffic routing and wildlife conservation
  • Our approach is motivated by sequential decision-making problems that arise in several domains such as road traffic, markets, and security applications with potentially significant societal benefits
  • It is important that the integrity and the reliability of such data are verified, and that the algorithms used are complemented with suitable measures that ensure the safety of the system at any point in time.
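The sublinear-regret highlight rests on kernel-based (RKHS) regularity of the opponents' response function, which makes Gaussian-process-style estimates of unobserved responses possible. Below is a minimal sketch of such a surrogate, assuming a squared-exponential kernel and illustrative hyperparameters; it is not the paper's implementation, only the kind of model these regularity assumptions enable:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xstar, noise=1e-2, lengthscale=1.0):
    """GP regression: posterior mean/std at Xstar given observations (X, y).

    X could stack (learner action, opponent context) features, and y the
    observed opponent responses; both are illustrative placeholders here.
    """
    K = rbf(X, X, lengthscale) + noise * np.eye(len(X))
    Ks = rbf(Xstar, X, lengthscale)
    Kss = rbf(Xstar, Xstar, lengthscale)
    alpha = np.linalg.solve(K, y)          # K^{-1} y
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std
```

An optimistic (UCB-style) strategy would then score candidate actions by `mean + beta * std`, so that poorly explored responses are tried rather than ignored.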
Methods
  • The authors evaluate the proposed algorithms in traffic routing and wildlife conservation tasks.

    4.1 Routing Vehicles in Congested Traffic Networks

    The authors use the road traffic network of Sioux-Falls [20], which can be represented as a directed graph with 24 nodes and 76 edges e ∈ E.
  • The authors consider the traffic routing task in which the goal of the network operator is to route 300 units of demand between two designated nodes of the network.
  • The goal of the operator is to avoid the network becoming overly congested.
  • The routing plan chosen by the network operator can be represented by a vector xt of flows over the network's edges.
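The operator's congestion objective can be made concrete with the standard Bureau of Public Roads (BPR) latency model commonly used with the Sioux-Falls benchmark [20]. The two-edge toy network and coefficients below are illustrative, not the paper's exact setup:

```python
import numpy as np

def bpr_latency(flow, free_time, capacity, alpha=0.15, beta=4.0):
    """BPR travel-time function: latency grows steeply once flow nears capacity."""
    return free_time * (1.0 + alpha * (flow / capacity) ** beta)

def total_travel_time(flows, free_times, capacities):
    """Aggregate congestion cost of a routing plan (one flow value per edge)."""
    times = bpr_latency(flows, free_times, capacities)
    return float(np.sum(flows * times))

# Toy example: 300 units over two identical parallel edges.
free = np.array([1.0, 1.0])
cap = np.array([150.0, 150.0])
concentrated = total_travel_time(np.array([300.0, 0.0]), free, cap)   # 1020.0
split = total_travel_time(np.array([150.0, 150.0]), free, cap)        # 345.0
```

Splitting the demand evenly yields far less total travel time than concentrating it on one edge, which is exactly the kind of congestion the operator seeks to avoid.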
Conclusion
  • The authors have considered the problem of learning to play repeated sequential games versus unknown opponents.
  • The authors' approach is motivated by sequential decision-making problems that arise in several domains such as road traffic, markets, and security applications with potentially significant societal benefits.
  • In such domains, it is important to predict how the system responds to any given decision and take this into account to achieve the desired performance.
  • It is important that the integrity and the reliability of such data are verified, and that the algorithms used are complemented with suitable measures that ensure the safety of the system at any point in time.
Summary
  • Objectives:

    The authors' goal is to bound the learner's cumulative regret R(T) = max_{x∈X} Σ_{t=1}^T r(x, bθt(x)) − Σ_{t=1}^T r(xt, bθt(xt)), i.e., the gap between the cumulative reward of the best fixed action in hindsight and the reward actually accumulated, where bθt denotes the response function of the opponent of type θt faced at round t.
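The regret objective can be evaluated directly once the counterfactual rewards r(x, bθt(x)) are known for every action at every round; a minimal sketch (the array layout is illustrative):

```python
import numpy as np

def cumulative_regret(rewards, played):
    """Regret of a sequence of played actions vs. the best fixed action.

    rewards: (T, K) array; rewards[t, a] is the learner's reward at round t
             had action a been played, already accounting for the opponent's
             response at that round.
    played:  length-T array of the action indices actually played.
    """
    T = rewards.shape[0]
    earned = rewards[np.arange(T), played].sum()
    best_fixed = rewards.sum(axis=0).max()   # best single action in hindsight
    return float(best_fixed - earned)
```

For example, with rewards [[1, 0], [1, 0], [0, 1]] the best fixed action is 0 (total reward 2), so always playing action 1 in the first two rounds and action 0 in the last incurs regret 2.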
Related work
  • Most previous works consider sequential games where the goal is to play against a single type of opponent. Authors of [21] and [27] show that an optimal strategy for the learner can be obtained by observing a polynomial number of opponent's responses. In security applications, methods by [33] and [18] learn the opponent's response function by using PAC-based and decision-tree behavioral models, respectively. Recently, single opponent modeling has also been studied in the context of deep reinforcement learning, e.g., [13, 29, 35, 12]. While all these approaches exhibit good empirical performance, they do not consider multiple types of opponents and lack regret guarantees.

    Playing against multiple types of opponents has been considered in Bayesian Stackelberg games [26, 15, 24], where the opponent’s types are drawn from a known probability distribution. In [4], the authors propose no-regret algorithms when opponents’ behavioral models are available to the learner. In this work, we make no such distributional or availability assumptions, and our results hold for adversarially selected sequences of opponent’s types. This is similar to the work [3], in which the authors propose a no-regret online learning algorithm to play repeated Stackelberg games [37]. In contrast, we consider a more challenging setting in which opponents’ utilities are unknown and focus on learning the opponent’s response function from observing the opponent’s responses.
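No-regret play against an adversarially selected sequence of opponent types is typically built on multiplicative-weights updates such as Hedge [11]. A minimal full-information sketch, with an illustrative learning rate (this is a generic building block, not the paper's algorithm):

```python
import numpy as np

def hedge(reward_rounds, eta=0.5):
    """Hedge / multiplicative weights over K actions with full feedback.

    reward_rounds: sequence of length-K reward vectors in [0, 1].
    Returns the probability distribution used at each round.
    """
    K = len(reward_rounds[0])
    w = np.ones(K)                       # one weight per action
    dists = []
    for r in reward_rounds:
        p = w / w.sum()                  # play proportionally to weights
        dists.append(p)
        w = w * np.exp(eta * np.asarray(r))  # reward (gain) form of the update
    return dists
```

Run on rounds where one action is consistently better, the distribution starts uniform and quickly concentrates on that action; bandit-feedback variants such as Exp3 [2] follow the same template with importance-weighted reward estimates.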
Funding
  • This work was gratefully supported by the Swiss National Science Foundation, under the grant SNSF 200021_172781, by the European Union’s ERC grant 815943, and the ETH Zürich Postdoctoral Fellowship 19-2 FEL-47
Study subjects and analysis
  • We let the type vector θt ∈ R^552_{≥0} represent the demand profile of the network users at round t, where each entry indicates the number of users who want to travel between one of the 552 origin–destination pairs of nodes in the network.

References
  • Yasin Abbasi-Yadkori. Online learning for linearly parametrized control problems. 2013.
  • Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput., 32(1):48–77, January 2003.
  • Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Commitment Without Regrets: Online Learning in Stackelberg Security Games. In ACM Conference on Economics and Computation (EC), 2015.
  • Lorenzo Bisi, Giuseppe De Nittis, Francesco Trovò, Marcello Restelli, and Nicola Gatti. Regret Minimization Algorithms for the Followers Behaviour Identification in Leadership Games. In Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
  • Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Learning Optimal Commitment to Overcome Insecurity. In Conference on Neural Information Processing Systems (NeurIPS), 2014.
  • Andreea Bobu, Dexter R. R. Scobee, Jaime F. Fisac, S. Shankar Sastry, and Anca D. Dragan. LESS is More: Rethinking Probabilistic Models of Human Behavior. 2020.
  • Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • Sayak Ray Chowdhury and Aditya Gopalan. On Kernelized Multi-armed Bandits. In International Conference on Machine Learning (ICML), 2017.
  • Nando de Freitas, Alex Smola, and Masrour Zoghi. Regret bounds for deterministic Gaussian process bandits. arXiv, abs/1203.2177, 2012.
  • Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel Programming for Hyperparameter Optimization and Meta-Learning. arXiv, abs/1806.04910, 2018.
  • Yoav Freund and Robert E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci., 55(1):119–139, 1997.
  • Víctor Gallego, Roi Naveiro, David Ríos Insua, and David Gómez-Ullate Oteiza. Opponent Aware Reinforcement Learning. arXiv, abs/1908.08773, 2019.
  • He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé. Opponent Modeling in Deep Reinforcement Learning. In International Conference on Machine Learning (ICML), 2016.
  • Xiuli He, Ashutosh Prasad, Suresh P. Sethi, and Genaro J. Gutierrez. A survey of Stackelberg differential game models in supply and marketing channels. Journal of Systems Science and Systems Engineering, 16(4):385–413, 2007.
  • Manish Jain, Christopher Kiekintveld, and Milind Tambe. Quality-bounded solutions for finite Bayesian Stackelberg games: scaling up. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2011.
  • Manish Jain, Dmytro Korzhyk, Ondrej Vanek, Vincent Conitzer, Michal Pechoucek, and Milind Tambe. A Double Oracle Algorithm for Zero-Sum Security Games on Graphs. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2011.
  • Debarun Kar, Fei Fang, Francesco Maria Delle Fave, Nicole D. Sintov, and Milind Tambe. "A Game of Thrones": When Human Behavior Models Compete in Repeated Stackelberg Security Games. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2015.
  • Debarun Kar, Benjamin J. Ford, Shahrzad Gholami, Fei Fang, Andrew J. Plumptre, Milind Tambe, Margaret Driciru, Fred Wanyama, Aggrey Rwetsiba, Mustapha Nsubaga, and Joshua Mabonga. Cloudy with a Chance of Poaching: Adversary Behavior Modeling and Forecasting with Real-World Poaching Data. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2017.
  • Yannis A. Korilis, Aurel A. Lazar, and Ariel Orda. Achieving Network Optima Using Stackelberg Routing Strategies. IEEE/ACM Trans. Netw., 5(1):161–173, 1997.
  • Larry J. LeBlanc, Edward K. Morlok, and William P. Pierskalla. An efficient approach to solving the road network equilibrium traffic assignment problem. Transportation Research, 9:309–318, 1975.
  • Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and Approximating the Optimal Strategy to Commit To. In International Symposium on Algorithmic Game Theory (SAGT), 2009.
  • D. Liao-McPherson, M. Huang, and I. Kolmanovsky. A Regularized and Smoothed Fischer–Burmeister Method for Quadratic Programming With Applications to Model Predictive Control. IEEE Transactions on Automatic Control, 64(7):2937–2944, 2019.
  • N. Littlestone and M. K. Warmuth. The Weighted Majority Algorithm. Information and Computation, 108(2):212–261, 1994.
  • Janusz Marecki, Gerry Tesauro, and Richard Segal. Playing Repeated Stackelberg Games with Unknown Opponents. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2012.
  • Thanh H. Nguyen, Rong Yang, Amos Azaria, Sarit Kraus, and Milind Tambe. Analyzing the Effectiveness of Adversary Modeling in Security Games. In AAAI Conference on Artificial Intelligence, 2013.
  • Praveen Paruchuri, Jonathan P. Pearce, Janusz Marecki, Milind Tambe, Fernando Ordóñez, and Sarit Kraus. Playing games for security: an efficient exact algorithm for solving Bayesian Stackelberg games. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2008.
  • Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning Optimal Strategies to Commit To. In AAAI Conference on Artificial Intelligence, 2019.
  • James Pita, Manish Jain, Fernando Ordóñez, Christopher Portway, Milind Tambe, Craig Western, Praveen Paruchuri, and Sarit Kraus. Using Game Theory for Los Angeles Airport Security. AI Magazine, 30:43–57, 2009.
  • Roberta Raileanu, Emily L. Denton, Arthur Szlam, and Rob Fergus. Modeling Others using Oneself in Multi-Agent Reinforcement Learning. arXiv, abs/1802.09640, 2018.
  • Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2004.
  • Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, and Andreas Krause. No-Regret Learning in Unknown Games with Correlated Payoffs. In Conference on Neural Information Processing Systems (NeurIPS), 2019.
  • Ankur Sinha, Pekka Malo, and Kalyanmoy Deb. A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2):276–295, 2017.
  • Arunesh Sinha, Debarun Kar, and Milind Tambe. Learning Adversary Behavior in Security Games: A PAC Model Perspective. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2016.
  • Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.
  • Zheng Tian, Ying Wen, Zhichen Gong, Faiz Punakkath, Shihao Zou, and Jun Wang. A Regularized Opponent Model with Maximum Entropy Objective. In International Joint Conference on Artificial Intelligence (IJCAI), 2019.
  • Transportation Networks for Research Core Team. Transportation Networks for Research. https://github.com/bstabler/TransportationNetworks.
  • H. von Stackelberg. Marktform und Gleichgewicht. Die Handelsblatt-Bibliothek "Klassiker der Nationalökonomie". J. Springer, 1934.
  • Rong Yang, Benjamin J. Ford, Milind Tambe, and Andrew Lemieux. Adaptive resource allocation for wildlife protection against illegal poachers. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2014.