Learning Expensive Coordination: An Event-Based Deep RL Approach

Runsheng Yu
Xinrun Wang
Youzhi Zhang
Hanjiang Lai

ICLR, 2020.


Abstract:

Existing works in deep Multi-Agent Reinforcement Learning (MARL) mainly focus on coordinating cooperative agents to complete certain tasks jointly. However, in many real-world cases, agents are self-interested, such as employees in a company and clubs in a league. Therefore, the leader, i.e., the manager of the company or the league…

Introduction
  • Deep Multi-Agent Reinforcement Learning (MARL) has been widely used in coordinating cooperative agents to jointly complete certain tasks where the agent is assumed to be selfless, i.e., the agent is willing to sacrifice itself to maximize the team reward.
  • For example, forcing drivers to contribute selflessly may increase the company's income in the short term, but it makes the company inefficient and unsustainable in the long run, because unsatisfied drivers may become demotivated and even leave the company
  • Another important example is that the government may want companies to invest in impoverished areas to improve social fairness, which inevitably reduces the companies' profits
  • We solve the large-scale sequential expensive coordination problem with a novel RL training scheme
Highlights
  • Deep Multi-Agent Reinforcement Learning (MARL) has been widely used in coordinating cooperative agents to jointly complete certain tasks where the agent is assumed to be selfless, i.e., the agent is willing to sacrifice itself to maximize the team reward
  • We address the two issues, the leader's long-term decision process and the complex interactions between the leader and followers, with three key steps: (a) we model the leader's decision-making process as a semi-Markov Decision Process and propose a novel event-based policy gradient so that the leader takes actions only at important time steps, avoiding myopic policies; (b) to accurately predict followers' behaviors, we construct a follower-aware module based on the leader-follower consistency, including a novel follower-specific attention mechanism and a sequential decision module, so that the leader can predict followers' behaviors precisely and respond to them accurately; and (c) we propose an action abstraction-based policy gradient method for the followers, which simplifies their decision process, simplifies the interaction between the leader and followers, and accelerates the convergence of training
  • This paper proposes a novel RL training scheme for Stackelberg Markov Games with a single leader and multiple self-interested followers, which addresses the leader's long-term decision process and the complicated interactions between the leader and followers with three contributions
  • 1) To consider the long-term effect of the leader's behavior, we develop an event-based policy gradient for the leader's policy (see the sketch after this list)
  • 2) To predict the followers' behaviors and respond to them accurately, we exploit the leader-follower consistency to design a novel follower-aware module and a follower-specific attention mechanism
  • 3) We propose an action abstraction-based policy gradient algorithm to accelerate the training process of the followers
  • We highlight that Stackelberg Markov Games contribute to the RL community in three key aspects: 1) SMGs provide a new scheme that focuses on self-interested agents; 2) they offer hierarchical RL a non-cooperative training scheme between the high-level and low-level policies; and 3) our EBPG provides a novel policy gradient method for temporal abstraction structures
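As a rough illustration of the event-based idea in contribution 1), the following PyTorch-style sketch collects policy-gradient terms only at the time steps where the leader triggers an "event", i.e., actually changes the incentives it offers. Everything here (the two-head LeaderPolicy, ebpg_loss, and the tensors it expects) is a hypothetical minimal example, not the authors' implementation.

```python
# Minimal sketch of an event-based policy gradient (EBPG) for the leader.
# Illustrative only: the leader incurs a gradient for the incentive choice
# only at steps where it triggers an event (changes its incentives).

import torch
import torch.nn as nn
from torch.distributions import Bernoulli, Categorical

class LeaderPolicy(nn.Module):
    def __init__(self, state_dim: int, n_incentive_levels: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.event_head = nn.Linear(hidden, 1)                        # act now or not?
        self.incentive_head = nn.Linear(hidden, n_incentive_levels)   # which bonus level?

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        event_dist = Bernoulli(logits=self.event_head(h).squeeze(-1))
        incentive_dist = Categorical(logits=self.incentive_head(h))
        return event_dist, incentive_dist

def ebpg_loss(policy: LeaderPolicy, states, events, incentives, returns):
    """REINFORCE-style loss that only counts decision steps (events).

    states:     (T, state_dim) leader observations
    events:     (T,) 0/1 tensor, 1 where the leader changed its incentive
    incentives: (T,) long tensor of chosen incentive levels
    returns:    (T,) discounted returns from each step
    """
    event_dist, incentive_dist = policy(states)
    logp_event = event_dist.log_prob(events.float())
    logp_incentive = incentive_dist.log_prob(incentives)
    # Only event steps contribute the incentive-choice term; non-event steps
    # contribute only the cheap "do nothing" decision.
    logp = logp_event + events.float() * logp_incentive
    return -(logp * returns).mean()
```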
Methods
  • Robustness results in multi-bonus resource collections (see Table 1):

    Noise    Ours total incentive    M3RL total incentive    Ours total reward    M3RL total reward
    0%       18.32                   4.06                    10.06                -1.58
    30%      17.63                   3.85                    5.36                 -3.23
    50%      17.28                   4.02                    5.30                 -8.96

    Results on these tasks show that our method is sample-efficient and fast to converge.
  • 5.3 ROBUSTNESS
  • This experiment evaluates whether our method is robust to noise, i.e., to followers that take actions at random.
  • We conduct this experiment by injecting noise into the followers' decisions (a minimal sketch of this setup follows the list below).
  • We observe that the total reward of the baseline method decreases as the noise increases, while our method is more robust to the change.
  • For the incentive, we find that our method gains much more incentive than the state-of-the-art method, showing that our method coordinates the followers better than the state-of-the-art method
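A minimal sketch of the noise-injection protocol described above: with probability equal to the noise level (0%, 30%, or 50%), a follower ignores its policy and acts uniformly at random. The function and argument names are placeholders, not the paper's API.

```python
import random

def noisy_follower_action(follower_policy, observation, action_space, noise_level: float):
    """With probability `noise_level`, act uniformly at random; otherwise follow
    the learned policy. `follower_policy` and `action_space` (a list of discrete
    actions) are hypothetical placeholders for the robustness experiment."""
    if random.random() < noise_level:
        return random.choice(action_space)
    return follower_policy(observation)
```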
Results
  • The experiments cover several environments: (1) resource collections; (2) multi-bonus resource collections, based on (1), where the leader can choose among four bonus levels; (3) modified navigation, where followers are required to navigate to landmarks, and once a landmark is reached it disappears and a new landmark appears randomly; and a modified predator-prey task (a hedged sketch of the leader-follower interaction loop in such environments is given below).
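To make the leader-follower interaction concrete, here is a hedged sketch of one episode of a Stackelberg Markov Game such as multi-bonus resource collection: the leader commits to bonus levels first, and the self-interested followers respond. `env`, `leader`, and `followers` are hypothetical objects; the real environments and interfaces in the paper may differ.

```python
# Hypothetical interaction loop for a Stackelberg Markov Game: the leader
# commits to incentives first, then each self-interested follower responds.

BONUS_LEVELS = [0, 1, 2, 3]  # the leader can choose among four bonus levels

def run_episode(env, leader, followers, horizon: int = 100):
    state = env.reset()
    total_leader_reward = 0.0
    for _ in range(horizon):
        # Leader moves first: assign a bonus level to every follower.
        bonuses = {k: leader.choose_bonus(state, k, BONUS_LEVELS)
                   for k in range(len(followers))}
        # Followers observe the announced incentives and best-respond.
        actions = {k: f.act(state, bonuses[k]) for k, f in enumerate(followers)}
        state, leader_reward, follower_rewards, done = env.step(bonuses, actions)
        total_leader_reward += leader_reward
        if done:
            break
    return total_leader_reward
```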
Conclusion
  • REMARKS

    This paper proposes a novel RL training scheme for Stackelberg Markov Games with a single leader and multiple self-interested followers, which addresses the leader's long-term decision process and the complicated interactions between the leader and followers with three contributions. 1) To consider the long-term effect of the leader's behavior, we develop an event-based policy gradient for the leader's policy. 2) To predict the followers' behaviors and respond to them accurately, we exploit the leader-follower consistency to design a novel follower-aware module and follower-specific attention mechanism (a toy sketch of such an attention module follows this list). 3) We propose an action abstraction-based policy gradient algorithm to accelerate the training process of the followers.
  • SMGs provide a new scheme that focuses more on self-interested agents
  • We think this aspect is the most significant contribution to the RL community.
  • Our method also contributes to hierarchical RL: it provides a non-cooperative training scheme between the high-level policy and the low-level policy, which plays an important role when the followers are self-interested.
  • Our EBPG provides a novel policy gradient method for the temporal abstraction structure
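As a toy illustration of the follower-specific attention mentioned in contribution 2) above, the sketch below builds one query per follower and attends over all followers' encoded features, so the leader's prediction for each follower can condition on the others. This is an assumption-laden example, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FollowerAttention(nn.Module):
    """Toy follower-specific attention: one query per follower attends over
    all followers' encoded features. Illustrative sketch only."""

    def __init__(self, feat_dim: int, attn_dim: int = 32):
        super().__init__()
        self.query = nn.Linear(feat_dim, attn_dim)
        self.key = nn.Linear(feat_dim, attn_dim)
        self.value = nn.Linear(feat_dim, attn_dim)

    def forward(self, follower_feats: torch.Tensor) -> torch.Tensor:
        # follower_feats: (n_followers, feat_dim)
        q = self.query(follower_feats)            # (n, d)
        k = self.key(follower_feats)              # (n, d)
        v = self.value(follower_feats)            # (n, d)
        scores = q @ k.t() / k.shape[-1] ** 0.5   # (n, n) follower-specific weights
        weights = F.softmax(scores, dim=-1)
        return weights @ v                        # (n, d) context per follower
```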
Tables
  • Table 1: Robustness results in multi-bonus resource collections
  • Table 2: Ablation results of RL-based followers for resource collections (a mark in a cell means the corresponding module is used)
Funding
  • This research is supported by NRF AISG-RP-2019-0013, NSOE-TSS2019-01, MOE, and NTU. This work is also supported by the National Natural Science Foundation of China under Grants U1611264, U1811261, 61602530, 61772567, U1811262, and U1711262
  • This work is also supported by the Pearl River Nova Program of Guangzhou (201906010080). We would like to thank Tianming Shu, Darren Chua, Suming Yu, Enrique Munoz de Cote, and Xu He for their kind suggestions and help
Reference
  • Andras Antos, Csaba Szepesvari, and Remi Munos. Fitted Q-iteration in continuous action-space MDPs. In NeurIPS, pp. 9–16, 2008.
  • Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pp. 1726–1734, 2017.
  • Raphen Becker, Shlomo Zilberstein, Victor Lesser, and Claudia V Goldman. Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research, 22:423–455, 2004.
  • Vivek S Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.
  • Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
  • Yash Chandak, Georgios Theocharous, James Kostas, Scott Jordan, and Philip S Thomas. Learning action representations for reinforcement learning. arXiv preprint arXiv:1902.00183, 2019.
  • Changan Chen, Yuejiang Liu, Sven Kreiss, and Alexandre Alahi. Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning. arXiv preprint arXiv:1809.08835, 2018.
  • Chi Cheng, Zhangqing Zhu, Bo Xin, and Chunlin Chen. A multi-agent reinforcement learning algorithm based on Stackelberg game. In DDCLS, pp. 727–732, 2017.
  • Christian Daniel, Herke Van Hoof, Jan Peters, and Gerhard Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, 104(2-3):337–357, 2016.
  • Tanner Fiez, Benjamin Chasnov, and Lillian J Ratliff. Convergence of learning dynamics in Stackelberg games. arXiv preprint arXiv:1906.01217, 2019.
  • Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In AAMAS, pp. 122–130, 2018.
  • FA Gers, J Schmidhuber, and F Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451, 2000.
  • Tarun Gupta, Akshat Kumar, and Praveen Paruchuri. Planning and learning for decentralized MDPs with event driven rewards. In AAAI, pp. 6186–6194, 2018.
  • He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daume III. Opponent modeling in deep reinforcement learning. In ICML, pp. 1804–1813, 2016.
  • Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. In ICML, pp. 2961–2970, 2019.
  • Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, pp. 267–274, 2002.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Jean-Jacques Laffont and David Martimort. The Theory of Incentives: The Principal-Agent Model. Princeton University Press, 2009.
  • Julien Laumonier and Brahim Chaib-draa. Multiagent Q-learning: Preliminary study on dominance between the Nash and Stackelberg equilibriums. In AAAI Workshop, 2005.
  • Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In NeurIPS, pp. 6379–6390, 2017.
  • David Mguni, Joel Jennings, Emilio Sison, Sergio Valcarcel Macua, Sofia Ceppi, and Enrique Munoz de Cote. Coordinating the crowd: Inducing desirable equilibria in non-cooperative systems. In AAMAS, pp. 386–394, 2019.
  • Fei Miao, Shuo Han, Shan Lin, John A Stankovic, Desheng Zhang, Sirajum Munir, Hua Huang, Tian He, and George J Pappas. Taxi dispatch with real-time sensing data in metropolitan areas: A receding horizon control approach. IEEE Transactions on Automation Science and Engineering, 13(2):463–478, 2016.
  • Noam Nisan and Amir Ronen. Algorithmic mechanism design. Games and Economic Behavior, 35(1-2):166–196, 2001.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
  • Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew Botvinick. Machine theory of mind. In ICML, pp. 4215–4224, 2018.
  • Regis Sabbadin and Anne-France Viet. A tractable leader-follower MDP model for animal disease management. In AAAI, pp. 1320–1326, 2013.
  • Regis Sabbadin and Anne-France Viet. Leader-follower MDP models with factored state space and many followers: Followers abstraction, structured dynamics and state aggregation. In ECAI, pp. 116–124, 2016.
  • Weiran Shen, Binghui Peng, Hanpeng Liu, Michael Zhang, Ruohan Qian, Yan Hong, Zhi Guo, Zongyao Ding, Pengjun Lu, and Pingzhong Tang. Reinforcement mechanism design, with applications to dynamic pricing in sponsored search auctions. arXiv preprint arXiv:1711.10279, 2017.
  • Tianmin Shu and Yuandong Tian. M3RL: Mind-aware multi-agent management reinforcement learning. In ICLR, 2019.
  • Matthew Smith, Herke van Hoof, and Joelle Pineau. An inference-based policy gradient method for learning options. In ICML, pp. 4710–4719, 2018.
  • Richard S Sutton and Andrew G Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
  • Richard S Sutton, Doina Precup, and Satinder P Singh. Intra-option learning about temporally abstract actions. In ICML, pp. 556–564, 1998.
  • Pingzhong Tang. Reinforcement mechanism design. In IJCAI, pp. 26–30, 2017.
  • Kurian Tharakunnel and Siddhartha Bhattacharyya. Leader-follower semi-Markov decision problems: Theoretical framework and approximate solution. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 111–118, 2007.
  • Karl Tuyls, Julien Perolat, Marc Lanctot, Joel Z Leibo, and Thore Graepel. A generalised method for empirical game theoretic analysis. In AAMAS, pp. 77–85, 2018.
  • Utkarsh Upadhyay, Abir De, and Manuel Gomez Rodriguez. Deep reinforcement learning of marked temporal point processes. In NeurIPS, pp. 3168–3178, 2018.
  • Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, et al. Strategic attentive writer for learning macro-actions. In NeurIPS, pp. 3486–3494, 2016.
  • Aaron Wilson, Alan Fern, Soumya Ray, and Prasad Tadepalli. Learning and transferring roles in multi-agent reinforcement. In Proc. AAAI-08 Workshop on Transfer Learning for Complex Tasks, 2008.
  • Shangtong Zhang and Shimon Whiteson. DAC: The double actor-critic architecture for learning options. arXiv preprint arXiv:1904.12691, 2019.
  • Yan Zheng, Zhaopeng Meng, Jianye Hao, Zongzhang Zhang, Tianpei Yang, and Changjie Fan. A deep Bayesian policy reuse approach against non-stationary agents. In NeurIPS, pp. 954–964, 2018.
  • The equation above follows exactly from the REINFORCE trick (Sutton & Barto, 1998) and the chain rule of derivatives. The approximation indicates that one trajectory has only one $A_T$. Also, based on the definitions of $e^k_i$ and $e^k_j$, the equation can be rewritten in a more compact form.
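For reference, the identity the note above relies on is the standard log-derivative (REINFORCE) trick; the notation below is generic rather than the paper's.

```latex
% Standard log-derivative (REINFORCE) identity, generic notation.
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right],
\qquad
\log \pi_\theta(\tau) = \sum_{t} \log \pi_\theta(a_t \mid s_t).
```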
  • We find that Upadhyay et al. (2018) also use this approximation, but with a different explanation.
  • Remark 1. Some existing work also focuses on event-based RL, but either on single-agent continuous-time settings (Upadhyay et al., 2018) or on reward representation (Gupta et al., 2018). We are the first to develop and implement an event-based policy gradient in the multi-agent setting.
  • This lemma is similar to the assumption in (Mguni et al., 2019); we prove it rather than assume it. Lemma 2. If Assumption 1 is satisfied, the following inequality is established:
  • where $C' = (1 - \gamma)^{-1} C$. The last step follows from Assumption 1 and the geometric-series inequality $\|(I - \gamma P^{\pi})^{-1}\| \le (1 - \gamma)^{-1}$. Some parts follow the same logic as (Bacon et al., 2017; Mguni et al., 2019; Kakade & Langford, 2002).
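The inequality quoted above follows from a standard Neumann-series argument, sketched here under the usual assumptions that $P^{\pi}$ is a row-stochastic transition matrix (so $\|P^{\pi}\|_\infty = 1$) and $0 \le \gamma < 1$.

```latex
% Neumann-series bound used above (standard argument, generic notation).
(I - \gamma P^{\pi})^{-1} = \sum_{t=0}^{\infty} (\gamma P^{\pi})^{t}
\quad\Longrightarrow\quad
\left\| (I - \gamma P^{\pi})^{-1} \right\|_\infty
  \le \sum_{t=0}^{\infty} \gamma^{t} \left\| P^{\pi} \right\|_\infty^{t}
  = \frac{1}{1-\gamma}.
```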
  • We adopt the idea of successor representation (Rabinowitz et al., 2018; Shu & Tian, 2019) to construct two expected baseline functions, the gain baseline $\phi_g(c_t)$ and $\phi_b(c_t)$.
  • We also leverage imitation learning to learn the action probability function $p^k(a^k_t \mid s_t, h^k_t; \theta_I)$, similar to (Shu & Tian, 2019), where $\theta_I$ denotes the parameters of the follower-aware module, trained with an imitation loss $\mathcal{L}_{IL}$.
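A generic behaviour-cloning sketch of such an imitation objective: minimise the cross-entropy (equivalently, maximise the log-likelihood) of the followers' observed actions under the predicted distribution. The exact form of $\mathcal{L}_{IL}$ in the paper may differ.

```python
import torch
import torch.nn.functional as F

def imitation_loss(predicted_logits: torch.Tensor, observed_actions: torch.Tensor) -> torch.Tensor:
    """Behaviour-cloning style loss for a follower-aware module (generic sketch).

    predicted_logits: (batch, n_actions) logits for the predicted follower policy
    observed_actions: (batch,) integer actions actually taken by the follower
    """
    return F.cross_entropy(predicted_logits, observed_actions)
```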
  • Figure (a): Resource Collections & Multi-bonus Resource Collections. This figure is inspired by (Shu & Tian, 2019).