Asynchronous Coagent Networks

James Kostas
Chris Nota
Philip S. Thomas

ICML, pp. 5426-5435, 2020.

Abstract:

Coagent policy gradient algorithms (CPGAs) are reinforcement learning algorithms for training a class of stochastic neural networks called coagent networks. In this work, we prove that CPGAs converge to locally optimal policies. Additionally, we extend prior theory to encompass asynchronous and recurrent coagent networks. These extensions...

Introduction
  • Reinforcement learning (RL) policies are often represented by stochastic neural networks (SNNs).
  • In this paper the authors study the problem of deriving learning rules for RL agents with SNN policies.
  • Coagent networks are one formulation of SNN policies for RL agents (Thomas & Barto, 2011).
  • St, At, and Rt are the state, action, and reward at time t, and are random variables that take values in S, A, and R, respectively.
  • An episode is a sequence of states, actions, and rewards, starting from t=0 and continuing indefinitely.
  • The authors assume that the discounted sum of rewards over an episode is finite (the corresponding objective is sketched below).
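    For reference, the standard episodic objective that this assumption keeps well defined can be written as follows (standard notation, not an excerpt from the paper):

        J(\theta) = \mathbf{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t} \,\middle|\, \theta \right], \qquad \gamma \in [0, 1],

    where \theta denotes the parameters of the coagent network and \gamma is the discount factor; CPGAs ascend estimates of \nabla J(\theta).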
Highlights
  • Reinforcement learning (RL) policies are often represented by stochastic neural networks (SNNs)
  • Coagent networks are composed of conjugate agents, or coagents; each coagent is an RL algorithm that learns and acts cooperatively with the other coagents in its network
  • We focus on the case where each coagent is a policy gradient RL algorithm, and call the resulting algorithms coagent policy gradient algorithms (CPGAs); a per-coagent update is sketched after this list
  • Using the option-critic framework as an example, we have shown that the Asynchronous Coagent Policy Gradient Theorem is a useful tool for analyzing arbitrary stochastic networks
  • We provide a formal and general proof of the coagent policy gradient theorem (CPGT) for stochastic policy networks, and extend it to the asynchronous and recurrent setting
  • Future work will focus on the potential for massive parallelization of asynchronous coagent networks, and on the potential for many levels of implicit temporal abstraction through varying coagent execution rates
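    The following is a minimal sketch of the per-coagent update idea behind CPGAs: each coagent follows its own local REINFORCE-style policy gradient, weighted by the return shared across the whole network. The Coagent class, the two-unit layout, and the toy one-step task below are illustrative assumptions, not the authors' implementation.

        import numpy as np

        class Coagent:
            """One coagent: a Bernoulli-logistic unit trained with a local
            REINFORCE-style update, weighted by the network-wide return."""

            def __init__(self, n_inputs, lr=0.1, rng=None):
                self.w = np.zeros(n_inputs + 1)  # +1 for a bias weight
                self.lr = lr
                self.rng = rng if rng is not None else np.random.default_rng(0)

            def act(self, x):
                x = np.append(x, 1.0)                  # bias feature
                p = 1.0 / (1.0 + np.exp(-self.w @ x))  # probability of emitting 1
                a = int(self.rng.random() < p)         # sample the stochastic output
                grad_logp = (a - p) * x                # d/dw log Pr(output = a)
                return a, grad_logp

            def update(self, grad_logp, G):
                # Local policy-gradient ascent, weighted by the shared return G.
                self.w += self.lr * G * grad_logp

        # Toy one-step task (purely illustrative): reward 1 if the network's
        # final output matches the first bit of the state, else 0.
        rng = np.random.default_rng(0)
        hidden = Coagent(n_inputs=2, rng=rng)  # sees the state
        output = Coagent(n_inputs=1, rng=rng)  # sees only the hidden coagent's output

        for episode in range(2000):
            s = rng.integers(0, 2, size=2).astype(float)
            h, g_h = hidden.act(s)
            a, g_a = output.act(np.array([float(h)]))
            G = 1.0 if a == int(s[0]) else 0.0   # shared episodic return
            hidden.update(g_h, G)                # each coagent updates using only
            output.update(g_a, G)                # locally available information

    The coagent policy gradient theorem referenced above is what justifies this kind of purely local update: in expectation, the local updates follow the gradient of the network-wide objective.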
Conclusion
  • The authors provide a formal and general proof of the coagent policy gradient theorem (CPGT) for stochastic policy networks, and extend it to the asynchronous and recurrent setting.
  • The authors empirically support the CPGT, and use the option-critic framework as an example to show how the approach facilitates and simplifies gradient derivation for arbitrary stochastic networks.
  • Future work will focus on the potential for massive parallelization of asynchronous coagent networks, and on the potential for many levels of implicit temporal abstraction through varying coagent execution rates; one illustrative reading of varying execution rates is sketched below
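    One way to picture "varying coagent execution rates" is sketched below, extending the Coagent class from the earlier sketch: a coagent that does not execute on a given step simply holds its previous output and contributes no log-probability gradient for that step. The fixed execution probability and the class layout are illustrative assumptions, not a restatement of the paper's formal asynchronous construction.

        import numpy as np

        class AsyncCoagent(Coagent):  # Coagent as defined in the earlier sketch
            """Illustrative asynchronous variant: the coagent re-samples its
            output only on steps where it executes; otherwise it repeats its
            previous output and contributes no gradient for that step."""

            def __init__(self, n_inputs, exec_prob=0.5, **kwargs):
                super().__init__(n_inputs, **kwargs)
                self.exec_prob = exec_prob  # assumed fixed execution rate
                self.prev_a = 0

            def act(self, x):
                if self.rng.random() < self.exec_prob:
                    a, grad_logp = super().act(x)  # executes: sample and record gradient
                    self.prev_a = a
                    return a, grad_logp
                # Does not execute: hold the previous output; no gradient this step.
                return self.prev_a, np.zeros_like(self.w)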
Related work
  • Klopf (1982) theorized that traditional models of classical and operant conditioning could be explained by modeling biological neurons as hedonistic, that is, seeking excitation and avoiding inhibition. The ideas motivating coagent networks bear a deep resemblance to Klopf’s proposal.

    Stochastic neural networks have applications dating back at least to Marvin Minsky’s stochastic neural analog reinforcement calculator, built in 1951 (Russell & Norvig, 2016). Research on stochastic learning automata continued this work (Narendra & Thathachar, 1989); one notable example is the adaptive reward-penalty learning rule for training stochastic networks (Barto, 1985). Similarly, Williams (1992) proposed the well-known REINFORCE algorithm with the intent of training stochastic networks. Since then, REINFORCE has primarily been applied to deterministic networks. However, Thomas (2011) proposed CPGAs for RL, building on the original intent of Williams (1992).
Funding
  • Research reported in this paper was sponsored in part by the CCDC Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA)
References
  • Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Barto, A. G. Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4(4):229–256, 1985.
  • Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
  • Guestrin, C., Lagoudakis, M., and Parr, R. Coordinated reinforcement learning. In ICML, volume 2, pp. 227–234, 2002.
  • Klopf, A. H. The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence. Hemisphere Publishing Corporation, 1982.
  • Liu, B., Singh, S., Lewis, R. L., and Qin, S. Optimal rewards for cooperative agents. IEEE Transactions on Autonomous Mental Development, 6(4):286–297, 2014.
  • Narendra, K. S. and Thathachar, M. A. Learning Automata: An Introduction. Prentice-Hall, Inc., 1989.
  • Riemer, M., Liu, M., and Tesauro, G. Learning abstract options. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), NeurIPS 31, pp. 10424–10434. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8243-learning-abstract-options.pdf.
  • Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.
  • Schulman, J., Heess, N., Weber, T., and Abbeel, P. Gradient estimation using stochastic computation graphs. In NeurIPS, pp. 3528–3536, 2015.
  • Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999.
  • Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., and White, A. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proc. of 10th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2011), pp. 761–768, 2011.
  • Thomas, P. S. Policy gradient coagent networks. In NeurIPS, pp. 1944–1952, 2011.
  • Thomas, P. S. and Barto, A. G. Conjugate Markov decision processes. In ICML, pp. 137–144, 2011.
  • Thomas, P. S. and Barto, A. G. Motor primitive discovery. In Proceedings of the IEEE Conference on Development and Learning and Epigenetic Robotics, pp. 1–8, 2012.
  • Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
  • Zhang, S. and Whiteson, S. DAC: The double actor-critic architecture for learning options. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), NeurIPS 32, pp. 2010–2020. Curran Associates, Inc., 2019.
  • Zhang, X., Aberdeen, D., and Vishwanathan, S. Conditional random fields for multi-agent reinforcement learning. In ICML, pp. 1143–1150. ACM, 2007.