# Asynchronous Coagent Networks

ICML, pp. 5426-5435, 2020.

Abstract:

Coagent policy gradient algorithms (CPGAs) are reinforcement learning algorithms for training a class of stochastic neural networks called coagent networks. In this work, we prove that CPGAs converge to locally optimal policies. Additionally, we extend prior theory to encompass asynchronous and recurrent coagent networks. These extensions…

Introduction

- Reinforcement learning (RL) policies are often represented by stochastic neural networks (SNNs).
- In this paper the authors study the problem of deriving learning rules for RL agents with SNN policies.
- Coagent networks are one formulation of SNN policies for RL agents (Thomas & Barto, 2011).
- St, At, and Rt are the state, action, and reward at time t, and are random variables that take values in S, A, and R, respectively.
- An episode is a sequence of states, actions, and rewards, starting from t=0 and continuing indefinitely.
- The authors assume that the discounted sum of rewards over an episode is finite.
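
As a concrete illustration of the last assumption (notation assumed from the summary, not code from the paper), the discounted sum of rewards for an episode is G = Σ_t γ^t R_t, which is finite whenever the rewards are bounded and 0 ≤ γ < 1:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = sum_t gamma^t * rewards[t] over a finite episode prefix.

    With bounded rewards and 0 <= gamma < 1, this sum converges as the
    episode continues indefinitely, matching the paper's assumption.
    """
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```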

Highlights

- Reinforcement learning (RL) policies are often represented by stochastic neural networks (SNNs)
- Coagent networks are composed of conjugate agents, or coagents; each coagent is a reinforcement learning algorithm that learns and acts cooperatively with the other coagents in its network
- We focus on the case where each coagent is a policy gradient reinforcement learning algorithm, and call the resulting algorithms coagent policy gradient algorithms (CPGAs)
- Using the option-critic framework as an example, we have shown that the Asynchronous Coagent Policy Gradient Theorem is a useful tool for analyzing arbitrary stochastic networks
- We provide a formal and general proof of the coagent policy gradient theorem (CPGT) for stochastic policy networks, and extend it to the asynchronous and recurrent setting
- Future work will focus on the potential for massive parallelization of asynchronous coagent networks, and on the potential for many levels of implicit temporal abstraction through varying coagent execution rates.
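
The idea in the highlights can be sketched with a minimal, hypothetical example: two Bernoulli-policy coagents chained together, each applying its own local REINFORCE update weighted by the shared return, treating the rest of the network as part of the environment. The toy task, single-weight policies, and learning rate are invented for illustration; this is not the paper's implementation:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class Coagent:
    """A single coagent: a Bernoulli policy over {0, 1} with one weight."""

    def __init__(self):
        self.w = 0.0

    def act(self, feature):
        p = sigmoid(self.w * feature)
        x = 1.0 if random.random() < p else 0.0
        # d/dw log Bernoulli(x | sigmoid(w * feature)) = (x - p) * feature
        return x, (x - p) * feature

    def update(self, grad, ret, lr=0.1):
        # Local REINFORCE step using the shared return; the other
        # coagents are treated as part of the environment.
        self.w += lr * ret * grad

random.seed(0)
c1, c2 = Coagent(), Coagent()
for _ in range(500):
    s = random.choice([-1.0, 1.0])
    u, g1 = c1.act(s)        # coagent 1 emits a message from the state
    a, g2 = c2.act(s + u)    # coagent 2 picks the action from state + message
    ret = 1.0 if (a == 1.0) == (s > 0.0) else 0.0  # shared return
    c1.update(g1, ret)
    c2.update(g2, ret)
```

Each coagent updates only its own weight from its own log-probability gradient, which is the structure the coagent policy gradient theorem justifies.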

Conclusion

- The authors provide a formal and general proof of the coagent policy gradient theorem (CPGT) for stochastic policy networks, and extend it to the asynchronous and recurrent setting.
- The authors empirically support the CPGT, and use the option-critic framework as an example to show how the approach facilitates and simplifies gradient derivation for arbitrary stochastic networks.
- Future work will focus on the potential for massive parallelization of asynchronous coagent networks, and on the potential for many levels of implicit temporal abstraction through varying coagent execution rates.

Related work

- Klopf (1982) theorized that traditional models of classical and operant conditioning could be explained by modeling biological neurons as hedonistic, that is, seeking excitation and avoiding inhibition. The ideas motivating coagent networks bear a deep resemblance to Klopf’s proposal.

Stochastic neural networks have applications dating back at least to Marvin Minsky's stochastic neural analog reinforcement calculator, built in 1951 (Russell & Norvig, 2016). Research on stochastic learning automata continued this work (Narendra & Thathachar, 1989); one notable example is the adaptive reward-penalty learning rule for training stochastic networks (Barto, 1985). Similarly, Williams (1992) proposed the well-known REINFORCE algorithm with the intent of training stochastic networks. Since then, REINFORCE has primarily been applied to deterministic networks. However, Thomas (2011) proposed CPGAs for RL, building on the original intent of Williams (1992).

Funding

- Research reported in this paper was sponsored in part by the CCDC Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA)

References

- Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Barto, A. G. Learning by statistical cooperation of selfinterested neuron-like computing elements. Human Neurobiology, 4(4):229–256, 1985.
- Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
- Guestrin, C., Lagoudakis, M., and Parr, R. Coordinated reinforcement learning. In ICML, volume 2, pp. 227–234, 2002.
- Klopf, A. H. The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence. Hemisphere Publishing Corporation, 1982.
- Liu, B., Singh, S., Lewis, R. L., and Qin, S. Optimal rewards for cooperative agents. IEEE Transactions on Autonomous Mental Development, 6(4):286–297, 2014.
- Narendra, K. S. and Thathachar, M. A. Learning Automata: an Introduction. Prentice-Hall, Inc., 1989.
- Riemer, M., Liu, M., and Tesauro, G. Learning abstract options. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), NeurIPS 31, pp. 10424–10434. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8243-learning-abstract-options.pdf.
- Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.
- Schulman, J., Heess, N., Weber, T., and Abbeel, P. Gradient estimation using stochastic computation graphs. In NeurIPS, pp. 3528–3536, 2015.
- Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2): 181–211, 1999.
- Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., and White, A. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proc. of 10th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2011), pp. 761–768, 2011.
- Thomas, P. S. Policy gradient coagent networks. In NeurIPS, pp. 1944–1952, 2011.
- Thomas, P. S. and Barto, A. G. Conjugate Markov decision processes. In ICML, pp. 137–144, 2011.
- Thomas, P. S. and Barto, A. G. Motor primitive discovery. In Proceedings of the IEEE Conference on Development and Learning and Epigenetic Robotics, pp. 1–8, 2012.
- Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
- Zhang, S. and Whiteson, S. DAC: The double actor-critic architecture for learning options. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), NeurIPS 32, pp. 2010–2020. Curran Associates, Inc., 2019.
- Zhang, X., Aberdeen, D., and Vishwanathan, S. Conditional random fields for multi-agent reinforcement learning. In ICML, pp. 1143–1150. ACM, 2007.
