R-learning in actor-critic model offers a biologically relevant mechanism for sequential decision-making
NeurIPS 2020 (2020)
When should you continue with your ongoing plans and when should you instead decide to pursue better opportunities? We show in theory and experiment that such stay-or-leave decisions are consistent with deep R-learning both behaviorally and neuronally. Our results suggest that real-world agents leave depleting resources when their reward …
- In everyday life we repeatedly face sequential stay-or-leave decisions. These include time investment, employment, entertainment, and other choices in settings where rewards decrease over time.
- A few studies have explored sequential stay-or-leave decisions in humans and in rodents, the model organism used to access neuronal activity at high resolution.
- In both cases, decision patterns were collected in foraging tasks: experimental settings in which subjects decide when to leave depleting resources (2).
- Reward options were represented by multiple sources of primary rewards, decreasing in size or probability over time to model natural resource depletion (2; 3).
- We developed foraging tasks in which animals navigated between multiple sources of depleting rewards.
- We propose that real-world agents compare the expected reward to an exponential average of past rewards – a decision rule we named the Leaky MVT for its similarity to the conclusions of the Marginal Value Theorem (MVT); see the sketch after this list.
- We show that individual stay-or-leave decisions – and dopaminergic neuronal firing in the ventral tegmental area (VTA) of the animals – are consistent with R-learning, a reinforcement learning (RL) paradigm that maximizes the difference between the expected and exponentially averaged rewards, aiming to perform better than average.
- We further derived the Leaky MVT – a novel decision rule based on exponential filtering of past rewards.
- As the Leaky MVT emerges from R-learning (Appendix A3), we argue that R-learning has the potential to offer an optimal strategy for sequential stay-or-leave decisions in real-world conditions.
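The Leaky MVT rule above can be made concrete with a minimal sketch, assuming a geometric depletion schedule and an illustrative learning rate `alpha` (both are assumptions for illustration; the paper's exact parameterization is given in its appendices):

```python
def update_avg_reward(avg_reward, reward, alpha=0.1):
    """Leaky (exponential) average of past rewards; the rate alpha is
    an assumed, illustrative value setting the filter's time constant."""
    return (1.0 - alpha) * avg_reward + alpha * reward

def leaky_mvt_stay(expected_reward, avg_reward):
    """Leaky MVT rule: stay while the expected reward at the current
    source exceeds the exponentially averaged past reward."""
    return expected_reward > avg_reward

# Toy episode: the reward at a source depletes geometrically from r0.
r0, depletion, avg_r = 4.0, 0.8, 1.0
for t in range(50):
    expected = r0 * depletion ** t          # depleting reward schedule
    if not leaky_mvt_stay(expected, avg_r):
        print(f"leave at step {t}")         # reward fell below the average
        break
    avg_r = update_avg_reward(avg_r, expected)
```

Under the classic MVT the comparison would instead be against the environment's long-run average reward rate; the exponential filter makes the leaving threshold adapt to recently experienced rewards.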
- 2.1 Sequential foraging decisions reveal stay-or-leave choice modulation in mice
The goal of this work was to identify the mechanism by which real-world agents learn to make sequential stay-or-leave decisions in the context of depleting resources.
- To pursue this goal, the authors developed foraging tasks in which animals navigated between multiple sources of depleting rewards.
- In real-world conditions, the authors often face sequential stay-or-leave decisions about whether to engage with the current option, or to search for a better one.
- The authors further derived the Leaky MVT – a novel decision rule based on exponential filtering of past rewards.
- The authors show that this rule is implemented by R-learning (Appendix A3) and accounts for the animals’ behavior in the tasks; a sketch of the corresponding update follows this list.
- The authors discuss how these findings connect to decision-making and learning in real-world agents.
- Funding in direct support of this work: The Swartz Foundation; DFG Grant STA 1544/1-1
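Before turning to the quantitative comparison below, a hedged sketch of the R-learning update referenced above may help. It follows the general average-reward temporal-difference form of Schwartz's R-learning; the function names and the step sizes `alpha_v` and `eta` are illustrative assumptions, not the authors' code:

```python
def r_learning_td_error(reward, avg_reward, v_next, v_curr):
    """Average-reward TD error used by R-learning: outcomes are
    evaluated relative to the running estimate of the reward rate,
    so the agent aims to do better than average."""
    return reward - avg_reward + v_next - v_curr

def r_learning_step(v, avg_reward, reward, s, s_next,
                    alpha_v=0.1, eta=0.02):
    """One critic update: nudge the state value toward the TD target
    and update the average-reward estimate, an exponential filter
    over past rewards matching the Leaky MVT's leaky average."""
    delta = r_learning_td_error(reward, avg_reward, v[s_next], v[s])
    v[s] += alpha_v * delta       # critic (state-value) update
    avg_reward += eta * delta     # reward-rate update
    return v, avg_reward
```

In an actor-critic arrangement, the same TD error would also drive the policy (actor) update; only the critic side is sketched here.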
In the case of V-learning, κ = 1 results in leaving ports at the same threshold regardless of the initial reward value. Additionally, to compare the MVT and the Leaky MVT quantitatively, we performed parameter fitting for both models using the behavior patterns of 7 mice observed in the “random initial rewards” task. We minimized the negative log likelihood computed over the models’ predictions (Appendix A5) with respect to the parameters of the models.
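A minimal sketch of this fitting procedure, assuming each model exposes a predict_leave_prob(params, trial) function and that each trial records a binary left flag (both names are illustrative assumptions; the exact likelihood is defined in Appendix A5):

```python
import numpy as np
from scipy.optimize import minimize

def negative_log_likelihood(params, trials, predict_leave_prob):
    """Sum of -log p over the observed stay/leave choices of one animal."""
    nll = 0.0
    for trial in trials:
        p_leave = np.clip(predict_leave_prob(params, trial), 1e-9, 1 - 1e-9)
        p_choice = p_leave if trial["left"] else 1.0 - p_leave
        nll -= np.log(p_choice)
    return nll

def fit_model(trials, predict_leave_prob, x0):
    """Fit one model (MVT or Leaky MVT) to one animal's behavior by
    minimizing the negative log likelihood over its parameters."""
    return minimize(negative_log_likelihood, x0,
                    args=(trials, predict_leave_prob), method="Nelder-Mead")
```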
- Nils Kolling and Thomas Akam. (Reinforcement?) learning to forage optimally. Current opinion in neurobiology, 46:162–169, 2017.
- Sara M Constantino and Nathaniel D Daw. Learning the opportunity cost of time in a patch-foraging task. Cognitive, Affective, & Behavioral Neuroscience, 15(4):837–853, 2015.
- Eran Lottem, Dhruba Banerjee, Pietro Vertechi, Dario Sarra, Matthijs oude Lohuis, and Zachary F Mainen. Activation of serotonin neurons promotes active persistence in a probabilistic foraging task. Nature communications, 9(1):1–12, 2018.
- Eric L Charnov. Optimal foraging, the marginal value theorem. Theoretical population biology, 9(2):129–136, 1976.
- Jacob D Davidson and Ahmed El Hady. Foraging as an evidence accumulation process. PLoS computational biology, 15(7):e1007060, 2019.
- Jane X Wang, Zeb Kurth-Nelson, Dharshan Kumaran, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Demis Hassabis, and Matthew Botvinick. Prefrontal cortex as a meta-reinforcement learning system. Nature neuroscience, 21(6):860–868, 2018.
- Robb B Rutledge, Stephanie C Lazzaro, Brian Lau, Catherine E Myers, Mark A Gluck, and Paul W Glimcher. Dopaminergic drugs modulate learning rates and perseveration in parkinson’s patients in a dynamic foraging task. Journal of Neuroscience, 29(48):15104–15114, 2009.
- Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, and Adrian Bolton. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
- Peter Dayan and Laurence F Abbott. Theoretical neuroscience: computational and mathematical modeling of neural systems. MIT press, 2001.
- Daeyeol Lee, Hyojung Seo, and Min Whan Jung. Neural basis of reinforcement learning and decision making. Annual review of neuroscience, 35:287–308, 2012.
- Wolfram Schultz, Peter Dayan, and P Read Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, 1997.
- Wolfram Schultz. Predictive reward signal of dopamine neurons. Journal of neurophysiology, 80(1):1–27, 1998.
- Paul W Glimcher. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences, 108(Supplement 3):15647–15654, 2011.
- Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
- Kenji Morita, Mieko Morishima, Katsuyuki Sakai, and Yasuo Kawaguchi. Reinforcement learning: computing the temporal difference of values via distinct corticostriatal pathways. Trends in neurosciences, 35(8):457–467, 2012.
- Daphna Joel, Yael Niv, and Eytan Ruppin. Actor–critic models of the basal ganglia: New anatomical and computational perspectives. Neural networks, 15(4-6):535–547, 2002.
- Anton Schwartz. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the tenth international conference on machine learning, volume 298, pages 298–305, 1993.
- Brian Lau and Paul W Glimcher. Dynamic response-by-response models of matching behavior in rhesus monkeys. Journal of the experimental analysis of behavior, 84(3):555–579, 2005.
- Kevin Lloyd and Peter Dayan. Tamping ramping: algorithmic, implementational, and computational explanations of phasic dopamine signals in the accumbens. PLoS computational biology, 11(12):e1004622, 2015.
- Dominic J Barraclough, Michelle L Conroy, and Daeyeol Lee. Prefrontal cortex and decision making in a mixed-strategy game. Nature neuroscience, 7(4):404–410, 2004.
- Jong-Nam Kim and Michael N Shadlen. Neural correlates of a decision in the dorsolateral prefrontal cortex of the macaque. Nature neuroscience, 2(2):176–185, 1999.
- Nathaniel D Daw, Yael Niv, and Peter Dayan. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature neuroscience, 8(12):1704–1711, 2005.
- Máté Lengyel and Peter Dayan. Hippocampal contributions to control: the third way. In Advances in neural information processing systems, pages 889–896, 2008.
- Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
- Alex T Piet, Ahmed El Hady, and Carlos D Brody. Rats adopt the optimal timescale for evidence integration in a dynamic environment. Nature communications, 9(1):1–12, 2018.
- Angela J Yu and Jonathan D Cohen. Sequential effects: superstition or rational behavior? In Advances in neural information processing systems, pages 1873–1880, 2009.
- Richard S Sutton. Gain adaptation beats least squares. In Proceedings of the 7th Yale workshop on adaptive and learning systems, volume 161168, 1992.
- Nathaniel D Daw, Sham Kakade, and Peter Dayan. Opponent interactions between serotonin and dopamine. Neural networks, 15(4-6):603–616, 2002.
- Yael Niv, Nathaniel D Daw, Daphna Joel, and Peter Dayan. Tonic dopamine: opportunity costs and the control of response vigor. Psychopharmacology, 191(3):507–520, 2007.
- Nicolas Schweighofer and Kenji Doya. Meta-learning in reinforcement learning. Neural Networks, 16(1):5–9, 2003.
- Sergey A Shuvaev, Ngoc B Tran, Marcus Stephenson-Jones, Bo Li, and Alexei A Koulakov. Neural networks with motivation. arXiv preprint arXiv:1906.09528, 2019.