
R-learning in actor-critic model offers a biologically relevant mechanism for sequential decision-making

NeurIPS 2020 (2020)

Abstract

When should you continue with your ongoing plans and when should you instead decide to pursue better opportunities? We show in theory and experiment that such stay-or-leave decisions are consistent with deep R-learning both behaviorally and neuronally. Our results suggest that real-world agents leave depleting resources when their reward …

Introduction
  • In everyday life we repeatedly face sequential stay-or-leave decisions. These decisions include time investment, employment, entertainment and other choices in settings where rewards decrease over time.
  • A few studies have explored sequential stay-or-leave decisions in humans or rodents – the model organism used to access neuronal activity at high resolution.
  • In both cases, decision patterns were collected in foraging tasks – experimental settings where subjects decide when to leave depleting resources (2).
  • Reward options were represented by multiple sources of primary rewards, decreasing in size or probability over time to model natural resource depletion (2; 3).
Highlights
  • In everyday life we repeatedly face sequential stay-or-leave decisions
  • We developed foraging tasks in which animals navigated between multiple sources of depleting rewards
  • We propose that real-world agents compare the expected reward to an exponential average of past rewards – the decision rule we named the Leaky MVT for its similarity to the conclusions of the Marginal Value Theorem (MVT)
  • We show that individual stay-or-leave decisions – and dopaminergic neuronal firing in the ventral tegmental area (VTA) of the animals – are consistent with R-learning, a reinforcement learning (RL) paradigm that maximizes the difference between the expected and exponentially averaged rewards, aiming to perform better than average (see the sketch after this list)
  • We further derived the Leaky MVT – a novel decision rule based on exponential filtering of past rewards
  • As the Leaky MVT emerges from R-learning (Appendix A3), we argue that R-learning has the potential to offer an optimal strategy for sequential stay-or-leave decisions in real-world conditions
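
To make the two points above concrete, here is a minimal, self-contained Python sketch of how a Leaky-MVT-like leave rule can emerge from R-learning. It uses a tabular, Q-value variant of R-learning on a simulated port whose reward shrinks geometrically with each harvest, rather than the deep actor-critic network studied in the paper; the reward schedule, learning rates, travel duration, and the simplification that leaving carries zero follow-up value are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    ALPHA, KAPPA, EPS = 0.1, 0.02, 0.1   # value learning rate, leak rate of the reward average, exploration rate
    DECAY = 0.8                          # each harvest yields DECAY times the previous reward (assumed schedule)
    TRAVEL = 4                           # zero-reward steps spent traveling between ports (assumed)
    R0 = [0.5, 1.0, 2.0]                 # possible initial port rewards, as in a "random initial rewards" task
    MAX_T = 12                           # cap on harvests per visit, to keep the value table finite

    Q = np.zeros((len(R0), MAX_T, 2))    # Q[port type, harvest index, action]; action 0 = leave, 1 = stay
    rho = 0.0                            # exponentially averaged past reward

    for _ in range(20000):               # visits to randomly chosen ports
        k = int(rng.integers(len(R0)))
        t, reward_now = 0, R0[k]
        while t < MAX_T:
            a = int(rng.integers(2)) if rng.random() < EPS else int(np.argmax(Q[k, t]))
            if a == 1:                   # stay: harvest the current reward
                r = reward_now
                next_v = Q[k, t + 1].max() if t + 1 < MAX_T else 0.0
            else:                        # leave: no reward and, as a simplification, zero follow-up value
                r, next_v = 0.0, 0.0
            delta = r - rho + next_v - Q[k, t, a]   # R-learning TD error: reward measured relative to rho
            Q[k, t, a] += ALPHA * delta
            rho += KAPPA * (r - rho)                # exponential filtering of past rewards
            if a == 0:
                break
            reward_now *= DECAY
            t += 1
        for _ in range(TRAVEL):                     # traveling yields no reward but still lowers rho
            rho += KAPPA * (0.0 - rho)

    # Leaky MVT pattern: richer ports are harvested longer, yet each is left once
    # its reward falls to roughly the same level, near the average-reward estimate rho.
    print(f"rho ~ {rho:.3f}")
    for k, r0 in enumerate(R0):
        prefers_leave = np.nonzero(Q[k, :, 0] > Q[k, :, 1])[0]
        leave_t = int(prefers_leave[0]) if prefers_leave.size else MAX_T
        print(f"initial reward {r0}: leave after ~{leave_t} harvests "
              f"(reward at that point ~ {r0 * DECAY ** leave_t:.3f})")

Treating "leave" as ending the visit with zero follow-up value keeps the table finite; in the paper's continuing-task setting, the value of traveling to the next port would take its place.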
Results
  • 2.1 Sequential foraging decisions reveal stay-or-leave choice modulation in mice

    The goal of this work was to identify the mechanism by which real-world agents learn to make sequential stay-or-leave decisions in the context of depleting resources.
  • To pursue this goal, the authors developed foraging tasks in which animals navigated between multiple sources of depleting rewards.
Conclusion
  • In real-world conditions, the authors often face sequential stay-or-leave decisions about whether to engage with the current option, or to search for a better one.
  • The authors further derived the Leaky MVT – a novel decision rule based on exponential filtering of past rewards.
  • The authors show that this rule is implemented by R-learning (Appendix A3) and accounts for animals’ behavior in the tasks.
  • The authors discuss how these findings connect to decision-making and learning in real-world agents.
Funding
  • Funding in direct support of this work: The Swartz Foundation; DFG Grant STA 1544/1-1
Study subjects and analysis
mice: 7
In the case of V-learning, κ = 1 results in leaving ports at the same threshold regardless of the initial reward value. Additionally, to compare the MVT and the Leaky MVT quantitatively, we performed parameter fitting for both models using the behavior patterns of 7 mice observed in the “random initial rewards” task. We minimized the negative log likelihood computed over the models’ predictions (Appendix A5) with respect to the parameters of the models.
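
This kind of fit can be sketched as follows. Since the paper's exact likelihood is defined in its Appendix A5 (not reproduced on this page), the choice model, the data format, and the synthetic stand-in data below are illustrative assumptions rather than the authors' procedure.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(params, trials):
        """params = (kappa, beta): leak rate of the reward average and choice noise (assumed model)."""
        kappa, beta = params
        nll = 0.0
        for rewards, choices in trials:        # choices[t] = 1 if the animal left right after reward t
            rho = float(np.mean(rewards))      # assumed initialization of the running average
            for r, left in zip(rewards, choices):
                rho += kappa * (r - rho)                             # exponential filtering of past rewards
                p_leave = 1.0 / (1.0 + np.exp(-beta * (rho - r)))    # logistic leave rule around the Leaky MVT threshold
                p = p_leave if left else 1.0 - p_leave
                nll -= np.log(max(p, 1e-12))
        return nll

    # Synthetic stand-in for the behavioral data of the 7 mice (illustration only).
    rng = np.random.default_rng(0)
    trials = []
    for _ in range(200):
        r0 = float(rng.choice([0.5, 1.0, 2.0]))
        n = int(rng.integers(2, 7))                  # number of harvests before leaving
        rewards = r0 * 0.8 ** np.arange(n + 1)       # reward received on each harvest
        choices = [0] * n + [1]                      # stayed n times, then left
        trials.append((rewards, choices))

    fit = minimize(neg_log_likelihood, x0=[0.1, 5.0], args=(trials,),
                   method="L-BFGS-B", bounds=[(1e-3, 1.0), (1e-3, 50.0)])
    print("fitted (kappa, beta):", fit.x, " negative log likelihood:", fit.fun)

A plain-MVT variant could be fitted in the same way by replacing the leaky average rho with the long-run mean reward, and the two models compared by their minimized negative log likelihoods.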

References
  • Nils Kolling and Thomas Akam. (reinforcement?) learning to forage optimally. Current opinion in neurobiology, 46:162–169, 2017.
  • Sara M Constantino and Nathaniel D Daw. Learning the opportunity cost of time in a patch-foraging task. Cognitive, Affective, & Behavioral Neuroscience, 15(4):837–853, 2015.
  • Eran Lottem, Dhruba Banerjee, Pietro Vertechi, Dario Sarra, Matthijs oude Lohuis, and Zachary F Mainen. Activation of serotonin neurons promotes active persistence in a probabilistic foraging task. Nature communications, 9(1):1–12, 2018.
  • Eric L Charnov. Optimal foraging, the marginal value theorem. Theoretical Population Biology, 9(2):129–136, 1976.
  • Jacob D Davidson and Ahmed El Hady. Foraging as an evidence accumulation process. PLoS computational biology, 15(7):e1007060, 2019.
  • Jane X Wang, Zeb Kurth-Nelson, Dharshan Kumaran, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Demis Hassabis, and Matthew Botvinick. Prefrontal cortex as a meta-reinforcement learning system. Nature neuroscience, 21(6):860–868, 2018.
  • Robb B Rutledge, Stephanie C Lazzaro, Brian Lau, Catherine E Myers, Mark A Gluck, and Paul W Glimcher. Dopaminergic drugs modulate learning rates and perseveration in parkinson’s patients in a dynamic foraging task. Journal of Neuroscience, 29(48):15104–15114, 2009.
  • Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, and Adrian Bolton. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
  • Peter Dayan and Laurence F Abbott. Theoretical neuroscience: computational and mathematical modeling of neural systems. 2001.
  • Daeyeol Lee, Hyojung Seo, and Min Whan Jung. Neural basis of reinforcement learning and decision making. Annual review of neuroscience, 35:287–308, 2012.
  • Wolfram Schultz, Peter Dayan, and P Read Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, 1997.
  • Wolfram Schultz. Predictive reward signal of dopamine neurons. Journal of neurophysiology, 80(1):1–27, 1998.
  • Paul W Glimcher. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences, 108(Supplement 3):15647–15654, 2011.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • Kenji Morita, Mieko Morishima, Katsuyuki Sakai, and Yasuo Kawaguchi. Reinforcement learning: computing the temporal difference of values via distinct corticostriatal pathways. Trends in neurosciences, 35(8):457–467, 2012.
  • Daphna Joel, Yael Niv, and Eytan Ruppin. Actor–critic models of the basal ganglia: New anatomical and computational perspectives. Neural networks, 15(4-6):535–547, 2002.
  • Anton Schwartz. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the tenth international conference on machine learning, volume 298, pages 298–305, 1993.
  • Brian Lau and Paul W Glimcher. Dynamic response-by-response models of matching behavior in rhesus monkeys. Journal of the experimental analysis of behavior, 84(3):555–579, 2005.
  • Kevin Lloyd and Peter Dayan. Tamping ramping: algorithmic, implementational, and computational explanations of phasic dopamine signals in the accumbens. PLoS computational biology, 11(12):e1004622, 2015.
  • D. J. Barraclough, M. L. Conroy, and D. Lee. Prefrontal cortex and decision making in a mixed-strategy game. Nature Neuroscience, 7(4):404–410, 2004.
  • J. N. Kim and M. N. Shadlen. Neural correlates of a decision in the dorsolateral prefrontal cortex of the macaque. Nature Neuroscience, 2(2):176–185, 1999.
  • Nathaniel D Daw, Yael Niv, and Peter Dayan. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature neuroscience, 8(12):1704–1711, 2005.
  • Máté Lengyel and Peter Dayan. Hippocampal contributions to control: the third way. In Advances in neural information processing systems, pages 889–896, 2008.
  • Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
  • Alex T Piet, Ahmed El Hady, and Carlos D Brody. Rats adopt the optimal timescale for evidence integration in a dynamic environment. Nature communications, 9(1):1–12, 2018.
  • Angela J Yu and Jonathan D Cohen. Sequential effects: superstition or rational behavior? In Advances in neural information processing systems, pages 1873–1880, 2009.
  • Richard S Sutton. Gain adaptation beats least squares. In Proceedings of the 7th Yale workshop on adaptive and learning systems, volume 161168, 1992.
  • Nathaniel D Daw, Sham Kakade, and Peter Dayan. Opponent interactions between serotonin and dopamine. Neural networks, 15(4-6):603–616, 2002.
  • Yael Niv, Nathaniel D Daw, Daphna Joel, and Peter Dayan. Tonic dopamine: opportunity costs and the control of response vigor. Psychopharmacology, 191(3):507–520, 2007.
  • Nicolas Schweighofer and Kenji Doya. Meta-learning in reinforcement learning. Neural Networks, 16(1):5–9, 2003.
  • Sergey A Shuvaev, Ngoc B Tran, Marcus Stephenson-Jones, Bo Li, and Alexei A Koulakov. Neural networks with motivation. arXiv preprint arXiv:1906.09528, 2019.
Authors
Sarah Starosta
Duda Kvitsiani