A local temporal difference code for distributional reinforcement learning

NeurIPS 2020

TL;DR: We introduce the Laplace code: a local temporal difference code for distributional reinforcement learning that is representationally powerful and computationally straightforward.

Abstract

Recent theoretical and experimental results suggest that the dopamine system implements distributional temporal difference backups, allowing learning of the entire distributions of the long-run values of states rather than just their expected values. However, the distributional codes explored so far rely on a complex imputation step which...

Introduction
  • In the traditional Reinforcement Learning (RL) framework, agents make decisions by learning and maximizing the scalar values of states, which quantify the expected sums of discounted future rewards that will be encountered from those states [1].
  • In addition to the value distribution, the temporal evolution of the immediate reward distribution can be recovered from the code by applying an inverse Laplace operator (a minimal sketch of this two-step procedure follows this list).
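The sketch below is not the authors' implementation; it is a minimal illustration, under simplifying assumptions (a deterministic chain of states with a fixed reward schedule, tabular TD(0), and a Tikhonov-regularized pseudo-inverse standing in for the inverse Laplace operator), of how values learned in parallel for an ensemble of temporal discounts form a discrete Laplace transform of the expected reward timeline, which can then be inverted. The names gammas, rewards and timeline are illustrative.

```python
# Minimal sketch, not the authors' code: tabular TD(0) run in parallel for an
# ensemble of discount factors on a deterministic chain s_0 -> s_1 -> ... -> s_T.
# At convergence V_gamma(s_0) = sum_tau gamma^tau * r_tau, a discrete Laplace
# transform of the reward timeline, which a regularized pseudo-inverse
# (standing in for the inverse operator L^-1) can recover.
import numpy as np

T = 6                                     # number of future timesteps
gammas = np.linspace(0.05, 0.95, 40)      # ensemble of temporal discounts
rewards = np.zeros(T)
rewards[2], rewards[5] = 1.0, 0.5         # toy deterministic reward schedule

# One value estimate per (gamma, state); state T is terminal with value 0.
V = np.zeros((len(gammas), T + 1))
alpha = 0.1
for episode in range(3000):
    for t in range(T):                    # walk the chain once per episode
        delta = rewards[t] + gammas * V[:, t + 1] - V[:, t]   # one TD error per gamma
        V[:, t] += alpha * delta

# Design matrix of the discrete Laplace transform: A[g, tau] = gamma_g ** tau.
A = gammas[:, None] ** np.arange(T)[None, :]

# Lightly regularized least squares as a stand-in for L^-1 (the inversion is
# ill-posed in general, hence the small ridge term).
lam = 1e-8
timeline = np.linalg.solve(A.T @ A + lam * np.eye(T), A.T @ V[:, 0])
print(np.round(timeline, 2))              # approximately [0, 0, 1, 0, 0, 0.5]
```

The regularized inversion is only one convenient stand-in; the paper's references on Tikhonov regularization discuss the ill-posedness of this step in more detail.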
Highlights
  • In the traditional Reinforcement Learning (RL) framework, agents make decisions by learning and maximizing the scalar values of states, which quantify the expected sums of discounted future rewards that will be encountered from those states [1]
  • We developed a distributional RL code for value that has two major advantages over the one recently proposed by Dabney et al. [10]: (1) it allows the agent to recover the value distributions and the temporal evolution of the reward distribution; and (2) it can be learned with a local learning rule, as opposed to the rule of Dabney et al., which requires that a unit know the states of other units in order to update its parameters.
  • The latter property is appealing and realistic from a biological point of view, and it leads to a significantly simpler convergence analysis than for non-local codes, for which convergence needs to be proven in a distributional sense.
  • We showed that our distributional code can be computed linearly from an ensemble of systems computing the successor representation with different temporal discounts, a model proposed for the hippocampus [11, 22] (a minimal sketch of this linear readout follows this list).
  • For the first two of these dimensions (reward magnitude and temporal discount), dopamine neurons in the Ventral Tegmental Area (VTA) apparently code for varying reward magnitudes [19], and it has been proposed that value functions and/or temporal difference (TD) prediction errors for different values of γ are arranged in a spatially organized manner along the dorso-ventral axis of the striatum [23, 24].
  • It is still unclear whether time horizon is separated from temporal discount in the dopamine system, but some experimental results suggest a discrete coding of temporal horizon [27].
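The successor-representation claim above can be illustrated with a toy computation. This is my own construction, not the authors' code: for each discount γ the successor representation of a small Markov chain is M_γ = (I − γP)⁻¹, and the value function for that discount is the linear readout M_γ r, so an ensemble of SRs over discounts linearly yields the ensemble of values the Laplace code operates on. The transition matrix P and reward vector r below are illustrative.

```python
# Minimal sketch (my own construction): values for an ensemble of discounts
# read out linearly from successor representations computed with those discounts.
import numpy as np

n_states = 5
P = np.zeros((n_states, n_states))        # deterministic chain 0 -> 1 -> ... -> 4
for i in range(n_states - 1):
    P[i, i + 1] = 1.0
P[-1, -1] = 1.0                           # last state is absorbing

r = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # immediate reward attached to each state

for gamma in (0.5, 0.7, 0.9):
    M = np.linalg.inv(np.eye(n_states) - gamma * P)   # successor representation M_gamma
    V = M @ r                                         # linear readout: V_gamma = M_gamma @ r
    print(f"gamma={gamma}: V(s0)={V[0]:.3f}")         # equals gamma**2 on this chain
```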
Results
  • If the resolution along the h-dimension is high enough, applying the inverse Laplace operator L⁻¹ to the convergence points of the TD backups in Eq. 10 recovers {P_τ}, τ = 0, …, T: the set of probability distributions of immediate rewards at all future timesteps up to T, given that the current state is s_t.
  • As illustrated in Fig. 3, the Laplace code recovers a temporal map of the problem, indicating all the possible rewards at all future times.
  • The authors' local code recovers the value distribution and the temporal evolution of rewards {P_τ}.
  • In Fig. 5d the authors compare the flexibility of the Laplace code and the Expectile code to a change of horizon: the Expectile code's estimates need to re-converge to the new value distribution at s under the new horizon T.
  • The authors developed a distributional RL code for value that has two major advantages over the one recently proposed by Dabney et al. [10]: (1) it allows the agent to recover the value distributions and the temporal evolution of the reward distribution; and (2) it can be learned with a local learning rule, as opposed to the rule of Dabney et al., which requires that a unit know the states of other units in order to update its parameters (a sketch of such a unit-local update follows this list).
  • The authors' code decomposes value distributions and prediction errors across three separate dimensions: reward magnitude, temporal discounting and time horizon.
  • In a more general framework, the convergence points of the Laplace code are a form of General Value Function [28], which has been proposed as a unifying system for learning about many different variables from the same line of experience [29].
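To make the locality claim concrete, here is a sketch of a unit-local update. It is my own illustration, not the authors' implementation: each unit is indexed by a reward threshold h and a discount γ and updates only its own value table from the thresholded reward. The step-function form f_h(r) = 1[r > h] and the omission of the time-horizon dimension T are simplifications made here.

```python
# Minimal sketch (my own illustration): every unit (h, gamma) runs an ordinary
# TD(0) update on its own thresholded reward and its own value table, without
# reading the values or parameters of any other unit.  The horizon dimension T
# and the exact form of f_h used in the paper are simplified away here.
import numpy as np

def local_td_update(V, s, r, s_next, h, gamma, alpha=0.1):
    """One local update for the unit (h, gamma); V is that unit's own value table."""
    f_h_r = 1.0 if r > h else 0.0                # thresholded reward f_h(r) = 1[r > h]
    delta = f_h_r + gamma * V[s_next] - V[s]     # unit-local TD error
    V[s] += alpha * delta
    return delta

# A small population of independent units spanning thresholds and discounts.
thresholds = (0.0, 0.5, 1.0)
gammas = (0.5, 0.9)
n_states = 4
population = {(h, g): np.zeros(n_states) for h in thresholds for g in gammas}

# Feed the same transition to every unit; each update touches only that unit's table.
s, r, s_next = 0, 0.7, 1
for (h, g), V in population.items():
    local_td_update(V, s, r, s_next, h, g)
print(population[(0.5, 0.9)])                    # value table of the (h=0.5, gamma=0.9) unit
```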
Conclusion
  • If rewards always occur at the same time and are distributed according to a distribution D, then V_h in the Laplace code effectively learns E_{r∼D}[f_h(r)], the expected value of f_h(r) when rewards are drawn from D (a numeric sketch follows this list).
  • The additional error signals provided by the code could allow the system to learn richer representations than traditional RL does [7], and possibly even richer than those learnt using an Expectile code, since hidden representations must distinguish between states with the same value distribution but different temporal evolutions of the immediate reward distribution.
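The first conclusion invites a small numerical check. This is my own illustration, assuming (as this page does not state explicitly) that f_h is the step function 1[r > h]: if all rewards arrive at the same time and are drawn from D, each V_h converges to E_{r∼D}[f_h(r)] = P(r > h), so a grid of thresholds traces the complementary CDF of D, and differencing adjacent thresholds recovers D itself. The grid hs and the toy distribution are illustrative.

```python
# Minimal sketch (my own illustration, with f_h assumed to be a step function):
# delta-rule estimates of E[1[r > h]] over a grid of thresholds h trace the
# complementary CDF of the reward distribution D; differencing recovers D.
import numpy as np

rng = np.random.default_rng(1)
rewards = rng.choice([0.0, 1.0, 2.0], size=20000, p=[0.2, 0.5, 0.3])  # samples from D

hs = np.array([-0.5, 0.5, 1.5, 2.5])          # grid of reward thresholds
V_h = np.zeros(len(hs))                       # one estimate per threshold
alpha = 0.01
for r in rewards:
    f_h = (r > hs).astype(float)              # thresholded reward, one entry per h
    V_h += alpha * (f_h - V_h)                # running delta-rule average

print(np.round(V_h, 2))                       # ~[1.00, 0.80, 0.30, 0.00] = P(r > h)
print(np.round(-np.diff(V_h), 2))             # ~[0.20, 0.50, 0.30], i.e. D itself
```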
Funding
  • A.P. and P.T. were supported by SNF grant #315230_197296. P.D. was supported by the Max Planck Society and the Alexander von Humboldt Foundation. The authors declare no conflicts of interest.
References
  • [1] R. S. Sutton, A. G. Barto, et al., Introduction to Reinforcement Learning, vol. 135. MIT Press, Cambridge, 1998.
  • [2] M. F. Rushworth and T. E. Behrens, "Choice, uncertainty and value in prefrontal and cingulate cortex," Nature Neuroscience, vol. 11, no. 4, p. 389, 2008.
  • [3] A. Kepecs and Z. F. Mainen, "A computational framework for the study of confidence in humans and animals," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 367, no. 1594, pp. 1322–1337, 2012.
  • [4] H. Stojic, J. L. Orquin, P. Dayan, R. J. Dolan, and M. Speekenbrink, "Uncertainty in learning, choice, and visual fixation," Proceedings of the National Academy of Sciences, vol. 117, no. 6, pp. 3291–3300, 2020.
  • [5] T. E. Behrens, M. W. Woolrich, M. E. Walton, and M. F. Rushworth, "Learning the value of information in an uncertain world," Nature Neuroscience, vol. 10, no. 9, pp. 1214–1221, 2007.
  • [6] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 449–458, JMLR.org, 2017.
  • [7] C. Lyle, M. G. Bellemare, and P. S. Castro, "A comparative analysis of expected and distributional reinforcement learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4504–4511, 2019.
  • [8] M. Rowland, M. G. Bellemare, W. Dabney, R. Munos, and Y. W. Teh, "An analysis of categorical distributional reinforcement learning," arXiv preprint arXiv:1802.08163, 2018.
  • [9] M. Rowland, R. Dadashi, S. Kumar, R. Munos, M. G. Bellemare, and W. Dabney, "Statistics and samples in distributional reinforcement learning," arXiv preprint arXiv:1902.08102, 2019.
  • [10] W. Dabney, Z. Kurth-Nelson, N. Uchida, C. K. Starkweather, D. Hassabis, R. Munos, and M. Botvinick, "A distributional code for value in dopamine-based reinforcement learning," Nature, pp. 1–5, 2020.
  • [11] I. Momennejad and M. W. Howard, "Predicting the future with multi-scale successor representations," bioRxiv, p. 449470, 2018.
  • [12] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
  • [13] W. Schultz, "Predictive reward signal of dopamine neurons," Journal of Neurophysiology, vol. 80, no. 1, pp. 1–27, 1998.
  • [14] P. Dayan and L. F. Abbott, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Computational Neuroscience Series, 2001.
  • [15] D. J. Foster and M. A. Wilson, "Reverse replay of behavioural sequences in hippocampal place cells during the awake state," Nature, vol. 440, no. 7084, pp. 680–683, 2006.
  • [16] E. P. Simoncelli and B. A. Olshausen, "Natural image statistics and neural representation," Annual Review of Neuroscience, vol. 24, no. 1, pp. 1193–1216, 2001.
  • [17] K. Louie, P. W. Glimcher, and R. Webb, "Adaptive neural coding: from biological to behavioral decision-making," Current Opinion in Behavioral Sciences, vol. 5, pp. 91–99, 2015.
  • [18] S. Hong, B. N. Lundstrom, and A. L. Fairhall, "Intrinsic gain modulation and adaptive neural coding," PLoS Computational Biology, vol. 4, no. 7, 2008.
  • [19] N. Eshel, J. Tian, M. Bukwich, and N. Uchida, "Dopamine neurons share common response function for reward prediction error," Nature Neuroscience, vol. 19, no. 3, p. 479, 2016.
  • [20] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, "Distributional reinforcement learning with quantile regression," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [21] P. Dayan, "Improving generalization for temporal difference learning: The successor representation," Neural Computation, vol. 5, no. 4, pp. 613–624, 1993.
  • [22] I. Momennejad, E. M. Russek, J. H. Cheong, M. M. Botvinick, N. D. Daw, and S. J. Gershman, "The successor representation in human reinforcement learning," Nature Human Behaviour, vol. 1, no. 9, pp. 680–692, 2017.
  • [23] S. C. Tanaka, N. Schweighofer, S. Asahi, K. Shishida, Y. Okamoto, S. Yamawaki, and K. Doya, "Serotonin differentially regulates short- and long-term prediction of rewards in the ventral and dorsal striatum," PLoS ONE, vol. 2, no. 12, 2007.
  • [24] X. Cai, S. Kim, and D. Lee, "Heterogeneous coding of temporally discounted values in the dorsal and ventral striatum during intertemporal choice," Neuron, vol. 69, no. 1, pp. 170–182, 2011.
  • [25] S. C. Tanaka, K. Shishida, N. Schweighofer, Y. Okamoto, S. Yamawaki, and K. Doya, "Serotonin affects association of aversive outcomes to past actions," Journal of Neuroscience, vol. 29, no. 50, pp. 15669–15674, 2009.
  • [26] N. Schweighofer, S. C. Tanaka, and K. Doya, "Serotonin and the evaluation of future rewards: theory, experiments, and possible neural mechanisms," Annals of the New York Academy of Sciences, vol. 1104, no. 1, pp. 289–300, 2007.
  • [27] E. S. Bromberg-Martin, M. Matsumoto, H. Nakahara, and O. Hikosaka, "Multiple timescales of memory in lateral habenula and dopamine neurons," Neuron, vol. 67, no. 3, pp. 499–510, 2010.
  • [28] R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup, "Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction," in The 10th International Conference on Autonomous Agents and Multiagent Systems, vol. 2, pp. 761–768, 2011.
  • [29] G. Comanici, D. Precup, A. Barreto, D. K. Toyama, E. Aygün, P. Hamel, S. Vezhnevets, S. Hou, and S. Mourad, "Knowledge representation for reinforcement learning using general value functions," 2018.
  • [30] B. Engelhard, J. Finkelstein, J. Cox, W. Fleming, H. J. Jang, S. Ornelas, S. A. Koay, S. Y. Thiberge, N. D. Daw, D. W. Tank, et al., "Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons," Nature, vol. 570, no. 7762, pp. 509–513, 2019.
  • [31] M. W. Howe, P. L. Tierney, S. G. Sandberg, P. E. Phillips, and A. M. Graybiel, "Prolonged dopamine signalling in striatum signals proximity and value of distant rewards," Nature, vol. 500, no. 7464, pp. 575–579, 2013.
  • [32] J. W. Barter, S. Li, D. Lu, R. A. Bartholomew, M. A. Rossi, C. T. Shoemaker, D. Salas-Meza, E. Gaidis, and H. H. Yin, "Beyond reward prediction errors: the role of dopamine in movement kinematics," Frontiers in Integrative Neuroscience, vol. 9, p. 39, 2015.
  • [33] N. F. Parker, C. M. Cameron, J. P. Taliaferro, J. Lee, J. Y. Choi, T. J. Davidson, N. D. Daw, and I. B. Witten, "Reward and choice encoding in terminals of midbrain dopamine neurons depends on striatal target," Nature Neuroscience, vol. 19, no. 6, p. 845, 2016.
  • [34] N. Uchida, N. Eshel, and M. Watabe-Uchida, "Division of labor for division: inhibitory interneurons with different spatial landscapes in the olfactory system," Neuron, vol. 80, no. 5, pp. 1106–1109, 2013.
  • [35] K. H. Shankar and M. W. Howard, "A scale-invariant internal representation of time," Neural Computation, vol. 24, no. 1, pp. 134–193, 2012.
  • [36] I. J. Day, "On the inversion of diffusion NMR data: Tikhonov regularization and optimal choice of the regularization parameter," Journal of Magnetic Resonance, vol. 211, no. 2, pp. 178–185, 2011.
  • [37] A. E. Yagle, "Regularized matrix computations," matrix, vol. 500, p. 10, 2005.
  • [38] Z. Tiganj, K. H. Shankar, and M. W. Howard, "Scale invariant value computation for reinforcement learning in continuous time," in AAAI Spring Symposia, 2017.
  • [39] W. R. Stauffer, A. Lak, and W. Schultz, "Dopamine reward prediction error responses reflect marginal utility," Current Biology, vol. 24, no. 21, pp. 2491–2500, 2014.