# A local temporal difference code for distributional reinforcement learning

NeurIPS 2020

Abstract

Recent theoretical and experimental results suggest that the dopamine system implements distributional temporal difference backups, allowing learning of the entire distributions of the long-run values of states rather than just their expected values. However, the distributional codes explored so far rely on a complex imputation step which...

Introduction

- In the traditional Reinforcement Learning (RL) framework, agents make decisions by learning and maximizing the scalar values of states, which quantify the expected sums of discounted future rewards that will be encountered from those states [1].
- In addition to the value distribution, the temporal evolution of the immediate reward distribution can be recovered from the code by applying an inverse Laplace operator.
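The scalar-value learning described above is standard tabular TD(0). As a minimal sketch (the chain task and parameter values here are illustrative, not taken from the paper), each backup nudges a state's value toward the reward plus the discounted value of the successor state:

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular TD(0) backup of V[s] toward r + gamma * V[s_next]."""
    delta = r + gamma * V[s_next] - V[s]  # TD prediction error
    V[s] += alpha * delta
    return delta

# Toy 3-state chain: s0 -> s1 -> s2 (terminal), reward 1 on the final step.
V = np.zeros(3)
for _ in range(2000):
    td0_update(V, 0, 0.0, 1)
    td0_update(V, 1, 1.0, 2)  # s2 is terminal, so V[2] stays 0
# V[1] converges to 1.0 and V[0] to gamma * V[1] = 0.9
```

The fixed point is the expected discounted return from each state, which is exactly the scalar quantity the distributional codes discussed below generalize.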

Highlights

- We developed a distributional RL code for value that has two major advantages over the one recently proposed by Dabney et al [10]: (1) it allows the agent to recover the value distributions and the temporal evolution of the reward distribution; and (2) it can be learned with a local learning rule, as opposed to the rule of Dabney et al which requires that a unit knows the states of other units to update its parameters
- The latter property is appealing and realistic from a biological point of view and leads to a significantly simpler convergence analysis compared to non-local codes, for which convergence needs to be proven in a distributional sense
- We showed that our distributional code can be computed linearly from an ensemble of systems computing the successor representation with different temporal discounts, a model proposed for the hippocampus [11, 22]
- For the first two of these dimensions (reward magnitude and temporal discount), dopamine neurons in the Ventral Tegmental Area (VTA) apparently code for varying reward magnitudes [19]; and it has been proposed that value functions and/or temporal difference (TD) prediction errors for different values of γ are arranged in a spatially organized manner along the dorso-ventral axis of the striatum [23, 24]
- It is still unclear if time horizon is separated from temporal discount in the dopamine system, but some experimental results suggest a discrete coding of temporal horizon [27]
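The locality highlighted above can be sketched concretely: each unit transforms the reward through its own nonlinearity and then runs an ordinary TD(0) backup, needing no information about other units. The exponential basis f_h(r) = exp(-h·r) used here is an assumption motivated by the "Laplace code" name; the task and parameters are hypothetical:

```python
import numpy as np

def local_td_update(V, h, s, r, s_next, alpha=0.05, gamma=0.9):
    """A single unit's TD(0) backup on its transformed reward f_h(r).
    The update uses only this unit's own table V, so the rule is local."""
    delta = np.exp(-h * r) + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

hs = np.linspace(0.1, 2.0, 8)         # ensemble of reward sensitivities h
Vs = {h: np.zeros(3) for h in hs}     # one independent value table per unit
# Same toy chain: s0 -> s1 -> s2 (terminal), reward 1 on the final step.
for _ in range(3000):
    for h in hs:
        local_td_update(Vs[h], h, 0, 0.0, 1)
        local_td_update(Vs[h], h, 1, 1.0, 2)
# Each unit converges to E[sum_t gamma^t f_h(r_t)]:
#   Vs[h][1] -> exp(-h),  Vs[h][0] -> exp(0) + gamma * exp(-h)
```

Because every unit's fixed point is a Laplace-transform-like functional of the reward sequence, the ensemble jointly encodes the reward distribution, which is what makes the inverse operator in the Results usable.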

Results

- If the resolution along the h-dimension is high enough, applying L^{-1} to the convergence points of the TD backups in Eq. 10 recovers {P_τ}_{τ=0}^{T}, the set of probability distributions of immediate rewards at all future timesteps up to T, given that the current state is s_t.
- As illustrated in Fig. 3, the Laplace code recovers a temporal map of the problem, indicating all the possible rewards at all future times.
- The authors' local code recovers the value distribution and the temporal evolution of rewards {P_τ}.
- In Fig. 5d the authors compare the two codes' flexibility under a horizon change: unlike the Laplace code, the Expectile code's estimates need to re-converge to the new value distribution at s under the new horizon T.
- The authors' code decomposes value distributions and prediction errors across three separated dimensions: reward magnitude, temporal discounting and time horizon.
- In a more general framework the convergence points of the Laplace code are a form of General Value Function [28], which has been proposed as a unifying system to learn about many different variables from the same line of experience [29].
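Inverting a Laplace-type operator from finitely many h-values is ill-posed, which is presumably why the reference list points at Tikhonov regularization [36, 37]. The sketch below is a hypothetical discretized version: build the forward operator M[i, j] = exp(-h_i · r_j) on reward and sensitivity grids, then recover the reward distribution by regularized least squares (all grids and the lam value are illustrative choices, not the paper's):

```python
import numpy as np

# Discretized forward Laplace operator on a reward grid.
hs = np.linspace(0.1, 5.0, 40)   # sensitivities h
rs = np.linspace(0.0, 3.0, 30)   # candidate reward magnitudes r
M = np.exp(-np.outer(hs, rs))    # M[i, j] = exp(-h_i * r_j)

# Ground-truth two-point reward distribution and its Laplace-code readout.
p_true = np.zeros_like(rs)
p_true[10], p_true[25] = 0.7, 0.3
v = M @ p_true                   # what the ensemble of V_h units would encode

# Tikhonov-regularized inverse: argmin_p ||M p - v||^2 + lam * ||p||^2.
lam = 1e-6
p_hat = np.linalg.solve(M.T @ M + lam * np.eye(len(rs)), M.T @ v)
# p_hat approximately reproduces the encoded readout: M @ p_hat ~= v
```

The regularizer trades a small bias for numerical stability; with noisy dopamine-like signals a larger lam would be needed, at the cost of a smoother (less peaked) recovered distribution.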

Conclusion

- If rewards always occur at the same time and are distributed according to the distribution D, V_h in the Laplace code effectively learns E_{r∼D}[f_h(r)], the expected value of f_h(r) when rewards are distributed as D.
- The additional error signals provided by the code could allow the system to learn richer representations than traditional RL [7], and possibly even richer than those learnt using an Expectile code, since hidden representations must distinguish between states with the same value distribution but different temporal evolution of the immediate reward distribution.
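The claim that V_h converges to E_{r∼D}[f_h(r)] can be checked with a one-unit delta rule on sampled rewards. The distribution D, the step size, and the choice f_h(r) = exp(-h·r) below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
h, alpha = 1.0, 0.01
V_h = 0.0
# Rewards at a fixed time, drawn from D = {1 w.p. 0.5, 2 w.p. 0.5}.
for _ in range(20000):
    r = rng.choice([1.0, 2.0])
    V_h += alpha * (np.exp(-h * r) - V_h)  # delta rule toward f_h(r)

# Analytic target: E_{r~D}[f_h(r)] = 0.5 * e^{-1} + 0.5 * e^{-2}
target = 0.5 * np.exp(-1.0) + 0.5 * np.exp(-2.0)
```

After many samples V_h fluctuates around the target with variance of order alpha, illustrating that each unit's asymptote is a fixed statistic of D rather than a single sample.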


Funding

- A.P. and P.T. were supported by SNF grant #315230_197296. P.D. was supported by the Max Planck Society and the Alexander von Humboldt Foundation. The authors have no conflicts of interest.

Reference

- [1] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, vol. 135. MIT Press, Cambridge, 1998.
- [2] M. F. Rushworth and T. E. Behrens, “Choice, uncertainty and value in prefrontal and cingulate cortex,” Nature neuroscience, vol. 11, no. 4, p. 389, 2008.
- [3] A. Kepecs and Z. F. Mainen, “A computational framework for the study of confidence in humans and animals,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 367, no. 1594, pp. 1322–1337, 2012.
- [4] H. Stojic, J. L. Orquin, P. Dayan, R. J. Dolan, and M. Speekenbrink, “Uncertainty in learning, choice, and visual fixation,” Proceedings of the National Academy of Sciences, vol. 117, no. 6, pp. 3291–3300, 2020.
- [5] T. E. Behrens, M. W. Woolrich, M. E. Walton, and M. F. Rushworth, “Learning the value of information in an uncertain world,” Nature neuroscience, vol. 10, no. 9, pp. 1214–1221, 2007.
- [6] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 449–458, JMLR. org, 2017.
- [7] C. Lyle, M. G. Bellemare, and P. S. Castro, “A comparative analysis of expected and distributional reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4504–4511, 2019.
- [8] M. Rowland, M. G. Bellemare, W. Dabney, R. Munos, and Y. W. Teh, “An analysis of categorical distributional reinforcement learning,” arXiv preprint arXiv:1802.08163, 2018.
- [9] M. Rowland, R. Dadashi, S. Kumar, R. Munos, M. G. Bellemare, and W. Dabney, “Statistics and samples in distributional reinforcement learning,” arXiv preprint arXiv:1902.08102, 2019.
- [10] W. Dabney, Z. Kurth-Nelson, N. Uchida, C. K. Starkweather, D. Hassabis, R. Munos, and M. Botvinick, “A distributional code for value in dopamine-based reinforcement learning,” Nature, pp. 1–5, 2020.
- [11] I. Momennejad and M. W. Howard, “Predicting the future with multi-scale successor representations,” BioRxiv, p. 449470, 2018.
- [12] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine learning, vol. 3, no. 1, pp. 9–44, 1988.
- [13] W. Schultz, “Predictive reward signal of dopamine neurons,” Journal of neurophysiology, vol. 80, no. 1, pp. 1–27, 1998.
- [14] P. Dayan and L. F. Abbott, Theoretical neuroscience: computational and mathematical modeling of neural systems. Computational Neuroscience Series, 2001.
- [15] D. J. Foster and M. A. Wilson, “Reverse replay of behavioural sequences in hippocampal place cells during the awake state,” Nature, vol. 440, no. 7084, pp. 680–683, 2006.
- [16] E. P. Simoncelli and B. A. Olshausen, “Natural image statistics and neural representation,” Annual review of neuroscience, vol. 24, no. 1, pp. 1193–1216, 2001.
- [17] K. Louie, P. W. Glimcher, and R. Webb, “Adaptive neural coding: from biological to behavioral decision-making,” Current opinion in behavioral sciences, vol. 5, pp. 91–99, 2015.
- [18] S. Hong, B. N. Lundstrom, and A. L. Fairhall, “Intrinsic gain modulation and adaptive neural coding,” PLoS Computational Biology, vol. 4, no. 7, 2008.
- [19] N. Eshel, J. Tian, M. Bukwich, and N. Uchida, “Dopamine neurons share common response function for reward prediction error,” Nature neuroscience, vol. 19, no. 3, p. 479, 2016.
- [20] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, “Distributional reinforcement learning with quantile regression,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [21] P. Dayan, “Improving generalization for temporal difference learning: The successor representation,” Neural Computation, vol. 5, no. 4, pp. 613–624, 1993.
- [22] I. Momennejad, E. M. Russek, J. H. Cheong, M. M. Botvinick, N. D. Daw, and S. J. Gershman, “The successor representation in human reinforcement learning,” Nature Human Behaviour, vol. 1, no. 9, pp. 680–692, 2017.
- [23] S. C. Tanaka, N. Schweighofer, S. Asahi, K. Shishida, Y. Okamoto, S. Yamawaki, and K. Doya, “Serotonin differentially regulates short-and long-term prediction of rewards in the ventral and dorsal striatum,” PloS one, vol. 2, no. 12, 2007.
- [24] X. Cai, S. Kim, and D. Lee, “Heterogeneous coding of temporally discounted values in the dorsal and ventral striatum during intertemporal choice,” Neuron, vol. 69, no. 1, pp. 170–182, 2011.
- [25] S. C. Tanaka, K. Shishida, N. Schweighofer, Y. Okamoto, S. Yamawaki, and K. Doya, “Serotonin affects association of aversive outcomes to past actions,” Journal of Neuroscience, vol. 29, no. 50, pp. 15669–15674, 2009.
- [26] N. Schweighofer, S. C. Tanaka, and K. Doya, “Serotonin and the evaluation of future rewards: theory, experiments, and possible neural mechanisms,” Annals of the New York Academy of Sciences, vol. 1104, no. 1, pp. 289–300, 2007.
- [27] E. S. Bromberg-Martin, M. Matsumoto, H. Nakahara, and O. Hikosaka, “Multiple timescales of memory in lateral habenula and dopamine neurons,” Neuron, vol. 67, no. 3, pp. 499–510, 2010.
- [28] R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup, “Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction,” in The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp. 761–768, 2011.
- [29] G. Comanici, D. Precup, A. Barreto, D. K. Toyama, E. Aygün, P. Hamel, S. Vezhnevets, S. Hou, and S. Mourad, “Knowledge representation for reinforcement learning using general value functions,” 2018.
- [30] B. Engelhard, J. Finkelstein, J. Cox, W. Fleming, H. J. Jang, S. Ornelas, S. A. Koay, S. Y. Thiberge, N. D. Daw, D. W. Tank, et al., “Specialized coding of sensory, motor and cognitive variables in vta dopamine neurons,” Nature, vol. 570, no. 7762, pp. 509–513, 2019.
- [31] M. W. Howe, P. L. Tierney, S. G. Sandberg, P. E. Phillips, and A. M. Graybiel, “Prolonged dopamine signalling in striatum signals proximity and value of distant rewards,” Nature, vol. 500, no. 7464, pp. 575–579, 2013.
- [32] J. W. Barter, S. Li, D. Lu, R. A. Bartholomew, M. A. Rossi, C. T. Shoemaker, D. Salas-Meza, E. Gaidis, and H. H. Yin, “Beyond reward prediction errors: the role of dopamine in movement kinematics,” Frontiers in integrative neuroscience, vol. 9, p. 39, 2015.
- [33] N. F. Parker, C. M. Cameron, J. P. Taliaferro, J. Lee, J. Y. Choi, T. J. Davidson, N. D. Daw, and I. B. Witten, “Reward and choice encoding in terminals of midbrain dopamine neurons depends on striatal target,” Nature neuroscience, vol. 19, no. 6, p. 845, 2016.
- [34] N. Uchida, N. Eshel, and M. Watabe-Uchida, “Division of labor for division: inhibitory interneurons with different spatial landscapes in the olfactory system,” Neuron, vol. 80, no. 5, pp. 1106–1109, 2013.
- [35] K. H. Shankar and M. W. Howard, “A scale-invariant internal representation of time,” Neural Computation, vol. 24, no. 1, pp. 134–193, 2012.
- [36] I. J. Day, “On the inversion of diffusion nmr data: Tikhonov regularization and optimal choice of the regularization parameter,” Journal of Magnetic Resonance, vol. 211, no. 2, pp. 178–185, 2011.
- [37] A. E. Yagle, “Regularized matrix computations,” matrix, vol. 500, p. 10, 2005.
- [38] Z. Tiganj, K. H. Shankar, and M. W. Howard, “Scale invariant value computation for reinforcement learning in continuous time.,” in AAAI Spring Symposia, 2017.
- [39] W. R. Stauffer, A. Lak, and W. Schultz, “Dopamine reward prediction error responses reflect marginal utility,” Current biology, vol. 24, no. 21, pp. 2491–2500, 2014.
