# Emphatic Algorithms for Deep Reinforcement Learning

ICML, pp. 5023–5033 (2021)

Abstract

Off-policy learning allows us to learn about possible policies of behavior from experience generated by a different behavior policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation and off-policy sampling; this is known as the "deadly triad". Emphatic temporal difference (ETD...

Introduction

- A Markov decision process (MDP; Bellman, 1957) consists of finite sets of states $\mathcal{S}$ and actions $\mathcal{A}$, a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, a transition distribution $P(s' \mid s, a)$ for $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, and a discount factor $\gamma$.
- A policy is a distribution over actions conditioned on the state: $\pi(a \mid s)$.
- Each state $S_t$ is associated with a feature vector $\phi_t$, and the agent's value estimates $V_\theta(s)$ are a parametric function of these features.
- TD(λ) (Sutton, 1988) is a widely used algorithm for policy evaluation in which, on each step $t$, the parameters of $V_\theta$ are updated from $\theta_t$ to $\theta_{t+1}$.
- TD(λ) uses bootstrapping: the agent's own value estimates $V_\theta(S_t)$ are used to update the values online, on each step, without waiting for episodes to fully resolve.
- TD algorithms can also be used to learn policies, either by using similar updates to learn action values or by combining value learning with policy gradients in actor-critic systems (Sutton et al, 2000).
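
The update equation for TD(λ) is cut off in the text above. As a rough illustration only, the sketch below uses the standard textbook form of TD(λ) with accumulating traces and a linear value function; this form, the function name `td_lambda_update`, and the step size and trace parameters are assumptions for illustration, not values taken from the paper.

```python
import numpy as np


def td_lambda_update(theta, trace, phi, phi_next, reward,
                     alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) step for a linear value function V(s) = theta . phi(s).

    Textbook form with accumulating traces (an assumption; the source
    equation is truncated). alpha, gamma and lam are illustrative values.
    """
    # Decay the old eligibility trace and add the current gradient.
    # For a linear value function the gradient of V(S_t) is phi_t itself.
    trace = gamma * lam * trace + phi
    # TD error: bootstraps from the agent's own estimate of the next state.
    delta = reward + gamma * (theta @ phi_next) - (theta @ phi)
    # Online parameter update, performed on every step.
    theta = theta + alpha * delta * trace
    return theta, trace


# Single transition between two one-hot states with reward 1.
theta = np.zeros(2)
trace = np.zeros(2)
phi_t = np.array([1.0, 0.0])
phi_next = np.array([0.0, 1.0])
theta, trace = td_lambda_update(theta, trace, phi_t, phi_next, reward=1.0)
```

Because the update bootstraps from `theta @ phi_next`, it can be applied after every transition rather than at the end of an episode.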

Highlights

- We extend the emphatic method to multi-step deep reinforcement learning (RL) targets, including an off-policy value-learning method known as "V-trace" (Espeholt et al, 2018) that is often used in actor-critic systems.
- We empirically analyze the properties of these new emphatic algorithms to observe how qualitative properties such as convergence, learning speed and variance manifest in practice. We examine these in the context of two small-scale diagnostic off-policy policy evaluation problems: (1) a two-state MDP, shown in Figure 1, commonly used to highlight the instability of off-policy temporal difference (TD) learning with function approximation, and (2) the Collision Problem, shown in Figure 2, used in prior work to highlight the advantages of ETD compared with gradient TD methods such as TDC (Ghiassian et al, 2018).
- In the mixed update scheme, we tested the windowed ETD(λ) (WETD) family of emphatic traces and its variants WETD-ACE and WEVtrace, applied to the Surreal baseline.
- On Atari, we proposed a baseline agent, Surreal, that achieved a strong median human normalized score of 403% and is suitable for testing off-policy learning on auxiliary controls.
- We observed improved performance when applying these algorithms at scale on classic Atari games from the Arcade Learning Environment.
- The WETD family was unstable at scale, whereas the NETD family performed well, in particular the emphatic actor-critic agent NETD-ACE.
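
To make the V-trace learning target mentioned above more concrete, here is a minimal sketch of the V-trace value targets as defined by Espeholt et al (2018), computed with a backward recursion over one trajectory. The function name `vtrace_targets`, the argument layout and the default clipping thresholds `rho_bar` and `c_bar` are assumptions for illustration; this is not the emphatic variant and not the implementation used by the authors.

```python
import numpy as np


def vtrace_targets(values, bootstrap_value, rewards, is_ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets v_s for one trajectory of length n.

    values[t]   : V(x_t) for t = 0..n-1
    is_ratios[t]: pi(a_t | x_t) / mu(a_t | x_t), the importance sampling ratio
    """
    values = np.asarray(values, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(is_ratios, dtype=float)
    rhos = np.minimum(rho_bar, ratios)  # clipped weights on the TD errors
    cs = np.minimum(c_bar, ratios)      # clipped weights on the trace
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)
    # Backward recursion:
    # v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    corrections = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections


# On-policy data (all ratios 1) with gamma = 1 reduces to n-step returns.
on_policy = vtrace_targets([0.0, 0.0], 0.0, [1.0, 1.0], [1.0, 1.0], gamma=1.0)
# Ratios of 0 (actions the target policy never takes) leave the values unchanged.
cut = vtrace_targets([0.5, 0.2], 0.0, [1.0, 1.0], [0.0, 0.0])
```

Clipping the ratios at `rho_bar` and `c_bar` trades bias for lower variance, which is the same motivation the summary gives for clipping IS weights in the emphatic traces.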

Methods

- The authors' ultimate goal is to design emphatic algorithms that improve off-policy learning at scale, especially on actor-critic agents.
- The authors evaluated the emphatic algorithms on Atari games from the Arcade Learning Environment (Bellemare et al, 2013), a widely used deep RL benchmark.
- Data: the authors use the raw RGB pixel observations as provided by the environment, without downsampling or grayscaling them.
- The authors use an action repeat of 4, with max pooling over the last two frames and the life termination signal.
- This setup is similar to IMPALA (Espeholt et al, 2018); the only difference is the use of raw frames instead of downsampled, grayscaled ones.
- To compare with closely related previous work, the authors adopted the conventional 200M-frame training regime, using online updates without experience replay.
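
The action-repeat and max-pooling steps described above can be sketched as follows. The environment interface `step_fn(action) -> (frame, reward, done)`, the helper `make_fake_env`, and the choice to sum rewards over the repeated steps are assumptions made for illustration; the source only states the repeat count, the pooling over the last two frames, and the use of the life termination signal.

```python
import numpy as np


def repeat_action(step_fn, action, num_repeats=4):
    """Repeat `action` and max-pool the last two raw RGB frames.

    `step_fn` is a hypothetical environment interface returning
    (rgb_frame, reward, done), where `done` is assumed to also fire on
    loss of life (the life termination signal).
    """
    frames = []
    total_reward = 0.0
    done = False
    for _ in range(num_repeats):
        frame, reward, done = step_fn(action)
        frames.append(frame)
        total_reward += reward  # summing rewards is an assumption
        if done:
            break
    # Element-wise max over the last two frames; no downsampling or grayscaling.
    pooled = np.maximum(frames[-1], frames[-2]) if len(frames) > 1 else frames[-1]
    return pooled, total_reward, done


def make_fake_env(height=2, width=2):
    """Toy stand-in for an Atari environment, used only to exercise the code."""
    counter = {"t": 0}

    def step_fn(action):
        counter["t"] += 1
        frame = np.full((height, width, 3), counter["t"], dtype=np.uint8)
        return frame, 1.0, False

    return step_fn


frame, reward, done = repeat_action(make_fake_env(), action=0)
```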

Results

- The authors compute the median human normalized scores across 57 games, averaged over seeds and an evaluation phase without learning.
- To compare any two agents, the authors view their scores on 57 games as 57 independent pairs of samples, similar to how one would test the significance of a medical treatment on a population of different people, rather than testing the same treatment on the same person multiple times.
- The p-value is computed using the sign test (Arbuthnot, 1712): it is the probability, under the null hypothesis, of observing a difference at least as extreme as the one measured.
- [Figure legend: NEVtrace on Surreal, NETD on Surreal, Surreal, WETD on Surreal, StacX (2020); x-axis: learning frames.]
- Results might be thought of as statistically significant when p < 0.05.
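
The evaluation procedure above can be illustrated with a small sketch: a human normalized score per game, its median across games, and a sign test on paired per-game scores. The normalization formula is the one commonly used for Atari; whether the authors used a one-sided or two-sided test, and how ties were handled, is not stated in this summary, so the one-sided form with ties dropped is an assumption. All function names and the toy scores are hypothetical.

```python
from math import comb
from statistics import median


def human_normalized(agent_score, random_score, human_score):
    """Human normalized score in percent: 0% = random play, 100% = human."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)


def sign_test_p_value(scores_a, scores_b):
    """One-sided sign test on paired per-game scores.

    Returns the probability, under the null hypothesis that A and B are
    equally likely to win a game, of A winning at least as many games as
    observed. Ties are dropped (an assumption).
    """
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return 1.0
    # Upper tail of a Binomial(n, 0.5) distribution.
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Toy data for five games (not results from the paper).
agent_a = [5, 6, 7, 8, 9]
agent_b = [1, 2, 3, 4, 10]
p_value = sign_test_p_value(agent_a, agent_b)  # A wins 4 of 5 games
median_score = median([human_normalized(50, 0, 100),
                       human_normalized(200, 0, 100),
                       human_normalized(400, 0, 100)])
```

Treating each game as one paired sample means the test only uses which agent won each game, not by how much, which keeps games with very different score scales comparable.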

Conclusion

- Since the emphatic traces are derived using the steady-state distributions following fixed policies, the authors expect them to affect the results more towards the end of learning, as the agent stabilizes its learned policy with learning rate decay.
- In the mixed update scheme, the authors tested the WETD family of emphatic traces and its variants WETD-ACE and WEVtrace, applied to the Surreal baseline.
- The new emphatic algorithm families, the WETD and NETD variants, showed good qualitative properties on off-policy diagnostic MDPs.
- For both families, clipping importance sampling (IS) weights when computing emphatic traces turned out to be an effective way to reduce variance, so the authors applied this clipping at scale.
- On Atari, the authors proposed a baseline agent, Surreal, that achieved a strong median human normalized score of 403% and is suitable for testing off-policy learning on auxiliary controls.
- The authors would like to investigate applying emphatic traces to a variety of off-policy learning targets and settings at scale.
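
As a rough illustration of why clipping IS weights reduces the variance of emphatic traces, the sketch below computes the one-step follow-on trace of the original ETD algorithm (Sutton et al, 2016), with optional clipping of the IS ratios. This is not the n-step NETD or WETD trace proposed in the paper, whose exact form is not given in this summary; the function names, the constant interest of 1 and the clipping threshold are assumptions.

```python
import numpy as np


def followon_trace(is_ratios, gamma=0.99, interest=1.0, clip=None):
    """Follow-on trace F_t = gamma * rho_{t-1} * F_{t-1} + i, with F_0 = i.

    If `clip` is given, each ratio is replaced by min(rho, clip) before it
    enters the trace (illustrative clipping rule).
    """
    rhos = np.asarray(is_ratios, dtype=float)
    if clip is not None:
        rhos = np.minimum(rhos, clip)
    trace = np.empty(len(rhos))
    f = interest
    for t in range(len(rhos)):
        if t > 0:
            f = gamma * rhos[t - 1] * f + interest
        trace[t] = f
    return trace


def emphasis(followon, lam=0.0, interest=1.0):
    """Emphasis M_t = lam * i + (1 - lam) * F_t that weights each update."""
    return lam * interest + (1.0 - lam) * np.asarray(followon, dtype=float)


ratios = [2.0, 2.0, 2.0]
unclipped = followon_trace(ratios, gamma=0.5)
clipped = followon_trace(ratios, gamma=0.5, clip=1.0)
```

Because the trace multiplies successive ratios together, large ratios compound over time; capping each ratio bounds that growth at the cost of some bias.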


- Table 1: Look-up table for our emphatic algorithms and the two baseline algorithms without emphatic traces. $\pi^*$ is the optimal policy for n-step TD learning. $\pi_\rho$ is the fixed point target policy of V-trace (Eq 6). When applied to Surreal (explained in Sec. 4) for large scale experiments, we always clip IS weights in computing emphatic traces to reduce variance, except for NEVtrace and WEVtrace.
- Table 2: Performance statistics for baseline Surreal and emphatic traces applied to Surreal in the fixed update scheme with n = 10, on 57 Atari games. Scores are human normalized, averaged across 3 random seeds and across the evaluation phase.
- Table 3: Network architecture (recovered entry: convolutions in block, (2, 2, 2, 2)).
- Table 4: Hyperparameters table.

Funding

- In Sec. 4, we demonstrate that combining emphatic traces with deep neural networks can improve performance on classic Atari video games, reporting the highest score to date for an RL agent without experience replay in the 200M frames data regime: 497% median human normalized score across 57 games, improved from the baseline performance of 403%.


Reference

- Arbuthnot, J. II. An argument for divine providence, taken from the constant regularity observ’d in the births of both sexes. By Dr. John Arbuthnott, Physitian in Ordinary to Her Majesty, and Fellow of the College of Physitians and the Royal Society. Philosophical Transactions of the Royal Society of London, 27(328):186–190, 1712.
- Baird, L. Residual algorithms: Reinforcement learning with function approximation. Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37, 1995.
- Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Bellman, R. A markovian decision process. Journal of Mathematics and Mechanics, 1957.
- Budden, D., Hessel, M., Quan, J., Kapturowski, S., Baumli, K., Bhupatiraju, S., Guy, A., and King, M. RLax: Reinforcement Learning in JAX, 2020. URL http://github.com/deepmind/rlax.
- Degris, T. and Modayil, J. Scaling-up knowledge for a cognizant robot. AAAI Spring Symposium: Designing Intelligent Robots, 2012.
- Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., and Kavukcuoglu, K. IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. CoRR, 2018.
- Ghiassian, S., Patterson, A., White, M., Sutton, R. S., and White, A. Online off-policy prediction. arXiv preprint arXiv:1811.02597, 2018.
- Hallak, A., Tamar, A., Munos, R., and Mannor, S. Generalized emphatic temporal difference learning: Bias-variance analysis. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), 2016.
- Hennigan, T., Cai, T., Norman, T., and Babuschkin, I. Haiku: Sonnet for JAX, 2020. URL http://github.com/deepmind/dm-haiku.
- Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with popart. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3796–3803, 2019.
- Hessel, M., Budden, D., Viola, F., Rosca, M., Sezener, E., and Hennigan, T. Optax: composable gradient transformation and optimisation, in JAX!, 2020. URL http://github.com/deepmind/optax.
- Hessel, M., Kroiss, M., Clark, A., Kemaev, I., Quan, J., Keck, T., Viola, F., and van Hasselt, H. Podracer architectures for scalable reinforcement learning. 2021. URL https://arxiv.org/pdf/2104.06272.pdf.
- Imani, E., Graves, E., and White, M. An off-policy policy gradient theorem using emphatic weightings. Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS 2018), 2018.
- Jaderberg, M., Mnih, V., Czarnecki, W., Schaul, T., Leibo, J., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. ICLR, 2017.
- Lin, L. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3):293–321, 1992.
- Mahmood, A. R., Yu, H., and Sutton, R. S. Multi-step off-policy learning without importance sampling ratios. arXiv preprint arXiv:1702.03006, 2017.
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 2015.
- Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. ICML, pp. 417–424, 2001.
- Sutton, R. S. Implementation details of the td(λ) procedure for the case of vector predictions and backpropagation. GTE Laboratories Technical Note TN87-509.1, 1987.
- Sutton, R. S. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
- Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 2018.
- Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 13, 12:1057–1063, 2000.
- Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. pp. 993–1000. ACM, 2009.
- Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp. 761–768, 2011.
- Sutton, R. S., Mahmood, A. R., and White, M. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1):2603–2631, 2016.
- Tsitsiklis, J. N. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
- van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. Deep reinforcement learning and the deadly triad. CoRR, abs/1812.02648, 2018. URL http://arxiv.org/abs/1812.02648.
- van Hasselt, H., Madjiheurem, S., Hessel, M., Silver, D., Barreto, A., and Borsa, D. Expected eligibility traces. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11):9997–10005, May 2021.
- Watkins, C. J. C. H. Learning from delayed rewards. 1989.
- Yu, H. On convergence of emphatic temporal-difference learning. JMLR: Workshop and Conference Proceedings, 40:1–28, 2015.
- Zahavy, T., Xu, Z., Veeriah, V., Hessel, M., Oh, J., van Hasselt, H., Silver, D., and Singh, S. A self-tuning actor-critic algorithm. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), 2020.
