# Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning.

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), (2019): 14111-14121

Abstract

In an effort to better understand the different ways in which the discount factor affects the optimization process in reinforcement learning, we designed a set of experiments to study each effect in isolation. Our analysis reveals that the common perception that poor performance of low discount factors is caused by (too) small action-gaps...

Introduction

- In reinforcement learning (RL), the objective that one wants to optimize is often best described as an undiscounted sum of rewards; a discount factor is merely introduced to avoid some of the optimization challenges that can occur when directly optimizing an undiscounted objective [Bertsekas and Tsitsiklis, 1996].
- One surprising finding is that for some problems a low discount factor can result in better asymptotic performance when a finite-horizon, undiscounted objective is indirectly optimized through the proxy of an infinite-horizon, discounted sum.
- This motivates the authors to look deeper into the effect of the discount factor on the optimization process.

Highlights

- Combined with the observation from Section 3.1 that the best discount factor is task-dependent, and the convergence proof in the supplementary material, which guarantees that logarithmic Q-learning converges to the same policy as regular Q-learning, these results demonstrate that logarithmic Q-learning is able to solve tasks that are challenging to solve with Q-learning
- Our results provide strong evidence for our hypothesis that large differences in action-gap sizes are detrimental to the performance of approximate reinforcement learning
- The error-landscape required to bring the approximation error below the action-gap across the state-space has a very different shape if the action-gap is orders of magnitude different in size across the state-space. This mismatch between the required error-landscape and that produced by the L2-norm might lead to an ineffective use of the function approximator (Figure 9: Human-normalized mean and median scores on 55 Atari games for LogDQN and various other algorithms)
- We believe a possible reason could be that, since such low values are very different from the original DQN settings, some of the other DQN hyper-parameters might no longer be ideal in the low discount factor region
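
The point above about action-gaps differing by orders of magnitude across the state-space can be illustrated with a small worked example. The sketch below is not the chain task from the paper and does not implement logarithmic Q-learning; it assumes a hypothetical deterministic chain (the function name `chain_action_gaps`, the reward of 1 at the right end, and the two actions are all illustrative choices) and computes the optimal Q-values in closed form. It shows that with a low discount factor the action-gap shrinks geometrically with distance from the reward, while the gap between the logarithms of the Q-values stays constant away from the left boundary.

```python
import math


def chain_action_gaps(num_states: int, gamma: float):
    """Action-gaps of the optimal Q-function on a hypothetical chain.

    Assumed setup (illustrative only, not taken from the paper):
    states 0..num_states-1; taking 'right' in the last state ends the
    episode with reward 1; 'left' moves one state back (state 0 stays
    in place); every other reward is 0.
    """
    gaps, log_gaps = [], []
    for s in range(num_states):
        # 'right' is optimal: the reward arrives after (num_states - s) steps,
        # so it is discounted (num_states - s - 1) times.
        q_right = gamma ** (num_states - s - 1)
        # 'left' moves to the previous state, then the agent acts optimally.
        next_s = max(s - 1, 0)
        q_left = gamma * gamma ** (num_states - next_s - 1)
        gaps.append(q_right - q_left)
        log_gaps.append(math.log(q_right) - math.log(q_left))
    return gaps, log_gaps


if __name__ == "__main__":
    gaps, log_gaps = chain_action_gaps(num_states=10, gamma=0.1)
    for s, (g, lg) in enumerate(zip(gaps, log_gaps)):
        print(f"state {s}: gap={g:.3e}  gap in log-space={lg:.3f}")
```

In this toy setting a function approximator trained with a single error tolerance would need far higher precision in states far from the reward than in states near it, which is one way to read the mismatch described in the bullets above.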

Methods

- The authors test the method by returning to the full version of the chain task and the same performance metric F as used in Section 3.2, which measures whether or not the greedy policy is optimal.
- Figure 8 plots the result for early learning as well as the final performance
- Comparing these graphs with the graphs from Figure 3 shows that logarithmic Q-learning has successfully resolved the optimization issues of regular Q-learning related to the use of low discount factors.
- The authors then test the approach in a more complex setting, in a comparison involving an agent that implements the method, which they refer to as LogDQN.

Conclusion

- The error-landscape required to bring the approximation error below the action-gap across the state-space has a very different shape if the action-gap is orders of magnitude different in size across the state-space.
- This mismatch between the required error-landscape and that produced by the L2-norm might lead to an ineffective use of the function approximator.
- An interesting future direction would be to re-evaluate some of the other hyper-parameters in the low discount factor region

References

- Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.
- Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
- Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, pages 703–710, 1994.
- Fabio Pardo, Arash Tavakoli, Vitaly Levdik, and Petar Kormushev. Time limits in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4045–4054, 2018.
- Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, pages 1038–1044, 1996.
- Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip Thomas, and Rémi Munos. Increasing the action gap: New operators for reinforcement learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 1476–1483, 2016.
- Amir-massoud Farahmand. Action-gap phenomenon in reinforcement learning. In Advances in Neural Information Processing Systems, pages 172–180, 2011.
- Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Vecerík, Matteo Hessel, Rémi Munos, and Olivier Pietquin. Observe and look further: Achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593, 2018.
- Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.
- Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
- Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 1995–2003, 2016.
- Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 449–458, 2017.
- Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1096–1105, 2018.
- Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 3215–3222, 2018.
