
# A Distributional Perspective on Reinforcement Learning

ICML, (2017): 449-458



Abstract

In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature s…


Introduction

- One of the major tenets of reinforcement learning states that, when not otherwise constrained in its behaviour, an agent should aim to maximize its expected utility Q, or value (Sutton & Barto, 1998).
- Although the optimality operator is a contraction in expected value, it is not a contraction in any metric over distributions.
- These results provide evidence in favour of learning algorithms that model the effects of nonstationary policies.
- By modelling the value distribution within a DQN agent (Mnih et al., 2015), the authors obtain considerably increased performance across the gamut of benchmark Atari 2600 games, and achieve state-of-the-art performance on a number of games.
- The authors' results echo those of Veness et al (2015), who obtained extremely fast learning by predicting Monte Carlo returns.
- The authors believe this guesswork carries more benefits than costs.
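The expectation-based view described above can be written out: the value Q^π is the expectation of the random return Z^π and satisfies the standard Bellman equation (notation follows common reinforcement learning convention; it is not reproduced verbatim from this summary):

```latex
Q^\pi(x,a) \;=\; \mathbb{E}\,Z^\pi(x,a)
           \;=\; \mathbb{E}\,[R(x,a)] \;+\; \gamma\,
           \mathbb{E}_{x' \sim P(\cdot \mid x,a),\; a' \sim \pi(\cdot \mid x')}
           \big[\,Q^\pi(x',a')\,\big]
```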

Highlights

- One of the major tenets of reinforcement learning states that, when not otherwise constrained in its behaviour, an agent should aim to maximize its expected utility Q, or value (Sutton & Barto, 1998)
- The distributional Bellman equation states that the distribution of Z is characterized by the interaction of three random variables: the reward R, the next state-action pair (X′, A′), and its random return Z(X′, A′).
- We believe the value distribution has a central role to play in reinforcement learning
- Basing ourselves on results by Rosler (1992) we show that, for a fixed policy, the Bellman operator over value distributions is a contraction in a maximal form of the Wasserstein metric
- Approximating the full distribution mitigates the effects of learning from a nonstationary policy. We argue that this approach makes approximate reinforcement learning significantly better behaved
- We found that learning value distributions is a powerful notion that allows us to surpass most gains previously made on Atari 2600, without further algorithmic adjustments
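In standard notation (again following common convention rather than this summary), the distributional Bellman equation and the contraction result in the maximal form of the Wasserstein metric, written d̄_p, read:

```latex
Z^\pi(x,a) \;\stackrel{D}{=}\; R(x,a) + \gamma\, Z^\pi(X',A'),
\qquad X' \sim P(\cdot \mid x,a),\;\; A' \sim \pi(\cdot \mid X'),
\\[1ex]
\bar{d}_p\big(\mathcal{T}^\pi Z_1,\, \mathcal{T}^\pi Z_2\big)
\;\le\; \gamma\, \bar{d}_p(Z_1, Z_2),
```

where T^π is the policy-evaluation operator; the equality in distribution (the D over the equals sign) is what distinguishes this from the usual expectation-based Bellman equation.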

Results

**Evaluation on Atari 2600 Games**

To understand the approach in a complex setting, the authors applied the categorical algorithm to games from the Arcade Learning Environment (ALE; Bellemare et al., 2013).

- The authors use the DQN architecture (Mnih et al., 2015), but output the atom probabilities pi(x, a) instead of action-values, and chose VMAX = −VMIN = 10 from preliminary experiments over the training games.
- Figure 4 illustrates the typical value distributions the authors observed in the experiments
- In this example, three actions lead to the agent releasing its laser too early and eventually losing the game.
- The authors believe this is due to discretizing the diffusion process induced by γ.
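The categorical algorithm evaluated above represents the value distribution on a fixed support of atoms and projects the Bellman-shifted atoms back onto that support. A minimal sketch in plain Python, assuming 51 atoms on [VMIN, VMAX] = [−10, 10]; function and variable names are illustrative, not the authors' implementation:

```python
import math

# Minimal sketch of the categorical projection step (names are illustrative).
# The value distribution lives on N_ATOMS fixed atoms evenly spaced on
# [V_MIN, V_MAX]; the Bellman update shifts each atom to r + gamma * z_i,
# and each shifted atom's probability is split between its two nearest
# neighbours on the original support.
V_MIN, V_MAX, N_ATOMS = -10.0, 10.0, 51
DELTA_Z = (V_MAX - V_MIN) / (N_ATOMS - 1)
ATOMS = [V_MIN + i * DELTA_Z for i in range(N_ATOMS)]

def project_distribution(probs, reward, gamma=0.99):
    """Project the target distribution of r + gamma * Z onto the fixed support."""
    projected = [0.0] * N_ATOMS
    for p_i, z_i in zip(probs, ATOMS):
        tz = min(max(reward + gamma * z_i, V_MIN), V_MAX)  # clip to support
        b = (tz - V_MIN) / DELTA_Z                          # fractional atom index
        lo, hi = math.floor(b), math.ceil(b)
        if lo == hi:                       # shifted atom lands exactly on an atom
            projected[lo] += p_i
        else:                              # split mass proportionally to distance
            projected[lo] += p_i * (hi - b)
            projected[hi] += p_i * (b - lo)
    return projected

# Example: projecting a uniform distribution preserves total probability mass.
uniform = [1.0 / N_ATOMS] * N_ATOMS
target = project_distribution(uniform, reward=1.0)
print(round(sum(target), 6))  # -> 1.0
```

In the full algorithm, this projected distribution serves as the cross-entropy target for the predicted atom probabilities pi(x, a).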

Conclusion

- In this work the authors sought a more complete picture of reinforcement learning, one that involves value distributions.
- The authors found that learning value distributions is a powerful notion that allows them to surpass most gains previously made on Atari 2600, without further algorithmic adjustments.
- Why does learning a distribution matter?
- The distinction the authors wish to make is that learning distributions matters in the presence of approximation.
- When combined with function approximation, this instability may prevent the policy from converging, a phenomenon Gordon (1995) called chattering.
- The authors believe the gradient-based categorical algorithm is able to mitigate these effects by effectively averaging the different distributions.


References

- Azar, Mohammad Gheshlaghi, Munos, Remi, and Kappen, Hilbert. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the International Conference on Machine Learning, 2012.
- Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Bellemare, Marc G., Danihelka, Ivo, Dabney, Will, Mohamed, Shakir, Lakshminarayanan, Balaji, Hoyer, Stephan, and Munos, Remi. The cramer distance as a solution to biased wasserstein gradients. arXiv, 2017.
- Bellman, Richard E. Dynamic programming. Princeton University Press, Princeton, NJ, 1957.
- Bertsekas, Dimitri P. and Tsitsiklis, John N. NeuroDynamic Programming. Athena Scientific, 1996.
- Bickel, Peter J. and Freedman, David A. Some asymptotic theory for the bootstrap. The Annals of Statistics, pp. 1196–1217, 1981.
- Billingsley, Patrick. Probability and measure. John Wiley & Sons, 1995.
- Caruana, Rich. Multitask learning. Machine Learning, 28 (1):41–75, 1997.
- Chung, Kun-Jen and Sobel, Matthew J. Discounted mdps: Distribution functions and exponential utility maximization. SIAM Journal on Control and Optimization, 25(1): 49–62, 1987.
- Dearden, Richard, Friedman, Nir, and Russell, Stuart. Bayesian Q-learning. In Proceedings of the National Conference on Artificial Intelligence, 1998.
- Engel, Yaakov, Mannor, Shie, and Meir, Ron. Reinforcement learning with gaussian processes. In Proceedings of the International Conference on Machine Learning, 2005.
- Geist, Matthieu and Pietquin, Olivier. Kalman temporal differences. Journal of Artificial Intelligence Research, 39:483–532, 2010.
- Gordon, Geoffrey. Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning, 1995.
- Harutyunyan, Anna, Bellemare, Marc G., Stepleton, Tom, and Munos, Remi. Q(λ) with off-policy corrections. In Proceedings of the Conference on Algorithmic Learning Theory, 2016.
- Hoffman, Matthew D., de Freitas, Nando, Doucet, Arnaud, and Peters, Jan. An expectation maximization algorithm for continuous markov decision processes with arbitrary reward. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.
- Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z, Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. Proceedings of the International Conference on Learning Representations, 2017.
- Jaquette, Stratton C. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, 1(3):496–505, 1973.
- Kakade, Sham and Langford, John. Approximately optimal approximate reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2002.
- Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. Proceedings of the International Conference on Learning Representations, 2015.
- Lattimore, Tor and Hutter, Marcus. PAC bounds for discounted MDPs. In Proceedings of the Conference on Algorithmic Learning Theory, 2012.
- Mannor, Shie and Tsitsiklis, John N. Mean-variance optimization in Markov decision processes. In Proceedings of the International Conference on Machine Learning, 2011.
- McCallum, Andrew K. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester, 1995.
- Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529– 533, 2015.
- Morimura, Tetsuro, Sugiyama, Masashi, Kashima, Hisashi, Hachiya, Hirotaka, and Tanaka, Toshiyuki. Parametric return density estimation for reinforcement learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010a.
- Morimura, Tetsuro, Sugiyama, Masashi, Kashima, Hisashi, Hachiya, Hirotaka, and Tanaka, Toshiyuki. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 799–806, 2010b.
- Nair, Arun, Srinivasan, Praveen, Blackwell, Sam, Alcicek, Cagdas, Fearon, Rory, De Maria, Alessandro, Panneershelvam, Vedavyas, Suleyman, Mustafa, Beattie, Charles, Petersen, Stig, et al. Massively parallel methods for deep reinforcement learning. In ICML Workshop on Deep Learning, 2015.
- Prashanth, LA and Ghavamzadeh, Mohammad. Actorcritic algorithms for risk-sensitive mdps. In Advances in Neural Information Processing Systems, 2013.
- Puterman, Martin L. Markov Decision Processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 1994.
- Rosler, Uwe. A fixed point theorem for distributions. Stochastic Processes and their Applications, 42(2):195– 214, 1992.
- Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations, 2016.
- Sobel, Matthew J. The variance of discounted markov decision processes. Journal of Applied Probability, 19(04): 794–802, 1982.
- Sutton, Richard S. and Barto, Andrew G. Reinforcement learning: An introduction. MIT Press, 1998.
- Sutton, R.S., Modayil, J., Delp, M., Degris, T., Pilarski, P.M., White, A., and Precup, D. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the International Conference on Autonomous Agents and Multiagents Systems, 2011.
- Toussaint, Marc and Storkey, Amos. Probabilistic inference for solving discrete and continuous state markov decision processes. In Proceedings of the International Conference on Machine Learning, 2006.
- Tsitsiklis, John N. On the convergence of optimistic policy iteration. Journal of Machine Learning Research, 3:59– 72, 2002.
- Utgoff, Paul E. and Stracuzzi, David J. Many-layered learning. Neural Computation, 14(10):2497–2529, 2002.
- Van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, 2016.
- van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
- Veness, Joel, Bellemare, Marc G., Hutter, Marcus, Chua, Alvin, and Desjardins, Guillaume. Compress and control. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
- Wang, Tao, Lizotte, Daniel, Bowling, Michael, and Schuurmans, Dale. Dual representations for dynamic programming. Journal of Machine Learning Research, pp. 1–29, 2008.
- Wang, Ziyu, Schaul, Tom, Hessel, Matteo, Hasselt, Hado van, Lanctot, Marc, and de Freitas, Nando. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.
- White, D. J. Mean, variance, and probabilistic criteria in finite markov decision processes: a review. Journal of Optimization Theory and Applications, 56(1):1–29, 1988.
- Tamar, Aviv, Di Castro, Dotan, and Mannor, Shie. Learning the variance of the reward-to-go. Journal of Machine Learning Research, 17(13):1–36, 2016.
- Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
