A Distributional Perspective on Reinforcement Learning.

ICML (2017): 449-458

Cited 461 | Viewed 328 | EI

Abstract

In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution...

Introduction
  • One of the major tenets of reinforcement learning states that, when not otherwise constrained in its behaviour, an agent should aim to maximize its expected utility Q, or value (Sutton & Barto, 1998).
  • Although the optimality operator is a contraction in expected value, it is not a contraction in any metric over distributions (both statements are written out in the equations after this list)
  • These results provide evidence in favour of learning algorithms that model the effects of nonstationary policies.
  • By modelling the value distribution within a DQN agent (Mnih et al, 2015), the authors obtain considerably increased performance across the gamut of benchmark Atari 2600 games, and achieve state-of-the-art performance on a number of games
  • The authors' results echo those of Veness et al (2015), who obtained extremely fast learning by predicting Monte Carlo returns.
  • It is the authors' belief that this guesswork carries more benefits than costs
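
For readers who want the statements above in symbols, the following LaTeX restates the standard expected-value Bellman equation and the usual contraction property of the optimality operator; these are textbook facts (Bertsekas & Tsitsiklis, 1996) written in the paper's notation, not additional results.

    % Expected-value Bellman equation for a fixed policy \pi
    Q^{\pi}(x, a) \;=\; \mathbb{E}\, R(x, a) \;+\; \gamma\, \mathbb{E}_{P,\,\pi}\, Q^{\pi}(x', a')

    % Bellman optimality operator: a \gamma-contraction in the supremum norm over
    % expected values, but not a contraction in any metric over return distributions
    (\mathcal{T} Q)(x, a) \;=\; \mathbb{E}\, R(x, a) \;+\; \gamma\, \mathbb{E}_{P} \max_{a'} Q(x', a'),
    \qquad
    \lVert \mathcal{T} Q_1 - \mathcal{T} Q_2 \rVert_{\infty} \;\le\; \gamma\, \lVert Q_1 - Q_2 \rVert_{\infty}.
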
Highlights
  • One of the major tenets of reinforcement learning states that, when not otherwise constrained in its behaviour, an agent should aim to maximize its expected utility Q, or value (Sutton & Barto, 1998)
  • The distributional Bellman equation states that the distribution of Z is characterized by the interaction of three random variables: the reward R, the next state-action (X', A'), and its random return Z(X', A') (written out after this list)
  • We believe the value distribution has a central role to play in reinforcement learning
  • Basing ourselves on results by Rosler (1992) we show that, for a fixed policy, the Bellman operator over value distributions is a contraction in a maximal form of the Wasserstein metric
  • Approximating the full distribution mitigates the effects of learning from a nonstationary policy. We argue that this approach makes approximate reinforcement learning significantly better behaved
  • We found that learning value distributions is a powerful notion that allows us to surpass most gains previously made on Atari 2600, without further algorithmic adjustments
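
To make the highlighted claims concrete, the distributional Bellman equation and the fixed-policy contraction result can be written as follows; the notation matches the bullets above (equality holds in distribution, (X', A') is the next state-action pair, and W_p is the p-Wasserstein metric), so this block is a restatement rather than new material.

    % Distributional Bellman equation: Z is characterized by the reward R,
    % the next state-action (X', A'), and its random return Z(X', A')
    Z^{\pi}(x, a) \;\overset{D}{=}\; R(x, a) \;+\; \gamma\, Z^{\pi}(X', A'),
    \qquad X' \sim P(\cdot \mid x, a), \;\; A' \sim \pi(\cdot \mid X').

    % For a fixed policy, the distributional Bellman operator \mathcal{T}^{\pi} is a
    % \gamma-contraction in the maximal form of the Wasserstein metric
    \bar{d}_p(Z_1, Z_2) \;:=\; \sup_{x, a} W_p\big(Z_1(x, a),\, Z_2(x, a)\big),
    \qquad
    \bar{d}_p\big(\mathcal{T}^{\pi} Z_1,\, \mathcal{T}^{\pi} Z_2\big) \;\le\; \gamma\, \bar{d}_p(Z_1, Z_2).
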
Results
  • Evaluation on Atari 2600 Games: To understand the approach in a complex setting, the authors applied the categorical algorithm to games from the Arcade Learning Environment (ALE; Bellemare et al, 2013).
  • The authors use the DQN architecture (Mnih et al, 2015), but output the atom probabilities p_i(x, a) instead of action-values, and chose V_MAX = −V_MIN = 10 from preliminary experiments over the training games (a sketch of the corresponding categorical projection follows this list).
  • Figure 4 illustrates the typical value distributions the authors observed in the experiments
  • In this example, three actions lead to the agent releasing its laser too early and eventually losing the game.
  • The authors believe this is due to the discretization of the diffusion process induced by γ
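
As a complement to the bullet on atom probabilities above, here is a minimal NumPy sketch of a categorical (C51-style) projection step, assuming 51 atoms and V_MAX = −V_MIN = 10 as reported; names such as project_categorical and next_probs are illustrative and not taken from any official implementation.

    import numpy as np

    # Fixed support of N atoms z_0 = V_MIN, ..., z_{N-1} = V_MAX.
    V_MIN, V_MAX, N_ATOMS = -10.0, 10.0, 51
    ATOMS = np.linspace(V_MIN, V_MAX, N_ATOMS)
    DELTA_Z = (V_MAX - V_MIN) / (N_ATOMS - 1)

    def project_categorical(reward, gamma, next_probs):
        """Project r + gamma * Z(x', a*) back onto the fixed support by splitting
        each shifted atom's probability between its two nearest neighbours."""
        m = np.zeros(N_ATOMS)
        tz = np.clip(reward + gamma * ATOMS, V_MIN, V_MAX)  # shifted, clipped atoms
        b = (tz - V_MIN) / DELTA_Z                          # fractional atom index
        lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
        for j in range(N_ATOMS):
            if lower[j] == upper[j]:                        # lands exactly on an atom
                m[lower[j]] += next_probs[j]
            else:
                m[lower[j]] += next_probs[j] * (upper[j] - b[j])
                m[upper[j]] += next_probs[j] * (b[j] - lower[j])
        return m                                            # target distribution

    # Toy usage: greedy action by expected value, then the cross-entropy objective.
    rng = np.random.default_rng(0)
    next_probs = rng.dirichlet(np.ones(N_ATOMS), size=4)    # stand-in p_i(x', a), 4 actions
    a_star = int(np.argmax(next_probs @ ATOMS))             # argmax_a sum_i z_i p_i(x', a)
    target = project_categorical(1.0, 0.99, next_probs[a_star])
    pred = rng.dirichlet(np.ones(N_ATOMS))                  # stand-in for p_i(x, a)
    loss = -np.sum(target * np.log(pred + 1e-8))            # minimized by gradient descent
    print(round(float(loss), 3))

In a full agent the target distribution would come from a separate target network, and the cross-entropy gradient would update only the online network's probabilities.
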
Conclusion
  • In this work the authors sought a more complete picture of reinforcement learning, one that involves value distributions.
  • The authors found that learning value distributions is a powerful notion that allows them to surpass most gains previously made on Atari 2600, without further algorithmic adjustments.
  • Why does learning a distribution matter?
  • The distinction the authors wish to make is that learning distributions matters in the presence of approximation.
  • When combined with function approximation, this instability may prevent the policy from converging, what Gordon (1995) called chattering.
  • The authors believe the gradient-based categorical algorithm is able to mitigate these effects by effectively averaging the different distributions.
References
  • Azar, Mohammad Gheshlaghi, Munos, Remi, and Kappen, Hilbert. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the International Conference on Machine Learning, 2012.
  • Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Bellemare, Marc G., Danihelka, Ivo, Dabney, Will, Mohamed, Shakir, Lakshminarayanan, Balaji, Hoyer, Stephan, and Munos, Remi. The Cramer distance as a solution to biased Wasserstein gradients. arXiv, 2017.
  • Bellman, Richard E. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
  • Bertsekas, Dimitri P. and Tsitsiklis, John N. Neuro-Dynamic Programming. Athena Scientific, 1996.
  • Bickel, Peter J. and Freedman, David A. Some asymptotic theory for the bootstrap. The Annals of Statistics, pp. 1196–1217, 1981.
  • Billingsley, Patrick. Probability and Measure. John Wiley & Sons, 1995.
  • Caruana, Rich. Multitask learning. Machine Learning, 28(1):41–75, 1997.
  • Chung, Kun-Jen and Sobel, Matthew J. Discounted MDPs: Distribution functions and exponential utility maximization. SIAM Journal on Control and Optimization, 25(1):49–62, 1987.
  • Dearden, Richard, Friedman, Nir, and Russell, Stuart. Bayesian Q-learning. In Proceedings of the National Conference on Artificial Intelligence, 1998.
  • Engel, Yaakov, Mannor, Shie, and Meir, Ron. Reinforcement learning with Gaussian processes. In Proceedings of the International Conference on Machine Learning, 2005.
  • Geist, Matthieu and Pietquin, Olivier. Kalman temporal differences. Journal of Artificial Intelligence Research, 39:483–532, 2010.
  • Gordon, Geoffrey. Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning, 1995.
  • Harutyunyan, Anna, Bellemare, Marc G., Stepleton, Tom, and Munos, Remi. Q(λ) with off-policy corrections. In Proceedings of the Conference on Algorithmic Learning Theory, 2016.
  • Hoffman, Matthew D., de Freitas, Nando, Doucet, Arnaud, and Peters, Jan. An expectation maximization algorithm for continuous Markov decision processes with arbitrary reward. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.
  • Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z., Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of the International Conference on Learning Representations, 2017.
  • Jaquette, Stratton C. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, 1(3):496–505, 1973.
  • Kakade, Sham and Langford, John. Approximately optimal approximate reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2002.
  • Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.
  • Lattimore, Tor and Hutter, Marcus. PAC bounds for discounted MDPs. In Proceedings of the Conference on Algorithmic Learning Theory, 2012.
  • Mannor, Shie and Tsitsiklis, John N. Mean-variance optimization in Markov decision processes. 2011.
  • McCallum, Andrew K. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1995.
  • Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Morimura, Tetsuro, Sugiyama, Masashi, Kashima, Hisashi, Hachiya, Hirotaka, and Tanaka, Toshiyuki. Parametric return density estimation for reinforcement learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010a.
  • Morimura, Tetsuro, Sugiyama, Masashi, Kashima, Hisashi, Hachiya, Hirotaka, and Tanaka, Toshiyuki. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 799–806, 2010b.
  • Nair, Arun, Srinivasan, Praveen, Blackwell, Sam, Alcicek, Cagdas, Fearon, Rory, De Maria, Alessandro, Panneershelvam, Vedavyas, Suleyman, Mustafa, Beattie, Charles, Petersen, Stig, et al. Massively parallel methods for deep reinforcement learning. In ICML Workshop on Deep Learning, 2015.
  • Prashanth, L. A. and Ghavamzadeh, Mohammad. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, 2013.
  • Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
  • Rosler, Uwe. A fixed point theorem for distributions. Stochastic Processes and their Applications, 42(2):195–214, 1992.
  • Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations, 2016.
  • Sobel, Matthew J. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.
  • Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2011.
  • Toussaint, Marc and Storkey, Amos. Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the International Conference on Machine Learning, 2006.
  • Tsitsiklis, John N. On the convergence of optimistic policy iteration. Journal of Machine Learning Research, 3:59–72, 2002.
  • Utgoff, Paul E. and Stracuzzi, David J. Many-layered learning. Neural Computation, 14(10):2497–2529, 2002.
  • Van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, 2016.
  • van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
  • Veness, Joel, Bellemare, Marc G., Hutter, Marcus, Chua, Alvin, and Desjardins, Guillaume. Compress and control. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
  • Wang, Tao, Lizotte, Daniel, Bowling, Michael, and Schuurmans, Dale. Dual representations for dynamic programming. Journal of Machine Learning Research, pp. 1–29, 2008.
  • Wang, Ziyu, Schaul, Tom, Hessel, Matteo, van Hasselt, Hado, Lanctot, Marc, and de Freitas, Nando. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.
  • White, D. J. Mean, variance, and probabilistic criteria in finite Markov decision processes: a review. Journal of Optimization Theory and Applications, 56(1):1–29, 1988.
  • Tamar, Aviv, Di Castro, Dotan, and Mannor, Shie. Learning the variance of the reward-to-go. Journal of Machine Learning Research, 17(13):1–36, 2016.
  • Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 (RMSProp): Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.