On the Convergence of Smooth Regularized Approximate Value Iteration Schemes

NeurIPS 2020 (2020)

Abstract

Entropy regularization, smoothing of Q-values and neural network function approximators are key components of state-of-the-art reinforcement learning (RL) algorithms, such as Soft Actor-Critic [1]. Despite the widespread use, the impact of these core techniques on the convergence of RL algorithms…

Introduction
  • Reinforcement learning (RL) algorithms face the challenge of maximizing the cumulative reward given a finite sample of environment transitions and an inexact representation of the policy and value function.
  • A number of techniques are commonly used in the large-scale RL setting, namely entropy regularization, smoothing of Q-values and neural network function approximation.
  • The authors carry out an error propagation analysis of abstract algorithms implementing entropy regularization and value smoothing, using the approximate dynamic-programming framework [13] (an illustrative regularized backup is sketched after this list).
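To make the entropy-regularization ingredient concrete, here is a minimal tabular sketch (not from the paper): a soft value-iteration step in which the hard max over actions is replaced by a temperature-scaled log-sum-exp, the closed form of the entropy-regularized Bellman backup. The function names, array shapes and the default temperature `tau` are illustrative assumptions.

```python
import numpy as np

def soft_bellman_backup(V, P, R, gamma=0.99, tau=0.1):
    """One entropy-regularized (soft) Bellman backup on a tabular MDP.

    V: (S,) current state values; P: (S, A, S) transition kernel;
    R: (S, A) expected rewards; tau: entropy temperature (> 0).
    """
    # Q(s, a) = R(s, a) + gamma * E_{s' ~ P(.|s, a)}[V(s')]
    Q = R + gamma * np.einsum("sap,p->sa", P, V)
    # Entropy-regularized "max" over actions: tau * log sum_a exp(Q / tau).
    # As tau -> 0 this recovers the hard max of standard value iteration.
    return tau * np.logaddexp.reduce(Q / tau, axis=1)

def soft_value_iteration(P, R, gamma=0.99, tau=0.1, n_iters=500):
    """Iterate the soft backup from V = 0; the operator is a contraction."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        V = soft_bellman_backup(V, P, R, gamma, tau)
    return V
```

The corresponding greedy policy is the softmax policy π(a|s) ∝ exp(Q(s, a)/τ), which is what ties this backup to maximum-entropy methods such as Soft Actor-Critic.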
Highlights
  • In practical settings, reinforcement learning (RL) algorithms face the challenge of maximizing the cumulative reward given a finite sample of environment transitions and an inexact representation of the policy and value function
  • State-of-the-art RL algorithms have been successful in solving complex environments, overcoming inaccuracies and their accumulation
  • A number of techniques are commonly used in the large-scale RL setting, namely entropy regularization, smoothing of Q-values and neural network function approximation
  • Besides entropy regularization and value smoothing, notable progress on complex environments has been achieved with the use of neural network function approximators [6]
  • Our work differs from the above-cited work since (1) we study a fixed value smoothing through a new type of Bellman operator, and (2) we provide an alternative analysis of regularized value iteration that highlights the effectiveness of entropy regularization
  • We have shown the implications of key components of state-of-the-art RL algorithms, such as value smoothing, entropy regularization and neural network approximators, on the convergence of actor-critic and value-based RL algorithms (a minimal sketch of the value-smoothing step follows this list)
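One common large-scale instantiation of the "smoothing of Q-values" mentioned above is the exponential (Polyak) averaging of target-network parameters used in Soft Actor-Critic; the sketch below assumes that reading, and the function name and default coefficient are illustrative, not taken from the paper.

```python
def polyak_update(target_params, online_params, beta=0.995):
    """Exponentially smooth the target parameters:
        theta_target <- beta * theta_target + (1 - beta) * theta_online

    beta close to 1 means heavy smoothing (slowly moving target values);
    beta = 0 degenerates to a hard copy of the online network.
    """
    return [beta * t + (1.0 - beta) * o
            for t, o in zip(target_params, online_params)]
```

In value-function space this roughly corresponds to interpolating between the previous value estimate and a fresh Bellman backup, which is the kind of smoothing the error bounds below are concerned with.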
Results
  • Motivated by the Soft Actor-Critic algorithm [1], the authors provide error bounds for an abstract algorithm that combines smoothing with regularization and utilises neural network function approximation.
  • The authors consider the regularized Bellman operator with regularization function given by the negative entropy and a time-varying regularization parameter.
  • Towards a dynamic temperature adjustment, one can upper bound the size of overestimation errors using the variance of Q-values [26, 3.3.2] and approximate the regularization gap using the scaled entropy of the current policy, see (12).
  • Motivated by the Soft Actor-Critic algorithm, the authors analyse an abstract algorithmic scheme that combines the entropy regularization (Sec. 4) with the value smoothing (Sec. 3).
  • The authors study the function approximation errors induced by the policy and value network as in the Soft Actor-Critic algorithm (Sec. 5.3).
  • Following Sec. 3.1, one can define the optimal smooth regularized Bellman operator TΩ,β and corresponding set of greedy policies GΩ,β.
  • For any initial value function V0, consider the smooth regularized AVI scheme (19) with smoothing parameter β ∈ [0, 1) and a positive time-varying temperature parameter (a tabular sketch of this update appears after this list).
  • The authors detail the approximation errors in the case of the neural network function class, utilised in the large-scale RL setting in general and in the Soft Actor-Critic algorithm in particular.
  • Smooth reg-AVI update: the authors consider that the value function is approximated by a value neural network V := Vθ ∈ ℝ^S through the minimization problem (22).
  • Denote by V(t) := Vθ|θ=θ(t) the value network with sufficiently large width, and by bk+1 := TΩ,βVk the Bellman update. If the limiting NTK of the neural network Vθ is positive definite, i.e. its smallest eigenvalue satisfies λmin(K) > 0, then the following contraction holds almost surely over all initializations θ(0) of the neural network
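Putting the two ingredients together, the smooth regularized AVI bullet can be illustrated with a tabular sketch that reuses `soft_bellman_backup` from the earlier snippet. The interpolation form V_{k+1} = β V_k + (1 − β) T_Ω V_k is an assumption chosen to match the smoothing parameter β ∈ [0, 1); the paper's operator TΩ,β and its error terms are defined precisely in its Sections 3–5.

```python
def smooth_regularized_avi_step(V, P, R, gamma=0.99, tau=0.1, beta=0.5):
    """One smooth regularized AVI update (illustrative form):

        V_{k+1} = beta * V_k + (1 - beta) * T_Omega(V_k),

    where T_Omega is the entropy-regularized backup sketched earlier.
    The interpolation form, parameter names and defaults are assumptions
    for illustration, not the paper's exact definition of T_{Omega, beta}.
    """
    return beta * V + (1.0 - beta) * soft_bellman_backup(V, P, R, gamma, tau)
```

With β = 0 this sketch reduces to plain regularized value iteration, and as τ → 0 it approaches standard approximate value iteration.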
Conclusion
  • The authors have shown the implications of key components of state-of-the-art RL algorithms, such as value smoothing, entropy regularization and neural network approximators, on the convergence of actor-critic and value-based RL algorithms.
  • The authors carried out an error propagation analysis of abstract algorithms implementing entropy regularization and value smoothing using the approximate dynamic-programming framework, and provided explicit bounds on the error to optimality.
  • The authors' analysis builds on top of the approximate dynamic-programming framework and might not cover all the implications of the above-mentioned techniques
Related Work
  • Prior works build their analysis on top of the approximate dynamic programming (ADP) framework [13]. Regularized ADP [8] unifies (relative-)entropy-regularized algorithms through the use of a regularization function. A related value-based scheme [9] generalizes entropy-regularized value iteration and gap-increasing methods. Another extension of ADP [15] proposes a value iteration algorithm with a time-varying degree of value smoothing. Yet another study [16] shows that a KL-divergence regularizer leads to an error-averaging effect. Our work differs from the above-cited works in that (1) we study a fixed value smoothing through a new type of Bellman operator, and (2) we provide an alternative analysis of regularized value iteration that highlights the effectiveness of entropy regularization.
References
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, pages 1861–1870, 2018.
  • Brian D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Machine Learning Department, Carnegie Mellon University, Dec 2010.
  • Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pages 202–211, 2016.
  • Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, pages 1587–1596, 2018.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
  • Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning, pages 2160–2169, 2019.
  • Tadashi Kozuno, Eiji Uchibe, and Kenji Doya. Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2995–3003, 2019.
  • Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems, pages 10564–10575, 2019.
  • Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. arXiv preprint arXiv:1909.02769, 2019.
  • Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy in policy learning. In Proceedings of the 36th International Conference on Machine Learning, pages 151–160, 2019.
  • Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist. Approximate modified policy iteration and its application to the game of tetris. Journal of Machine Learning Research, 16:1629–1676, 2015.
  • Hado van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.
  • Nino Vieillard, Bruno Scherrer, Olivier Pietquin, and Matthieu Geist. Momentum in reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 2529–2538, 2020.
  • Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, and Matthieu Geist. Leverage the average: an analysis of regularization in rl. arXiv preprint arXiv:2003.14089, 2020.
  • Martin L Puterman. Markov decision processes. Wiley, New York, 1994.
  • Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, pages 8572–8583, 2019.
  • Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum, 1993.
  • Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, volume 2, page 5, 2016.
  • Zhao Song, Ron Parr, and Lawrence Carin. Revisiting the softmax Bellman operator: New benefits and new perspective. In Proceedings of the 36th International Conference on Machine Learning, pages 5916–5925, 2019.
  • Leemon C Baird. Reinforcement learning through gradient descent. PhD thesis, Carnegie Mellon University, 1999.
  • Marc G Bellemare, Georg Ostrovski, Arthur Guez, Philip S Thomas, and Rémi Munos. Increasing the action gap: New operators for reinforcement learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • Amir-massoud Farahmand. Action-gap phenomenon in reinforcement learning. In Advances in Neural Information Processing Systems, pages 172–180, 2011.
  • Hado Philip van Hasselt. Insights in reinforcement learning. PhD thesis, Utrecht University, 2011.
  • Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
  • Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
Authors
Elena Smirnova